All the Perl that's Practical to Extract and Report

  • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

    The text I'm working on also has sentence-long (or paragraph-long, or in one case section-long) additions/deletions appearing in certain texts. (It also has word transpositions, which the MCE can detect, but which I haven't decided how to treat.) As far as the MCE is concerned, a sentence-long addition in one manuscript counts as a bunch of NULs in all the other manuscripts. When I said in this post that "I need a way of smoothing chains of variant words into a single variant", this is what I was talking about.
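
    A minimal sketch of what that smoothing step might look like, in Python purely for illustration. The data layout (one aligned token list per manuscript, with None standing in for a NUL) and the name smooth_variants are assumptions made for this example, not the MCE's actual output format.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: each witness is a list of aligned tokens,
# with None standing in for a NUL (gap) at that position.
Alignment = Dict[str, List[Optional[str]]]


def smooth_variants(aligned: Alignment) -> List[Tuple[int, int, Dict[str, str]]]:
    """Merge runs of adjacent variant columns into single variants.

    Returns (start, end, readings) tuples, where end is exclusive and
    readings maps each witness name to its joined text for that span.
    """
    names = list(aligned)
    width = len(aligned[names[0]])
    variants = []
    start = None
    # Run one step past the last column so a trailing run gets closed.
    for col in range(width + 1):
        # A column is a variant if the witnesses do not all agree on it.
        differs = col < width and len({aligned[n][col] for n in names}) > 1
        if differs and start is None:
            start = col
        elif not differs and start is not None:
            readings = {
                n: " ".join(t for t in aligned[n][start:col] if t is not None)
                for n in names
            }
            variants.append((start, col, readings))
            start = None
    return variants


# Invented sample data: witness B carries a three-word addition.
aligned = {
    "A": ["the", "king", None, None, None, "rode", "north"],
    "B": ["the", "king", "of", "great", "renown", "rode", "south"],
}
for start, end, readings in smooth_variants(aligned):
    print(start, end, readings)
```

    Here the three NULs in witness A collapse into one variant with an empty reading, instead of three separate word-level differences.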

    In your case, with a couple of verses that have been transplanted in a poem, the MCE would detect and record the differences on the word level. It is the job of the editing program to put those words back together into well-defined variants, and that's the part whose design I'm trying to hammer out now.

    To be honest, I think I could get the editing program to tell you two things, separately:
    - "text 3 has the two lines $foo / $bar where all the others have $baz / $quux";
    - "text 3 has the two lines $random / $other where all the others have $foo / $bar".

    It won't automatically make the connection between "$foo / $bar" in the two different spots of the text; for now, I'm leaving that to the human editor.

    I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?
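
    One way such a lookup could be sketched, in Python for illustration only. It reads the minimum length as a number of words in the variant, and the function name find_echoes, the min_words default, and the similarity threshold are all assumptions made for this example rather than part of any existing program.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_echoes(variant: List[str], text: List[str],
                min_words: int = 4,
                threshold: float = 0.9) -> List[Tuple[int, float]]:
    """Return (offset, score) for windows of text that resemble variant.

    Variants shorter than min_words are skipped to avoid noise, so a
    lone "and" never triggers a search.
    """
    n = len(variant)
    if n < min_words:
        return []
    hits = []
    # Slide a window of the variant's length across the text.
    for offset in range(len(text) - n + 1):
        window = text[offset:offset + n]
        # SequenceMatcher compares the two word lists, not characters.
        score = SequenceMatcher(None, variant, window).ratio()
        if score >= threshold:
            hits.append((offset, score))
    return hits


# Invented sample data.
text = "the wind fell and the sea was calm at dawn".split()
variant = "and the sea was calm".split()
print(find_echoes(variant, text))
print(find_echoes(["and"], text))
```

    The hits are only candidates; deciding whether they point to a real transplantation stays with the human editor.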

    • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

      Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

      I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?

      Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have the d-for-t spelling change, among other changes, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would

      • Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

        Yes, I should have been more clear. The word is the smallest meaningful difference that the computer can easily detect. Armenian also has grammatical meaning in suffixes, and a few prefixes, but for now the human still has to review those.

        Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have the d-for-t spelling change, among other changes, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would be easier to define the orthographic differences that may occur in variant spellings. So, a tree of variant spellings based on known change patterns would be of greater utility. I could be wrong and misunderstanding you, though.
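
        The idea of matching spellings through known change patterns can be sketched roughly as follows, in Python for illustration only. This is a flat grouping by normalized key rather than the tree suggested above. Only the d/t interchange comes from the discussion; the names RULES, spelling_key and group_spellings are hypothetical, and the sample tokens are placeholders, not attested forms.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Only the d/t interchange comes from the discussion; a real rule set
# would be supplied by the editor. Each pair maps a variant spelling
# onto the form used as the comparison key.
RULES: List[Tuple[str, str]] = [("d", "t")]


def spelling_key(word: str, rules: Iterable[Tuple[str, str]] = RULES) -> str:
    """Collapse known orthographic alternations so variants share a key."""
    key = word.lower()
    for variant, canonical in rules:
        key = key.replace(variant, canonical)
    return key


def group_spellings(words: Iterable[str]) -> Dict[str, List[str]]:
    """Group words whose spellings differ only by the known alternations."""
    groups = defaultdict(list)
    for word in words:
        groups[spelling_key(word)].append(word)
    return dict(groups)


# Placeholder tokens for illustration, not attested spellings.
print(group_spellings(["abat", "abad", "cenn"]))
```

        A blanket replacement like this would also merge words that genuinely differ in d and t, so the groups are candidates for review rather than confirmed spelling variants.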

        Again I was unclear; sorry about that. I was addressing large variants (e.g. your example of transplanted lines), so by "word length" I meant "number of words in variant" rather than "number of characters in word." So you might want to know if a contiguous set of, say,