  • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

    The text I'm working on also has sentence-long (or paragraph-long, or in one case section-long) additions/deletions appearing in certain texts. (It also has word transpositions, which the MCE can detect, but which I haven't decided how to treat.) As far as th

    • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

      Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

      I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?

      Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have spelling changes such as d for t, among others, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would

      • Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

        Yes, I should have been more clear. The word is the smallest meaningful difference that the computer can easily detect. Armenian also has grammatical meaning in suffixes, and a few prefixes, but for now the human still has to review those.
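        To illustrate what "word-level" means here, a minimal sketch in Python follows. It is not the MCE's code; the function name `word_variants` and the sample sentences are made up for the example, and it leans on the standard library's `difflib` rather than any collation-specific algorithm.

```python
from difflib import SequenceMatcher


def word_variants(base: str, witness: str):
    """Return (operation, base_words, witness_words) for each word-level difference."""
    a, b = base.split(), witness.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    for variant in word_variants("the king went to the city",
                                 "the king came to the great city"):
        print(variant)
```

        Anything below the word (a mutated initial, a prefix, a case ending) shows up only as one whole word replaced by another, which is why the human still has to review those.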

        Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have spelling changes such as d for t, among others, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would be easier to define the orthographic differences that may occur in variant spellings. So, a tree of variant spellings based on known change patterns would be of greater utility. I could be wrong and misunderstanding you, though.

        Again I was unclear; sorry about that. I was addressing large variants (e.g. your example of transplanted lines), so by "word length" I meant "number of words in variant" rather than "number of characters in word." So you might want to know if a contiguous set of, say, five words appears elsewhere in the text.
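        A rough sketch of that kind of lookup, in Python, assuming plain whitespace tokenization and exact matching. The name `repeated_passages` and the `min_words` parameter are hypothetical, not part of any existing module.

```python
from collections import defaultdict


def repeated_passages(text: str, min_words: int = 5):
    """Map every run of `min_words` contiguous words that occurs more than
    once to the list of word offsets at which it starts."""
    words = text.lower().split()
    positions = defaultdict(list)
    for i in range(len(words) - min_words + 1):
        positions[tuple(words[i:i + min_words])].append(i)
    # Keep only the runs seen at two or more places.
    return {run: starts for run, starts in positions.items() if len(starts) > 1}


if __name__ == "__main__":
    sample = ("in the days of the king there was peace and "
              "in the days of the king there was war")
    for run, starts in repeated_passages(sample, min_words=5).items():
        print(" ".join(run), starts)
```

        Raising `min_words` is what keeps the noise down: at 1 every "and" would be reported, while at 5 only real repeated phrases survive. Exact matching is a simplification; a transplanted line with spelling variants in it would need a fuzzier comparison per word.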

        Regarding spelling variations though, that's more or less what I've done. Have a look at Text::WagnerFischer::Armenian sometime. It's a module for calculating word edit distances for Armenian words; one thing I defined therein was acceptable orthographic variations. I also defined a few of the single-letter prefixes and suffixes that affect the grammatical meaning of the word. What I haven't found a way to do is handle case endings properly. The Wagner-Fischer algorithm only compares words letter by letter, and is not good at all at finding multi-letter suffixes.
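        For readers who don't know the algorithm, here is a small Python sketch of Wagner-Fischer edit distance with cheaper substitutions for letter pairs declared to be spelling variants. It does not reproduce the interface or the cost tables of Text::WagnerFischer::Armenian; the function name, the `cheap_pairs` parameter, the 0.5 cost, and the d/t table are all assumptions for illustration, and the test strings are invented, not real words.

```python
def edit_distance(a: str, b: str, cheap_pairs=frozenset(), cheap_cost: float = 0.5) -> float:
    """Wagner-Fischer edit distance in which substituting a letter pair
    listed in `cheap_pairs` costs `cheap_cost` instead of 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif frozenset((ca, cb)) in cheap_pairs:
                sub = cheap_cost
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,         # delete ca
                            curr[j - 1] + 1.0,     # insert cb
                            prev[j - 1] + sub))    # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical variant table: treat d/t as a known spelling alternation.
VARIANTS = frozenset({frozenset("dt")})

if __name__ == "__main__":
    print(edit_distance("abt", "abd"))            # plain substitution
    print(edit_distance("abt", "abd", VARIANTS))  # counted as a variant spelling
    print(edit_distance("stem", "stemness"))      # a four-letter ending
```

        The last call shows the limitation: a four-letter ending is scored as four separate insertions, so the algorithm has no notion that the letters belong together as one suffix.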

        Please do continue to raise these issues! It is important that I generalize as far as possible in the core modules, and not consider Armenian alone.