
aurum (8572)

http://www.eccentricity.org/

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Monday August 25, 2008
06:32 PM

moving on from collation

[ #37273 ]

It seems that I would much rather talk about software design issues than write this research proposal. Well, I may as well get something useful done.

So far, I have described the design of what I have been calling the "MCE", or the "manuscript collation engine". It works pretty well at this point, and when I run it on a bunch of transcribed text files, I get a bunch of arrays back, full of Word objects that are lined up neatly according to similarity and relative placement. Now I just have to use them. This is where I start speculating about what to do next.

I said at some point that I would talk about the structure of a Word object, but really there is little enough to tell. A Word is an object that will keep track of whatever I tell it to remember about a particular word: its matches, its variants, its original or normalized or canonicalized spelling, its punctuation, and whether it should be considered the "base" word, a "variant" word, or an "error".
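To make that concrete, here is a minimal sketch of what such a Word might look like. It is written in Python purely for illustration; it is not the MCE's actual code, and every name in it (`Word`, `Status`, `manuscript`, `comparison_form`) is an assumption rather than something taken from the real engine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Editorial status of a word (hypothetical names)."""
    UNDECIDED = "undecided"
    BASE = "base"
    VARIANT = "variant"
    ERROR = "error"


# eq=False keeps identity comparison, so two Words that list each other
# as matches can be compared without recursing forever.
@dataclass(eq=False)
class Word:
    original: str                      # spelling as transcribed
    manuscript: str                    # which manuscript the word came from
    normalized: Optional[str] = None   # spelling after normalization, if any
    canonical: Optional[str] = None    # canonicalized spelling, if any
    punctuation: str = ""              # punctuation attached to the word
    status: Status = Status.UNDECIDED  # usually set later, by the editor
    matches: List["Word"] = field(default_factory=list)
    variants: List["Word"] = field(default_factory=list)

    def comparison_form(self) -> str:
        """Most processed spelling available, falling back to the original."""
        return self.canonical or self.normalized or self.original
```

The status defaults to "undecided" in this sketch because, as the next paragraph says, much of what a Word should remember is not known at collation time.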

Of course, many of the attributes I might want "remembered" can't actually be detected at collation time. Some of them are editing decisions, and others need the judgment of a human (or a set of rules) that understands things about the Armenian language. It's high time I wrote the editing interface.

(Nomenclature will be the death of this project, incidentally. It's bad enough that the computer world and the critical-text-edition world use the word "collation" differently. Now I want to write a program that, in the terminology of the humanities, ought to be called a "text editor." Great.)

So. I start with a bunch of arrays of words, and the superficial relationships between them. The end result should be a base text, and an apparatus that records the set of variants that I have judged to be worth recording. Along the way, I should have to do as little work as possible. This means several things:

  • I need to remember which word goes with which manuscript.
  • I need a way of marking a word as "base", that is, the accepted main reading.
  • I need a hierarchical series of categories of "variant", including but not limited to:
    • Grammatically sensical differences
    • Apparent grammatical errors
    • Orthography variations
  • I need to be able to "smooth" strings of variant words into a single variant.
  • I need a means of "teaching" the program about my decisions, so that I am never asked more than once about an orthographic variation.
  • I need a way of saving the decisions I've made.
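Several of the items above (the category hierarchy, the "teaching", and the saving of decisions) can be sketched together. What follows is only one possible shape, in Python, and not a design I have settled on: the slash-separated category paths, the `VariantKind` and `DecisionMemory` names, and the JSON file format are all assumptions made for the example.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Dict, Optional


class VariantKind(Enum):
    """Variant categories; the path-like values encode the hierarchy."""
    SENSICAL = "variant/grammatical/sensical"
    GRAMMATICAL_ERROR = "variant/grammatical/error"
    ORTHOGRAPHIC = "variant/orthographic"

    def is_a(self, ancestor: str) -> bool:
        """True if this kind is the given category or falls underneath it."""
        return self.value == ancestor or self.value.startswith(ancestor + "/")


class DecisionMemory:
    """Remembers how a (base, reading) pair was classified."""

    def __init__(self) -> None:
        self._decisions: Dict[str, str] = {}

    @staticmethod
    def _key(base: str, reading: str) -> str:
        # A tab cannot occur inside a single word, so it is a safe separator.
        return f"{base}\t{reading}"

    def teach(self, base: str, reading: str, kind: VariantKind) -> None:
        self._decisions[self._key(base, reading)] = kind.value

    def recall(self, base: str, reading: str) -> Optional[VariantKind]:
        """Return the remembered kind, or None if the editor must be asked."""
        value = self._decisions.get(self._key(base, reading))
        return VariantKind(value) if value is not None else None

    def save(self, path: Path) -> None:
        text = json.dumps(self._decisions, ensure_ascii=False, indent=2)
        path.write_text(text, encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "DecisionMemory":
        memory = cls()
        memory._decisions = json.loads(path.read_text(encoding="utf-8"))
        return memory
```

The idea is that the editing program calls `recall` first and only prompts the human when it returns `None`, which is how a given orthographic variation gets asked about exactly once.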

But those are just the easy things. Two aspects of the problem are particularly tricky. The first is punctuation. The punctuation in Armenian manuscripts is all over the place. Do I mostly disregard it? Do I treat punctuation marks as words in their own right? Do I show it all on a case-by-case basis, and thereby give myself more work?

The second is the issue of partial-word readings. Remember that a "reading" is a "minimally distinctive unit of sense"; that means that a single word may contain multiple readings. Prefixes can have grammatical effects. For example:

  • աշխարհ (ashkharh): "land", but
  • աշխարհն (ashkharhn): "the land", and
  • յաշխարհն (yashkharhn): "into the land".

The last is especially tricky, as it can either be written as the single word յաշխարհն, or as two separate words ի աշխարհն. If I am standardizing the orthography across manuscripts, I should separate the prefix յ, converting it to the preposition ի; I'll have to split the Word object, and align the resulting pair of Words with the Words in my other arrays. The alignment and word matching is a problem I have already solved with the MCE, but this means that the editing program will have to call back into the MCE to re-align the words in question.
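The splitting step on its own is simple enough to sketch. This Python fragment works on plain strings rather than Word objects, the function names are made up for the example, and it deliberately stops short of re-alignment, since that part belongs to the MCE. It also assumes a human (or a rule set) has already confirmed that the leading յ really is the preposition and not just the first letter of the word.

```python
from typing import List

PREFIX = "յ"       # prefixed form of the preposition
PREPOSITION = "ի"  # the preposition written as a separate word


def split_prefixed(form: str) -> List[str]:
    """Split a confirmed prefixed form into preposition + remainder.

    Forms that do not start with the prefix, or that consist of the
    prefix alone, are returned unchanged as a one-element list.
    """
    if form.startswith(PREFIX) and len(form) > len(PREFIX):
        return [PREPOSITION, form[len(PREFIX):]]
    return [form]


def split_in_row(row: List[str], index: int) -> List[str]:
    """Return a copy of one manuscript's row with the word at index split.

    The row may grow by one entry, which is why the rows of the other
    manuscripts would then need to be re-aligned against it.
    """
    return row[:index] + split_prefixed(row[index]) + row[index + 1:]


if __name__ == "__main__":
    print(split_prefixed("յաշխարհն"))
    print(split_in_row(["աշխարհ", "յաշխարհն"], 1))
```

Returning a new row rather than modifying the old one keeps the original alignment around until the re-aligned version has been accepted.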

As usual, I've launched into a whole raft of explanation, and not even asked for anyone's opinion on specific questions yet. Maybe tomorrow.

  • I just read about your MCE, and if I had had something like that while doing my PhD, it would have been a godsend. Anyway, one thing I wanted to point out, and something that happens often in medieval Celtic Studies, is the situation where you have two texts that are titled the same or are very much alike but are not exactly the same text, even allowing for variant spellings. For instance, in the edition that I have done, one of the scribes moved two of the lines from the original elsewhere in the poem and filled

  • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

    The text I'm working on also has sentence-long (or paragraph-long, or in one case section-long) additions/deletions appearing in certain texts. (It also has word transpositions, which the MCE can detect, but which I haven't decided how to treat.) As far as th

    • I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

      Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

      I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?

      Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have d for t spelling change, among other changes, as Old and Middle Irish mingle on the page so word length may not help in this case. It would

      • Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

        Yes, I should have been more clear. The word is the smallest meaningful difference that the computer can easily detect. Armenian also has grammatical meaning in suffixes, and a few prefixes, but for now the human still has to review those.

        Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have d for t spelling change, among other changes, as Old and Middle Irish mingle on the page so word length may not help in this case. It would be easier to define the orthographic differences that may occur in variant spellings. So, a tree of variant spellings based on known change patterns would be of greater utility. I could be wrong and misunderstanding you, though.

        Again I was unclear; sorry about that. I was addressing large variants (e.g. your example of transplanted lines), so by "word length" I meant "number of words in variant" rather than "number of characters in word." So you might want to know if a contiguous set of, say,