It seems that I would much rather talk about software design issues than write this research proposal. Well, I may as well get something useful done.
So far, I have described the design of what I have been calling the "MCE", or the "manuscript collation engine". It works pretty well at this point: when I run it on a set of transcribed text files, I get a bunch of arrays back, full of Word objects that are lined up neatly according to similarity and relative placement. Now I just have to use them. This is where I start speculating about what to do next.
I said at some point that I would talk about the structure of a Word object, but really there is little enough to tell. A Word is an object that will keep track of whatever I tell it to remember about a particular word: its matches, its variants, its original or normalized or canonicalized spelling, its punctuation, and whether it should be considered the "base" word, a "variant" word, or an "error".
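To make that concrete, here is a minimal sketch in Python of what such a record might look like. It is not the MCE's actual class: every field name and the `comparison_form` helper are assumptions made up for illustration, based only on the list of attributes above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity comparison, so Words that point at each other
# through `matches` never trigger a recursive field-by-field comparison.
@dataclass(eq=False)
class Word:
    """Hypothetical record of what is remembered about one word."""

    original: str                      # spelling as transcribed
    normalized: Optional[str] = None   # spelling after orthographic normalization
    canonical: Optional[str] = None    # editor-chosen canonical spelling
    punctuation: List[str] = field(default_factory=list)
    # Words in other manuscripts judged to be the same or a variant;
    # left out of repr() because these lists can be circular.
    matches: List["Word"] = field(default_factory=list, repr=False)
    variants: List["Word"] = field(default_factory=list, repr=False)
    # "base", "variant", "error", or None while undecided (an editing decision)
    status: Optional[str] = None

    def comparison_form(self) -> str:
        """Return the most processed spelling available."""
        return self.canonical or self.normalized or self.original
```

Leaving `status` empty by default reflects the point that some attributes are editing decisions and cannot be filled in at collation time.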
Of course, many of the attributes I might want "remembered" can't actually be detected at collation time. Some of them are editing decisions, and others need the judgment of a human (or a set of rules) that understands things about the Armenian language. It's high time I wrote the editing interface.
(Nomenclature will be the death of this project, incidentally. It's bad enough that the computer world and the critical-text-edition world use the word "collation" differently. Now I want to write a program that, in the terminology of the humanities, ought to be called a "text editor." Great.)
So. I start with a bunch of arrays of words, and the superficial relationships between them. The end result should be a base text, and an apparatus that records the set of variants that I have judged to be worth recording. Along the way, I should have to do as little work as possible. This means several things:
But those are just the easy things. Two aspects of the problem are particularly tricky. The first is punctuation. The punctuation in Armenian manuscripts is all over the place. Do I mostly disregard it? Do I treat punctuation marks as words in their own right? Do I show it all on a case-by-case basis, and thereby give myself more work?
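Two of those options can be sketched side by side: dropping punctuation, or promoting each mark to a token of its own. This is only an illustration under assumptions. The function name `tokenize` and the policy names are invented here, punctuation is detected through Unicode general categories rather than a hand-made list of Armenian marks, and only marks at the end of a whitespace-separated chunk are handled.

```python
import unicodedata
from typing import List


def is_punct(ch: str) -> bool:
    """True if Unicode classifies the character as punctuation (category P*)."""
    return unicodedata.category(ch).startswith("P")


def tokenize(line: str, policy: str = "ignore") -> List[str]:
    """Split a transcribed line into tokens under a punctuation policy.

    policy="ignore": trailing punctuation is dropped.
    policy="tokens": each trailing mark becomes a token in its own right.
    """
    if policy not in ("ignore", "tokens"):
        raise ValueError(f"unknown policy: {policy}")
    tokens: List[str] = []
    for chunk in line.split():
        # Peel punctuation marks off the end of the chunk.
        end = len(chunk)
        while end > 0 and is_punct(chunk[end - 1]):
            end -= 1
        word, trailing = chunk[:end], chunk[end:]
        if word:
            tokens.append(word)
        if policy == "tokens":
            tokens.extend(trailing)  # one token per mark
    return tokens
```

The third option, showing punctuation case by case, would amount to keeping the peeled-off marks attached to the word and deferring the decision to the editing stage.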
The second is the issue of partial-word readings. Remember that a "reading" is a "minimally distinctive unit of sense"; that means that a single word may contain multiple readings. Prefixes can have grammatical effects. For example:
The last is especially tricky, as it can be written either as the single word յաշխարհն or as two separate words, ի աշխարհն. If I am standardizing the orthography across manuscripts, I should separate the prefix յ, converting it to the preposition ի; I'll have to split the Word object, and align the resulting pair of Words with the Words in my other arrays. Alignment and word matching are problems I have already solved with the MCE, but this means that the editing program will have to call back into the MCE to re-align the words in question.
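The split itself might look something like the sketch below. It works on plain strings rather than real Word objects, and the names `split_prefix`, `split_in_row`, and the `realign` callback are all hypothetical; `realign` stands in for whatever entry point the MCE would expose for re-alignment. The function only performs the split where it is asked to, since not every word beginning with յ carries the preposition, and that call needs a human or a rule set.

```python
from typing import Callable, List, Optional, Sequence

PREFIX = "յ"       # the preposition written as a prefix
PREPOSITION = "ի"  # the same preposition written as a separate word


def split_prefix(form: str) -> List[str]:
    """Split a prefixed form into preposition plus remainder.

    Forms that do not start with the prefix, or that consist of the
    prefix alone, are returned unchanged as a one-element list.
    """
    if not form.startswith(PREFIX) or len(form) == len(PREFIX):
        return [form]
    return [PREPOSITION, form[len(PREFIX):]]


def split_in_row(
    row: Sequence[str],
    index: int,
    realign: Optional[Callable[[List[str]], None]] = None,
) -> List[str]:
    """Return a copy of one manuscript's row with row[index] split.

    `realign` is a hypothetical hook: the row is now one word longer
    than before, so it has to be aligned again with the other rows.
    """
    new_row = list(row[:index]) + split_prefix(row[index]) + list(row[index + 1:])
    if realign is not None:
        realign(new_row)
    return new_row
```

For example, `split_prefix("յաշխարհն")` returns the two-element list `["ի", "աշխարհն"]`, which is the pair of words that then needs to be lined up against the other manuscripts.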
As usual, I've launched into a whole raft of explanation, and not even asked for anyone's opinion on specific questions yet. Maybe tomorrow.