Here I will describe the basic design outline, as it currently stands, for my manuscript text collation engine, a.k.a. the "Armenian Difference Engine." (Later I will ask you to argue with each other about a module name, but not now. Call it the MCE for now.) I welcome, indeed I solicit, feedback and opinions as to how I might do things more cleverly.
This is being posted without external proofreading, so if something isn't clear, please ask!
So what's the problem?
The editor of a historical text begins with some number of manuscript copies of that text. The goal of the exercise is to compare each of these manuscript texts against each other, note the variations among them, and choose the best "reading" out of each of the alternatives wherever a variation occurs. Until recently, this was done manually; what I am building is a program that will calculate the variations and only consult me when I need to exert editing control—that is, when I need to choose the best reading.
OK, sure. So how does your program work then?
Each manuscript text is passed in as a filehandle (or as a string), containing a bunch of words. For my purposes, the text I am passing in has punctuation and orthography variation as represented in the manuscript, but I have expanded the abbreviations. (Eventually I will accept TEI XML and parse that into a string of text; doing that will probably make my life easier writing this program in the same measure that it makes my life more difficult in transcribing and handling conversion to XML.)
The first step is to transform each "text" into words. Words are non-whitespace characters, separated by whitespace. (Yes that means that, right now, I only support scripts that demarcate words with whitespace.) Each word gets a representation as an MCE::Word object, which I will describe in my next post. Now I have several arrays of Words, where each array represents a manuscript text. In theory, I could create an MCE::Text object with information about the manuscript itself and a chain of words all linked together, but I haven't yet concluded that a simple array is fragile enough to justify the overhead of another OO package. I may later change my mind.
Now I have two or more arrays, probably all slightly different lengths. I pull out the string representations of each Word from the first two arrays, and pass them to Algorithm::Diff. Diff can return three answers for any given chunk of text:
The &align_and_match_words subroutine takes two (generally relatively short) arrays of Word objects, which may be varying lengths. It compares each word in one array to each word in the second array to find the "best" match. If, for example, you send representations of two strings to this subroutine:
This is a short sentence.
This is an even longer sentence with more words.
your result is:
0 1 2 3 4 5 6 7 8
This is a short NUL sentence NUL NUL NUL
This is an even longer sentence with more words.
(Note that this is an illustration only; in practice, these two strings would not be sent in toto to the subroutine, because Algorithm::Diff would only report a "changed" chunk for the substrings "a short" and "an even longer.")
The subroutine will detect a fuzzy match between "a" and "an" in column 2, and add the Word object "a" to the list of "fuzzymatch"es attached to the Word object "an". It will find no similarity between the words "short" and "even" in column 4, so will add the Word object for "even" to the list of variants attached to the Word object "short". It will pad the remaining empty spaces with an empty Word; the empty Word is never linked to anything. All "fuzzymatch" and "variant" linkage relations work from top to bottom; that is, given two texts, the first text always contains the links.
The top-to-bottom association of links becomes important when more than two texts are compared. To begin the next comparison, I call the &generate_base subroutine on the two texts whose comparison has just finished. This subroutine is fairly simple; it looks for the topmost non-NUL word in all the arrays it has passed. In our example above, the new base text generated would be
This is a short longer sentence with more words.
Semantically useless, but a good way to generate pegs on which to hang words. The word "a" has a fuzzymatch link to "an", and the word "short" has a variant link to "even". All the identical words that occur in the same column are also linked. This newly generated "base" becomes the topmost text in the next comparison.
At the end of the run, then, we have an array of Words to represent each of the manuscript texts we passed in. The arrays are padded with empty (NUL) words where necessary, so that all the arrays are the same length, and all their same / similar words are aligned in the same row. If the user calls &generate_base on the complete set of result arrays, he/she will have an array of non-NUL words, each of which contain the appropriate links to non-NUL words in the manuscripts that come after it. And then the real work can start.
In the next few posts, I will say more about the concept of a "variant", talk about the structure of Word objects, and discuss the as-yet unsolved problem of punctuation.
(I did promise a more cheerful post, and at the moment, my alternative to writing this is to read more articles about the 15th century eastern Mediterranean, and try to come up with a research proposal, whose success or failure will determine whether I get to continue to have an academic career. No pressure or anything.)
So YAPC this year was a blast, just like last year. I spent much of Day 1 vaguely out of sorts as I generally do ("who do I want to say hello to and haven't found yet? What am I missing? What are the cool kids doing that no one has told me about?" etc.) but that had thankfully passed by the end of the day. Day 2 was overshadowed by rising worry about my presentation ("will I remember what I want to say? Will I run embarrassingly over? Will it make any sense?") which subsided for the duration of the conference dinner and came back in force on the morning of day 3. The presentation went well though; I got a lot of compliments and a few good questions, and was mostly unable to eat my lunch due to being kept talking about various things. I'll take that last as a sign of success.
What I will probably do here in this journal, over the next little while, is go over the data model and algorithm model of my collation engine—basically all the technical bits of my project that were left out of my presentation for being heavy on explanation and light on humor. Some of you will probably have better ideas than I would about the various data models, and many of you will have opinions about DB design (since I haven't yet implemented data persistence.) Lots of you will be way better at algorithms than I am. With any luck I can even get a few of you to volunteer opinions on user interfaces, which is the next best thing to releasing the core module on CPAN and letting someone else write a UI.
The "Armenian studies" version of this presentation will be on 11 September in Paris, so I'm hoping to make a little more progress on the code before then, though I'll still need time to make the presentation a little more palatable to a very different audience. (Really sorry, domm, but the yak will have to go. They just wouldn't get it.)
So it turns out that Nicholas keeps posting things here to which I want to comment; thus, I finally had to create an account here. Hello, world.
I've just come back from a trip that included visits to perl mongers in Vienna and Amsterdam; I blogged things about that trip elsewhere.
The real reason I'm posting, however, is to talk about (natural) languages, which has become something of a habit for me. In this post, Nicholas was looking for a Latinate word for the (bad, evil, wrong, no one should do it ever, etc.) act of killing a kitten. Now many people know that a cat is a Felis domesticus, so cat-killing would be "felicide". Damian pointed out, and I concur, that the obvious word for kitten would be "felis" (or "feles", the proper nominative spelling) with a diminutive suffix, which gives "feliculus". All this would give us "feliculicide", which works, but is kind of an unwieldy mouthful.
But it turns out that there is another word in Latin that can mean cat: coincidentally enough, that word is "catus" or "cattus". This word (along with its Greek cousin "κάττα") has a direct (though not necessarily causative—the source for this word is very hard to pin down) relationship to the German "Katze" and the English "cat"; Cassell's English -> Latin section actually gives "catulus" for "kitten". On the other hand, Lewis & Short think that "cattus" is a word for an unknown sort of animal, and that although "catulus" can be the young of any animal, it especially refers to puppies. But "catulicide" is easier for English-speakers to say that "feliculicide", with its uncomfortable extra syllable.
Really I think this is just an example of Latin as an evolving language. Lewis & Short aren't particularly keen on admitting as proper Latin anything that wasn't attested before the end of the Roman Republic, or thereabouts. I think it rather likely, though I couldn't tell everything I needed to from the Thesaurus Linguae Latinae, that "cattus" was the usual medieval word for "cat", and that "catulicide" (the word, not the act itself, mind) will serve just fine.