Here I will describe the basic design outline, as it currently stands, for my manuscript text collation engine, a.k.a. the "Armenian Difference Engine." (Later I will ask you to argue with each other about a module name, but not now. Call it the MCE for now.) I welcome, indeed I solicit, feedback and opinions as to how I might do things more cleverly.
This is being posted without external proofreading, so if something isn't clear, please ask!
So what's the problem?
The editor of a historical text begins with some number of manuscript copies of that text. The goal of the exercise is to compare each of these manuscript texts against each other, note the variations among them, and choose the best "reading" out of each of the alternatives wherever a variation occurs. Until recently, this was done manually; what I am building is a program that will calculate the variations and only consult me when I need to exert editing control—that is, when I need to choose the best reading.
OK, sure. So how does your program work then?
Each manuscript text is passed in as a filehandle (or as a string), containing a bunch of words. For my purposes, the text I am passing in has punctuation and orthography variation as represented in the manuscript, but I have expanded the abbreviations. (Eventually I will accept TEI XML and parse that into a string of text; doing that will probably make my life easier writing this program in the same measure that it makes my life more difficult in transcribing and handling conversion to XML.)
The first step is to transform each "text" into words. Words are non-whitespace characters, separated by whitespace. (Yes that means that, right now, I only support scripts that demarcate words with whitespace.) Each word gets a representation as an MCE::Word object, which I will describe in my next post. Now I have several arrays of Words, where each array represents a manuscript text. In theory, I could create an MCE::Text object with information about the manuscript itself and a chain of words all linked together, but I haven't yet concluded that a simple array is fragile enough to justify the overhead of another OO package. I may later change my mind.
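The engine itself is written in Perl, but the word-splitting step is simple enough to sketch in Python. The `Word` class below is a minimal stand-in for the MCE::Word object described in the next post; its `fuzzymatches` and `variants` attributes are my guesses, based only on the link types discussed later in this post.

```python
class Word:
    """Minimal stand-in for an MCE::Word object (illustrative only)."""
    def __init__(self, text):
        self.text = text          # the word as it appears in the manuscript
        self.fuzzymatches = []    # near-identical readings in later texts
        self.variants = []        # divergent readings in later texts

    def __repr__(self):
        return f"Word({self.text!r})"

def tokenize(text):
    """Split a manuscript text into Words: runs of non-whitespace
    characters, separated by whitespace."""
    return [Word(w) for w in text.split()]

words = tokenize("This is a short sentence.")
print([w.text for w in words])
# ['This', 'is', 'a', 'short', 'sentence.']
```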
Now I have two or more arrays, probably all of slightly different lengths. I pull out the string representations of each Word from the first two arrays, and pass them to Algorithm::Diff. Diff can return three answers for any given chunk of text: the chunk is identical in both texts; it appears in one text but not the other; or it has "changed" from one text to the other.
The &align_and_match_words subroutine takes two (generally relatively short) arrays of Word objects, which may be of varying lengths. It compares each word in one array to each word in the second array to find the "best" match. If, for example, you send representations of two strings to this subroutine:
This is a short sentence.
This is an even longer sentence with more words.
your result is:
0     1   2   3      4       5         6     7     8
This  is  a   short  NUL     sentence  NUL   NUL   NUL
This  is  an  even   longer  sentence  with  more  words.
(Note that this is an illustration only; in practice, these two strings would not be sent in toto to the subroutine, because Algorithm::Diff would only report a "changed" chunk for the substrings "a short" and "an even longer.")
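To illustrate that chunking: the real engine uses Perl's Algorithm::Diff, but Python's difflib.SequenceMatcher plays an analogous role and makes the idea easy to demonstrate. Since punctuation is still an open problem in this design, the sketch simply strips it for comparison purposes so that "sentence." matches "sentence"; that normalization is my assumption, not part of the design.

```python
import string
from difflib import SequenceMatcher

def norm(word):
    # Punctuation handling is unsolved in the design; this sketch
    # strips it for comparison only.  (Assumption.)
    return word.strip(string.punctuation)

a = "This is a short sentence.".split()
b = "This is an even longer sentence with more words.".split()

# Each opcode is (tag, i1, i2, j1, j2): 'equal' chunks match in both
# texts; the others are the chunks handed on to the word aligner.
matcher = SequenceMatcher(None, [norm(w) for w in a], [norm(w) for w in b])
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, a[i1:i2], b[j1:j2])
# equal ['This', 'is'] ['This', 'is']
# replace ['a', 'short'] ['an', 'even', 'longer']
# equal ['sentence.'] ['sentence']
# insert [] ['with', 'more', 'words.']
```

Note that only the "a short" / "an even longer" chunk would actually be sent to &align_and_match_words, exactly as described above.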
The subroutine will detect a fuzzy match between "a" and "an" in column 2, and add the Word object for "an" to the list of "fuzzymatch"es attached to the Word object for "a". It will find no similarity between the words "short" and "even" in column 3, and so will add the Word object for "even" to the list of variants attached to the Word object for "short". It will pad the remaining empty spaces with an empty Word; the empty Word is never linked to anything. All "fuzzymatch" and "variant" linkage relations work from top to bottom; that is, given two texts, the first text always contains the links.
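The post doesn't spell out the fuzzy-match metric or the search strategy, so the sketch below is deliberately naive: it aligns the two chunks position by position, pads the shorter one with NULs, and uses a difflib similarity ratio with an arbitrary threshold as a stand-in for whatever "fuzzy" test the real &align_and_match_words performs.

```python
from difflib import SequenceMatcher

NUL = ""   # stand-in for the empty Word used as padding

def fuzzy(a, b, threshold=0.6):
    # The design doesn't name its similarity measure; a difflib
    # ratio with an arbitrary threshold stands in.  (Assumption.)
    return SequenceMatcher(None, a, b).ratio() >= threshold

def align_and_match_words(top, bottom):
    """Naive positional sketch: pad the shorter chunk with NULs,
    then classify each column as a fuzzymatch or a variant.
    Links always hang off the top (first) text."""
    width = max(len(top), len(bottom))
    top = top + [NUL] * (width - len(top))
    bottom = bottom + [NUL] * (width - len(bottom))
    links = []
    for t, b in zip(top, bottom):
        if NUL in (t, b):
            links.append(None)              # the empty Word is never linked
        elif fuzzy(t, b):
            links.append(("fuzzymatch", t, b))
        else:
            links.append(("variant", t, b))
    return top, bottom, links

top, bottom, links = align_and_match_words(["a", "short"], ["an", "even", "longer"])
print(links)
# [('fuzzymatch', 'a', 'an'), ('variant', 'short', 'even'), None]
```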
The top-to-bottom association of links becomes important when more than two texts are compared. To begin the next comparison, I call the &generate_base subroutine on the two texts whose comparison has just finished. This subroutine is fairly simple: for each column, it takes the topmost non-NUL word across the arrays it is passed. In our example above, the new base text generated would be
This is a short longer sentence with more words.
Semantically useless, but a good way to generate pegs on which to hang words. The word "a" has a fuzzymatch link to "an", and the word "short" has a variant link to "even". All the identical words that occur in the same column are also linked. This newly generated "base" becomes the topmost text in the next comparison.
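A minimal Python sketch of the column-wise "topmost non-NUL word" rule, run on the aligned rows from the example above (again, the real subroutine is Perl; this is illustration only):

```python
NUL = ""   # stand-in for the empty Word used as padding

def generate_base(texts):
    """Sketch of &generate_base: for each column of the aligned
    texts, keep the topmost non-NUL word."""
    base = []
    for column in zip(*texts):
        base.append(next((w for w in column if w != NUL), NUL))
    return base

row_a = ["This", "is", "a", "short", NUL, "sentence", NUL, NUL, NUL]
row_b = ["This", "is", "an", "even", "longer", "sentence", "with", "more", "words."]
print(" ".join(generate_base([row_a, row_b])))
# This is a short longer sentence with more words.
```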
At the end of the run, then, we have an array of Words to represent each of the manuscript texts we passed in. The arrays are padded with empty (NUL) words where necessary, so that all the arrays are the same length, and all their same / similar words are aligned in the same row. If the user calls &generate_base on the complete set of result arrays, they will have an array of non-NUL words, each of which contains the appropriate links to non-NUL words in the manuscripts that come after it. And then the real work can start.
In the next few posts, I will say more about the concept of a "variant", talk about the structure of Word objects, and discuss the as-yet unsolved problem of punctuation.