It's not every day that the Perl community gets acknowledged in a humanities thesis. Just thought I would make sure it was seen.
(The thesis itself will be made available after it's been examined and corrected.)
It's been a while since I've given any sort of status update on my collation project. I've spent most of the past few weeks writing the "conventional" half of my thesis, in which I have to prove that I can talk intelligently about medieval Armenian literature without hiding behind source code.
I have made some progress though. As of a week or two ago, I re-tooled my collation engine to work with plain-text input, trivial TEI input, and TEI input in which each word is marked up with the <w> tag. That last is important, because it means I no longer have to assume that words are whitespace-separated. Now, as long as you provide semantic markup to define "what is a word?", and you provide a canonization function for your script if necessary, the collation engine should be able to handle any text in any script at all.
(The canonization function is meant to, well, canonize the orthographic variants within a script so that the collator will trivially recognize them as the same word. So for Armenian, it means that the letter օ is the same as աւ, and the ligature և is the same as the two letters 'ե'+'ւ', and a few other things. Since I don't want to learn the rules for all human languages, I just leave a place for the user to provide a coderef to do this.)
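Such a coderef might look like this sketch, handling just the two equivalences mentioned above (the real list for Armenian is longer):

```perl
use utf8;
use strict;
use warnings;

# Sketch of a user-supplied canonization coderef for Armenian.  Only the
# two equivalences mentioned in the text are handled here.
my $canonize = sub {
    my $word = shift;
    $word =~ s/և/եւ/g;    # the ligature և is the two letters ե + ւ
    $word =~ s/օ/աւ/g;    # the letter օ is an alternate spelling of աւ
    return $word;
};

# Two orthographic variants now collate as the same word.
print $canonize->('և') eq $canonize->('եւ') ? "same\n" : "different\n";
# prints "same"
```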
As long as I was re-tooling my code, I also took the opportunity to try this "test-driven development" thing that seems to be all the best-practices rage at the moment. It certainly works to some extent—I have plenty of tests now, and find it very easy to run them every time I change some code—but as the project gets more complex, I'm finding it harder to have the patience to nail down the design and write the tests before I just plunge into the code.
Finally, as a reward for reading this far, I give you a TEI encoding (with commentary; watch carefully) of Bob Dylan's "Subterranean Homesick Blues". Well worth watching.
I have been in Paris this week, at the conference of the Association Internationale des Études Arméniennes. I gave another version of the talk I gave at YAPC. Since this conference was more serious, my slides are somewhat more useful as standalone information, and so I've put them online. Enjoy.
(Yes I am caught up in the Eurostar mess. I don't know yet how I'm getting back to the UK. I'll find out tomorrow.)
Today I released the first small piece of the Collation Project. (Yes, I have another research proposal I ought to be writing. Yes, I spent hours today writing documentation and formalizing tests. What's your point?)
This piece addresses the problem of efficient manuscript transcription. It is my weird idea of a markup language for TEI XML. As an added bonus for people who aren't me, it exports a function to take an existing TEI XML file (well, string), parse it, wrap all the whitespace-separated words in <w/> ("word") tags, and return the new file. Identifying the words is, after all, step one in efficient word collation.
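The released code does this with a proper XML parser, but the transformation itself can be illustrated with a toy regex sketch that only handles flat <p> elements:

```perl
use strict;
use warnings;

# Toy sketch of the word-wrapping step: every whitespace-separated token
# inside a <p> element gets a <w> tag.  This is only an illustration of
# the transformation; real TEI needs a real parser, not a regex.
sub wrap_words {
    my $tei = shift;
    $tei =~ s{<p>(.*?)</p>}{
        my $inner = $1;
        $inner =~ s{(\S+)}{<w>$1</w>}g;
        "<p>$inner</p>";
    }ge;
    return $tei;
}

print wrap_words('<p>In principio erat</p>'), "\n";
# prints "<p><w>In</w> <w>principio</w> <w>erat</w></p>"
```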
This also means that my collator should be able to handle pretty much any language or writing system, as long as the basic unit of meaning that ought to be collated is enclosed within a <w/> tag. When it's done, of course.
This also means that I am going to need a module name for the collator soon. Suggestions?
Those of you who were at my talk at YAPC might remember my mini-rant against XML. It's annoying to parse; the parsing libraries in Perl are among the more poorly-documented modules I've encountered; it seems in general to be one of those solutions that is over-engineered for any problem I encounter.
Well, last Thursday I spoke to a few guys from the Oxford Text Archive. The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.
Just...ponder that for a few minutes.
Here all this time I've thought of XML as, well, a "Markup Language". It has its uses, but basically I get uncomfortable with XML at the point where it stops being easily human-readable. It was, to say the least, odd to find a set of people who think of data as the basic building blocks of everything, XML as a way to express those building blocks, and XSLT as a way to manipulate them in whatever way they need. It's like object orientation taken to its most frightening extreme.
So it turns out that the XML spec in question—the TEI guidelines—was thought up by a bunch of people who have taken a lot of feedback from scholars who work with texts of all kinds. There are chapters that could use more revision, sure, but basically the TEI XML spec has been informed by a bunch of people who are dealing with the problems I face and a lot more problems besides. As XML goes, it's a spec that's expressive enough for pretty much everything I might hope to encode about the text.
As it happens, I appreciated that fact already. I'd noticed that the TEI gave me a bunch of things to think about when transcribing a manuscript (abbreviations? marginal notes? catchwords? smaller-scale word corrections? abbreviation characters that appear in the manuscripts but aren't yet in Unicode? It's all there!) that I otherwise would have glossed over or interpreted without transcribing. But I was still thinking of it as a markup language—a standardized way of encoding information that might be useful to someone, someday, but not necessarily relevant to reading the words in the text and deriving enough meaning to compare it to other texts. Useful, to some extent, but not useful enough for my immediate problem (comparing the texts, which can reasonably be done word by word, without any meta-information) for me to bother with very deeply.
Meanwhile, a problem I have talked around in these blog posts but not addressed head on is that of data representation and storage. I have the information available in each manuscript; the problem I have not solved yet is "How do I represent that data? More importantly, how do I represent the decisions I make about the significance of that data?" It turns out that, not only can this be done within the TEI spec, but the spec allows for quite a lot of information (e.g. word divisions, morphological analysis—the ability to distinguish grammatically significant variations of words) that I've been looking for my own way to encode.
The upshot is, TEI XML makes it very easy and straightforward (well, for some definitions of "easy" and "straightforward"; I'll come back to this, probably in the next post) to mark and represent words, prefixes, suffixes, sectional divisions, marginal notes, and all sorts of stuff that may or may not prove to be significant. All I have to do is parse this information as it is given, rather than making heuristic guesses about how to derive it. I currently feed plaintext strings to my collator; there's no reason I can't feed regularized words based on the XML transcription.
Not only does TEI handle manuscript description; it also handles representation of critical editions. As I may have explained before, a critical edition generally presents a base text and an "apparatus", i.e. a specially-formatted block of footnotes, that contains the variations present in all the manuscripts of the text. From a data-representation point of view, the important thing here is that each word can be composed of a "lemma"—the base word—and its different "readings". Viewed that way, even the lemma is optional. A word can be composed of nothing but its variant readings.
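In TEI terms, one "row" of collation output is an apparatus entry: an app element containing rdg (reading) elements, with lem (lemma) optional. A sketch, with witness sigla of my own invention:

```xml
<!-- no editorial decision yet: readings only, no lemma -->
<app>
  <rdg wit="#A #B #H">զշարագրական</rdg>
  <rdg wit="#C">ի ժամանակական</rdg>
</app>

<!-- after editing: one reading promoted to the lemma -->
<app>
  <lem wit="#A #B #H">զշարագրական</lem>
  <rdg wit="#C">ի ժամանակական</rdg>
</app>
```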
And this is the first, easiest, thing my collator gives me. I make each "row" in my collation output into a list of readings, and write it out according to the TEI spec. When I'm ready to start editing, my program can read that file, present the options to me whenever there's more than one reading, and save my editing decisions back into the XML file. Then I can use pre-existing XSLT files to translate that result into LaTeX and printed text. This is particularly good, because as far as I'm concerned the only "good" form of XSLT is "XSLT that someone else has written and tested."
In short, other people have already thought about this problem, and I can use the fruits of their labor with only a very small abuse of their intentions. The only real cost is having to bash my head against libxml2.
I now have a script which produces output that looks like this. Each capital letter represents a manuscript. (OK, so in real life the words are lined up in columns, but I can't make use.perl play nicely with Unicode characters inside an <ecode> tag, which is the one that would preserve spacing.)
Word variation! Context:
մինչեւ ցայս վայրս բազմաջան եւ եւ աշխատաւոր քննութեամբ գտեալ գրեցաք >> ի զշարագրական գրեալս զհարիւրից ամաց, զորս ի << բազում ժամանակաց հետա հետաքննեալ հասու եղաք։ ընդ այնքանեաց տեսողացն եւ
Base ի զշարագրական գրեալս զհարիւրից ամաց, զորս ի
ABH: զշարագրական գրեալս զհարիւրից ամաց զորս ի
G: զշարագրական գրեալսն հարիւրից ամաց զորս ի
C: ի ժամանակական գրեալս հարիւրից ամացն զորս
J: զշարագրական գրեալս զճից ամաց զոր ի
DFI: զշարագրական գրեալս զճից ամաց զորս ի
E: զշարագրական գրեալս զճ ամաց զորս ի
Of course it doesn't take any input yet. One thing at a time.
Hey there's a thought.
Maybe I should flesh out some more design of this beast I'm writing, and then organize a hackathon.
It seems that I would much rather talk about software design issues than write this research proposal. Well, I may as well get something useful done.
So far, I have described the design of what I have been calling the "MCE", or the "manuscript collation engine". It works pretty well at this point, and when I run it on a bunch of transcribed text files, I get a bunch of arrays back, full of Word objects that are lined up neatly according to similarity and relative placement. Now I just have to use them. This is where I start speculating about what to do next.
I said at some point that I would talk about the structure of a Word object, but really there is little enough to tell. A Word is an object that will keep track of whatever I tell it to remember about a particular word—its matches, its variants, its original or normalized or canonicalized spelling, its punctuation, whether it should be considered the "base" word or a "variant" word or an "error".
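A minimal sketch of such an object, with attribute names that are illustrative rather than the real API:

```perl
use utf8;
use strict;
use warnings;

package Word;

# Sketch of the Word object described above.  The real object tracks
# whatever the collation needs; these attributes are illustrative.
sub new {
    my ( $class, %args ) = @_;
    return bless {
        original  => $args{original},                      # as transcribed
        canonical => $args{canonical} || $args{original},  # after canonization
        is_base   => 0,     # promoted to the base text?
        matches   => [],    # matching words in other witnesses
        variants  => [],    # non-matching words in the same position
    }, $class;
}

sub original    { $_[0]{original} }
sub add_variant { push @{ $_[0]{variants} }, $_[1] }
sub variants    { @{ $_[0]{variants} } }

package main;

my $w = Word->new( original => 'հայոց' );
$w->add_variant( Word->new( original => 'հա՛ոց' ) );
print scalar( $w->variants ), "\n";    # prints "1"
```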
Of course, many of the attributes I might want "remembered" can't actually be detected at collation time. Some of them are editing decisions, and others need the judgment of a human (or a set of rules) that understands things about the Armenian language. It's high time I wrote the editing interface.
(Nomenclature will be the death of this project, incidentally. It's bad enough that the computer world and the critical-text-edition world use the word "collation" differently. Now I want to write a program that, in the terminology of the humanities, ought to be called a "text editor." Great.)
So. I start with a bunch of arrays of words, and the superficial relationships between them. The end result should be a base text, and an apparatus that records the set of variants that I have judged to be worth recording. Along the way, I should have to do as little work as possible. This means several things:
But those are just the easy things. Two aspects of the problem are particularly tricky. The first is punctuation. The punctuation in Armenian manuscripts is all over the place. Do I mostly disregard it? Do I treat punctuation marks as words in their own right? Do I show it all on a case-by-case basis, and thereby give myself more work?
The second is the issue of partial-word readings. Remember that a "reading" is a "minimally distinctive unit of sense"; that means that a single word may contain multiple readings. Prefixes can have grammatical effects. For example:
The last is especially tricky, as it can either be written as the single word յաշխարհն, or as two separate words ի աշխարհն. If I am standardizing the orthography across manuscripts, I should separate the prefix յ, converting it to the preposition ի; I'll have to split the Word object, and align the resulting pair of Words with the Words in my other arrays. The alignment and word matching is a problem I have already solved with the MCE, but this means that the editing program will have to call back into the MCE to re-align the words in question.
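The split itself is simple; it's the re-alignment afterwards that needs the MCE. A deliberately naive sketch of the split, which assumes every initial յ is the preposition (a real version must know about words that genuinely begin with յ):

```perl
use utf8;
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Naive sketch: split an initial յ off as the preposition ի.  A real
# version needs morphological knowledge to decide when to split.
sub split_prefix {
    my $word = shift;
    return ( 'ի', $1 ) if $word =~ /^յ(.+)/;
    return ($word);
}

my @words = split_prefix('յաշխարհն');
print "@words\n";    # prints "ի աշխարհն"
```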
As usual, I've launched into a whole raft of explanation, and not even asked for anyone's opinion on specific questions yet. Maybe tomorrow.
Now that I've explained this much of my design, I am going to have to apologize for confusing my readers, because I'm about to overload a term. (This just goes to show how terrible I am at nomenclature when programming.)
Anyone who has ever looked at a critical edition of any text will see the word variant tossed about casually, alongside the word reading, under the assumption that both of these words are self-explanatory. I have told the manuscript collation engine (MCE) that a "variant" is any word that is not a "match" to its base word. Now I'm going to have to tell the rest of my program something else.
From the point of view of the reader of a critical edition, a "variant" is a chunk of text (be it one word or several) that does not appear in the base text, but appears in the apparatus below the base text. An apparatus is a specially formatted block of footnotes below the main text of an edition. The footnotes encode information about the "variant readings" found in manuscripts. Peter Robinson, who worked on this sort of thing long before I did, defined a "reading" as a "minimally distinctive unit of sense". (ref) A "unit of sense" is usually a word, but could be, say, half of a compound word. On the other hand, when you have several "units of sense" lined up in one sentence in one manuscript, it is most efficient and understandable to present them as a single reading. The "variant" readings, therefore, are the ones that vary from the base text.
So you may begin to see the problem. I've been talking about a variant as if it were always a single word, always defined in reference to the first text in which I saw any word at all in that position, and always significantly different from the first word encountered. This is great for a first pass of difference detection, but if I published that information unmodified my edition would be incomprehensible rubbish.
For an editor, therefore, a variant is defined in relation to what I've chosen as the "right" word, and it is any difference at all within reason. And then I have to define what my bounds of reason are. Those bounds are often defined arbitrarily by the editor. One editor may choose, due to space constraints, to publish only those variants which he judges "substantial". Another may choose to publish all variants that aren't simply orthographic variations. Another may choose to publish all variants that make some sort of grammatical sense, and omit the ungrammatical ones. Editions of the New Testament usually include everything, because minute differences have a huge impact upon theological study. There are a few online editions appearing; they tend to include everything, because space constraint is not an issue.
What's more, Armenian is an inflected language. The same word can have a different grammatical meaning with a different suffix. The MCE will record the two words as a fuzzy match, but in fact I am going to have to review them and decide whether this "match" represents a sensical variant, a nonsensical (ungrammatical, or misspelled) variant, or simply a variation in orthography.
In fact, the only reason I told the MCE to pay attention to "variants" in the first place is to make my editing job easier in the future. It is useful for me to only have to consider the "similar" words together, and for the computer to reserve the "different" words in the same position for separate consideration. The MCE is only the core of the larger editing program I need, and that editing program must be able to learn from my decisions. That is, if I mark հա՛ոց as an orthographic variation of հայոց in one place, I should not be asked again about that pair of words. This will not only save me a lot of trouble; it will allow me to construct a more consistent, and therefore better, edition.
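One simple way to get that behavior is a decision cache keyed on the (order-independent) word pair; the names and classification labels here are illustrative:

```perl
use utf8;
use strict;
use warnings;

# Sketch of a decision cache: once a pair of words has been classified
# (by me, at the editing interface), never ask about that pair again.
my %decision;

sub classify_pair {
    my ( $ask, $one, $two ) = @_;
    my $key = join "\0", sort $one, $two;    # order-independent key
    $decision{$key} //= $ask->( $one, $two );
    return $decision{$key};
}

# $ask stands in for the prompt to the human editor.
my $asked = 0;
my $ask = sub { $asked++; return 'orthographic' };
classify_pair( $ask, 'հայոց', 'հա՛ոց' );
classify_pair( $ask, 'հա՛ոց', 'հայոց' );    # cached: no second prompt
print "$asked\n";    # prints "1"
```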
It was brought to my attention in a comment on my last post that I didn't do a very good job describing the relationships between words that I create. I'll try to fix this here.
It's really difficult to construct good examples in English, incidentally; we don't have a lot of prefixes or suffixes or case endings, so pretend for the moment that the samples I give in this post are all grammatically valid. (Don't make me break out the lolcat.) That said, given an example set of texts:
Tara has a lot of books about languages.
Tara had alot book to do with languages.
Tera got a lot of book too do with languages.
the collator would line them up thus, as I described previously:
   0    1   2    3   4  5     6     7  8    9
A) Tara has a    lot of books about         languages.
B) Tara had alot        book  to    do with languages.
C) Tera got a    lot of book  too   do with languages.
The base text generated from this would then be:
Tara has a lot of books about do with languages.
Since each word in the base text comes from the top, it is this word that contains linkage information for all the other words. So for this base text we would have:
Tara  -> FUZZYMATCH: Tera
has   -> FUZZYMATCH: had
      -> VARIANT: got
a     -> FUZZYMATCH: alot
books -> FUZZYMATCH: book
about -> VARIANT: to
This does not, however, list every unique word that appears in every column of the texts above. For that, I need to also record the relationship between "to" and "too" in column 6. When the collator finds "too", and fails to find a match with "about", it will look through the list of variants attached to about, find "to", and add "too" as a FUZZYMATCH for it. So the relevant snippet of data structure becomes
about -> VARIANT: to
            -> FUZZYMATCH: too
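In code, that second-pass lookup might be sketched like this, where fuzzy_match is a crude prefix-based stand-in for the real comparator:

```perl
use strict;
use warnings;

# Stand-in for the real fuzzy comparator: call it a match if one word is
# a prefix of the other.  The real test is more sophisticated.
sub fuzzy_match {
    my ( $one, $two ) = @_;
    return index( $one, $two ) == 0 || index( $two, $one ) == 0;
}

# Sketch of the lookup described above: a word that fails to match the
# base is checked against the base word's existing variants before a new
# variant is created.
sub place_word {
    my ( $base, $new ) = @_;
    return 'MATCH' if $new eq $base->{word};
    for my $var ( @{ $base->{variants} } ) {
        if ( fuzzy_match( $new, $var->{word} ) ) {
            push @{ $var->{fuzzymatches} }, $new;
            return 'FUZZYMATCH';
        }
    }
    push @{ $base->{variants} }, { word => $new, fuzzymatches => [] };
    return 'VARIANT';
}

# Column 6 from the example: base "about", variant "to", new word "too".
my $about = { word => 'about', variants => [ { word => 'to', fuzzymatches => [] } ] };
print place_word( $about, 'too' ), "\n";    # prints "FUZZYMATCH"
```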
I appear to have been waylaid by a cat, and anyway I've taken up a lot of screen space by drawing out data structures, so I'll continue tomorrow.