Those of you who were at my talk at YAPC might remember my mini-rant against XML. It's annoying to parse; the parsing libraries in Perl are among the more poorly-documented modules I've encountered; it seems in general to be one of those solutions that is over-engineered for any problem I encounter.
Well, last Thursday I spoke to a few guys from the Oxford Text Archive. The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.
Just...ponder that for a few minutes.
Here all this time I've thought of XML as, well, a "Markup Language". It has its uses, but basically I get uncomfortable with XML at the point where it stops being easily human-readable. It was, to say the least, odd to find a set of people who think of data as the basic building blocks of everything, and XML as a way to express these building blocks, and XSLT as a way to manipulate these building blocks in whatever way they need. It's like object orientation taken to its most frightening extreme.
So it turns out that the XML spec in question—the TEI guidelines—was thought up by a bunch of people who have taken a lot of feedback from scholars who work with texts of all kinds. There are chapters that could use more revision, sure, but basically the TEI XML spec has been informed by a bunch of people who are dealing with the problems I face and a lot more problems besides. As XML goes, it's a spec that's expressive enough for pretty much everything I might hope to encode about the text.
As it happens, I appreciated that fact already. I'd noticed that the TEI gave me a bunch of things to think about when transcribing a manuscript (abbreviations? marginal notes? catchwords? smaller-scale word corrections? abbreviation characters that appear in the manuscripts but aren't yet in Unicode? It's all there!) that I otherwise would have glossed over or interpreted without transcribing. But I was still thinking of it as a markup language—a standardized way of encoding information that might be useful to someone, someday, but not necessarily relevant to reading the words in the text and deriving enough meaning to compare it to other texts. Useful, to some extent, but not useful enough for my immediate problem (comparing the texts, which can reasonably be done word by word, without any meta-information) for me to bother with very deeply.
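To give a flavor of what the guidelines cover, here's a rough sketch of a transcription fragment. The element names (`choice`, `abbr`, `expan`, `sic`, `corr`, `note`) are real TEI, but the text and structure are invented for illustration:

```xml
<p>
  <!-- an abbreviation in the manuscript, recorded alongside its expansion -->
  <choice>
    <abbr>dns</abbr>
    <expan>dominus</expan>
  </choice>
  <!-- a note written in the margin -->
  <note place="margin">a later hand adds a gloss here</note>
  <!-- a small-scale scribal correction -->
  <choice>
    <sic>wrod</sic>
    <corr>word</corr>
  </choice>
</p>
```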
Meanwhile, a problem I have talked around in these blog posts but not addressed head-on is that of data representation and storage. I have the information available in each manuscript; the problem I have not solved yet is "How do I represent that data? More importantly, how do I represent the decisions I make about the significance of that data?" It turns out that, not only can this be done within the TEI spec, but the spec allows for quite a lot of information (e.g. word divisions, morphological analysis—the ability to distinguish grammatically significant variations of words) that I've been looking for my own way to encode.
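The word-division and morphological encoding can be sketched like so. The `w` element and its `lemma` and `ana` attributes are genuine TEI; the words and analysis codes are made up for the example:

```xml
<!-- each word carries its dictionary headword and a grammatical analysis -->
<s>
  <w lemma="queen" ana="#noun-sg">quene</w>
  <w lemma="speak" ana="#verb-past">spak</w>
</s>
```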
The upshot is, TEI XML makes it very easy and straightforward (well, for some definitions of "easy" and "straightforward"; I'll come back to this, probably in the next post) to mark and represent words, prefixes, suffixes, sectional divisions, marginal notes, and all sorts of stuff that may or may not prove to be significant. All I have to do is parse this information as it is given, rather than making heuristic guesses about how to derive it. I currently feed plaintext strings to my collator; there's no reason I can't feed regularized words based on the XML transcription.
TEI handles not only manuscript description but also the representation of critical editions. As I may have explained before, a critical edition generally presents a base text and an "apparatus", i.e. a specially formatted block of footnotes that contains the variations present in all the manuscripts of the text. From a data-representation point of view, the important thing here is that each word can be composed of a "lemma"—the base word—and its different "readings". Viewed that way, even the lemma is optional: a word can be composed of nothing but its variant readings.
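In TEI terms this is the critical-apparatus module. The `app`, `lem`, and `rdg` elements below are real TEI; the witness sigla and readings are hypothetical:

```xml
<!-- an apparatus entry: one lemma, two variant readings -->
<app>
  <lem wit="#msA">worde</lem>
  <rdg wit="#msB">word</rdg>
  <rdg wit="#msC">wordes</rdg>
</app>

<!-- the lemma is optional: a word that is nothing but its readings -->
<app>
  <rdg wit="#msA #msB">his</rdg>
  <rdg wit="#msC">hir</rdg>
</app>
```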
And this is the first and easiest thing my collator gives me. I make each "row" in my collation output into a list of readings and write it out according to the TEI spec. When I'm ready to start editing, my program can read that file, present the options to me whenever there's more than one reading, and save my editorial decisions back into the XML file. Then I can use pre-existing XSLT files to translate the result into LaTeX and printed text. This is particularly good, because as far as I'm concerned the only "good" form of XSLT is "XSLT that someone else has written and tested."
In short, other people have already thought about this problem, and I can use the fruits of their labor with only a very small abuse of their intentions. The only real cost is having to bash my head against libxml2.