
All the Perl that's Practical to Extract and Report


aurum
http://www.eccentricity.org/

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Saturday September 06, 2008
07:38 AM

XML: there's some good after all

[ #37372 ]

Those of you who were at my talk at YAPC might remember my mini-rant against XML. It's annoying to parse; the parsing libraries in Perl are among the more poorly-documented modules I've encountered; it seems in general to be one of those solutions that is over-engineered for any problem I encounter.

Well, last Thursday I spoke to a few guys from the Oxford Text Archive. The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.

Just...ponder that for a few minutes.

Here all this time I've thought of XML as, well, a "Markup Language". It has its uses, but basically I get uncomfortable with XML at the point where it stops being easily human-readable. It was, to say the least, odd to find a set of people who think of data as the basic building blocks of everything, XML as a way to express those building blocks, and XSLT as a way to manipulate them in whatever way they need. It's like object orientation taken to its most frightening extreme.

So it turns out that the XML spec in question—the TEI guidelines—was thought up by a bunch of people who have taken a lot of feedback from scholars who work with texts of all kinds. There are chapters that could use more revision, sure, but basically the TEI XML spec has been informed by a bunch of people who are dealing with the problems I face and a lot more problems besides. As XML goes, it's a spec that's expressive enough for pretty much everything I might hope to encode about the text.

As it happens, I appreciated that fact already. I'd noticed that the TEI gave me a bunch of things to think about when transcribing a manuscript (abbreviations? marginal notes? catchwords? smaller-scale word corrections? abbreviation characters that appear in the manuscripts but aren't yet in Unicode? It's all there!) that I otherwise would have glossed over or interpreted without transcribing. But I was still thinking of it as a markup language—a standardized way of encoding information that might be useful to someone, someday, but not necessarily relevant to reading the words in the text and deriving enough meaning to compare it to other texts. Useful, to some extent, but not useful enough for my immediate problem (comparing the texts, which can reasonably be done word by word, without any meta-information) for me to bother with very deeply.

Meanwhile, a problem I have talked around in these blog posts but not addressed head on is that of data representation and storage. I have the information available in each manuscript; the problem I have not solved yet is "How do I represent that data? More importantly, how do I represent the decisions I make about the significance of that data?" It turns out that, not only can this be done within the TEI spec, but the spec allows for quite a lot of information (e.g. word divisions, morphological analysis—the ability to distinguish grammatically significant variations of words) that I've been looking for my own way to encode.
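As a sketch of what that word-level encoding can look like, here is a fragment using standard TEI elements and attributes (the element names `w`, `@lemma`, and `@ana` are from the TEI guidelines; the text and the analysis identifiers are invented):

```xml
<s>
  <!-- each word carries its dictionary headword and a pointer
       to a morphological analysis defined elsewhere in the file -->
  <w lemma="the">the</w>
  <w lemma="scribe" ana="#noun.nom.sg">scribe</w>
  <w lemma="write" ana="#verb.pres.3sg">writes</w>
</s>
```

Word division and grammatical identity are then explicit in the data, rather than something a comparison program has to guess at.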

The upshot is, TEI XML makes it very easy and straightforward (well, for some definitions of "easy" and "straightforward"; I'll come back to this, probably in the next post) to mark and represent words, prefixes, suffixes, sectional divisions, marginal notes, and all sorts of stuff that may or may not prove to be significant. All I have to do is parse this information as it is given, rather than making heuristic guesses about how to derive it. I currently feed plaintext strings to my collator; there's no reason I can't feed regularized words based on the XML transcription.
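For instance, a couple of the features mentioned above might be transcribed along these lines (a hedged sketch; `choice`, `abbr`, `expan`, and `note` are standard TEI elements, while the text itself is invented):

```xml
<p>
  <!-- an abbreviated word, recording both what the scribe wrote
       and the editor's expansion of it -->
  <w><choice><abbr>dns</abbr><expan>dominus</expan></choice></w>
  <!-- a note written in the margin of the page -->
  <note place="margin">added by a later hand</note>
</p>
```

A regularizing pass can then pull out just the expanded forms for collation, while the diplomatic transcription survives intact.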

Not only does TEI handle manuscript description; it also handles representation of critical editions. As I may have explained before, a critical edition generally presents a base text and an "apparatus", i.e. a specially-formatted block of footnotes, that contains the variations present in all the manuscripts of the text. From a data-representation point of view, the important thing here is that each word can be composed of a "lemma"—the base word—and its different "readings". Viewed that way, even the lemma is optional. A word can be composed of nothing but its variant readings.
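In TEI terms, one such apparatus entry might look like this (a sketch; `app`, `lem`, and `rdg` are the TEI critical-apparatus elements, and the `@wit` values point at witness declarations elsewhere in the file):

```xml
<app>
  <lem wit="#A #B">uox</lem>
  <rdg wit="#C">uoces</rdg>
  <!-- an empty reading: the word is absent from witness D -->
  <rdg wit="#D"/>
</app>
```

Drop the `<lem>` and the entry is, as described above, nothing but its variant readings.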

And this is the first, easiest, thing my collator gives me. I make each "row" in my collation output into a list of readings, and write it out according to the TEI spec. When I'm ready to start editing, my program can read that file, present the options to me whenever there's more than one reading, and save my editing decisions back into the XML file. Then I can use pre-existing XSLT files to translate that result into LaTeX and printed text. This is particularly good, because as far as I'm concerned the only "good" form of XSLT is "XSLT that someone else has written and tested."
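That write-out step can be sketched roughly as follows (assuming XML::LibXML, the usual Perl binding for libxml2; the row structure and the `row_to_app` function are my own invention, not part of any existing tool):

```perl
use strict;
use warnings;
use XML::LibXML;

my $TEI_NS = 'http://www.tei-c.org/ns/1.0';

# A collation "row" maps each attested reading to the sigla of the
# witnesses that carry it.
sub row_to_app {
    my ( $doc, $row ) = @_;
    my $app = $doc->createElementNS( $TEI_NS, 'app' );
    for my $reading ( sort keys %$row ) {
        my $rdg = $doc->createElementNS( $TEI_NS, 'rdg' );
        $rdg->setAttribute( 'wit', join ' ', map { "#$_" } @{ $row->{$reading} } );
        $rdg->appendText( $reading );
        $app->appendChild( $rdg );
    }
    return $app;
}

my $doc = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $app = row_to_app( $doc, { uox => [ 'A', 'C' ], uouem => ['B'] } );
$doc->setDocumentElement( $app );
print $doc->toString(1);
```

Editing decisions then amount to promoting one `<rdg>` to a `<lem>` and writing the file back out.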

In short, other people have already thought about this problem, and I can use the fruits of their labor with only a very small abuse of their intentions. The only real cost is having to bash my head against libxml2.

  • One of the things that I think is missing from TEI is XLink [w3.org]. A project of mine, which I called the "Critical Edition Browser", would graphically show the connections between various recensions and copies of a text so that no one text is privileged over any other (a classical critical edition set-up tends to do exactly that). Basically, what I would want is two TEI-encoded texts that have XLink arcs to each other in such a way as to show the lemma and stemma between the two (or more) texts. This would obviate the

  • The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.

    Just…ponder that for a few minutes.

    Nothing bizarre about that at all. :-) I can’t claim to be a decade-of-experience expert in XSLT as I can claim to be in Perl, but I am very good with the language, and I like it a whole lot. The syntax is dreadfully verbose, but at the semantic level – its computation model – it is extremely elegant. You can

    • I guess the thing I find frustrating about libxml2 is that I want a nice compact way of saying "Get me the one-and-only FOO child element from the one-and-only BAR element of the document." Am I missing something?

      It's also possible - moderately likely, even - that I'll convert my parsing to a SAX model, and that will make that particular frustration go away.

      However, the real problem I have with libxml2 at the moment is that it doesn't like the TEI RelaxNG files, and I don't know whose fault that is. It me

      • Am I missing something?

        Yes, XPath. Forget the DOM API, and for the most part, SAX as well.

        However, the real problem I have with libxml2 at the moment is that it doesn’t like the TEI RelaxNG files

        Ah, yes. The validation support in libxml2 is not all that great.
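        In XML::LibXML terms, the "one-and-only child" question above does reduce to a single XPath call (a sketch with invented element names):

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => <<'XML' );
<doc><BAR><FOO>payload</FOO></BAR></doc>
XML

# One XPath expression replaces a chain of DOM traversal calls.
my ($foo) = $doc->findnodes('/doc/BAR/FOO');
print $foo->textContent, "\n";
```

        For TEI documents, which live in a namespace, the same query needs an XML::LibXML::XPathContext with the TEI namespace registered via registerNs, and a prefix on each step of the path.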

    • I may have asked this before, but is it XSLT you like or XPath? I've never managed to like XSLT, but I do like XPath. The syntax isn't always perfect, but I can't think of improvements.

      • Both. XPath isn’t dreadfully verbose; XSLT is. (It would greatly benefit from a non-XML rendition of its syntax, just like RelaxNG has both an XML and a Compact syntax.) But the basic model (recursive node visiting) is a perfect match for XSLT’s job. The apply-templates directive is basically a map with polymorphic callback using XPath-based dispatch. That’s all there is to XSLT.

        Of course, most people write for-each-heavy transforms instead, so they gain none of the elegance of this model.
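        The model described above, in its smallest form (a sketch; the `emph` element name is invented): each template says what to do with one kind of node, and apply-templates recursively dispatches the children to whichever template matches.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- render an <emph> element as italics, then recurse into its children -->
  <xsl:template match="emph">
    <i><xsl:apply-templates/></i>
  </xsl:template>
  <!-- any other element: no output of its own, just recurse
       (this mimics the built-in rule XSLT applies by default) -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```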