
aurum (8572)

http://www.eccentricity.org/

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Thursday October 02, 2008
11:44 AM

the things made possible by TEI

[ #37592 ]

It's been a while since I've given any sort of status update on my collation project. I've spent most of the past few weeks writing the "conventional" half of my thesis, in which I have to prove that I can talk intelligently about medieval Armenian literature without hiding behind source code.

I have made some progress, though. As of a week or two ago, I re-tooled my collation engine to work with three kinds of input: plain text, trivial TEI, and TEI in which each word is marked up with the <w> tag. That last one is important, because it means I no longer have to assume that words are whitespace-separated. Now, as long as you provide semantic markup to define "what is a word?", and you provide a canonization function for your script if necessary, the collation engine should be able to handle any text in any script at all.
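To make the difference between the input types concrete, here is a minimal sketch of the two ways of deciding "what is a word?". It is written in Python rather than the Perl of the actual engine, and the function names are made up for illustration; this is not the collator's real code. It assumes <w> elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET


def words_from_plain(text):
    """Plain-text input: assume words are whitespace-separated."""
    return text.split()


def words_from_tei(xml_string):
    """TEI input with <w> markup: a word is whatever the encoder wrapped in <w>.

    Tags are compared by local name, so this works with or without the
    TEI namespace declared on the document. Assumes <w> is not nested.
    """
    root = ET.fromstring(xml_string)
    words = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip "{namespace}" if present
        if local_name == "w":
            words.append("".join(elem.itertext()))
    return words
```

With the markup version, whitespace stops mattering: `<w>ab</w><w>c d</w>` is two words, even though the first pair has no space between them and the second contains one.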

(The canonization function is meant to, well, canonize the orthographic variants within a script so that the collator will trivially recognize them as the same word. So for Armenian, it means that the letter օ is the same as աւ, and the ligature և is the same as the two letters 'ե'+'ւ', and a few other things. Since I don't want to learn the rules for all human languages, I just leave a place for the user to provide a coderef to do this.)
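As a rough illustration of such a canonizer, the sketch below covers only the two Armenian rules mentioned above. It is Python, with a plain callable standing in for the Perl coderef; the names `canonize_armenian` and `same_word` are hypothetical, and the choice to expand to the longer spelling (rather than contract to the shorter one) is my assumption for the example, not necessarily what the engine does.

```python
def canonize_armenian(word):
    """Map orthographic variants to one spelling (only the two rules from the post)."""
    word = word.replace("\u0587", "\u0565\u0582")  # ligature և -> ե + ւ
    word = word.replace("\u0585", "\u0561\u0582")  # letter օ -> ա + ւ
    return word


def same_word(a, b, canonizer=None):
    """Compare two words, optionally through a user-supplied canonizer.

    The canonizer plays the role of the coderef: any callable that takes
    a word and returns its canonical form.
    """
    if canonizer is not None:
        a, b = canonizer(a), canonizer(b)
    return a == b
```

So `same_word("օր", "աւր", canonize_armenian)` treats the two spellings as one word, while `same_word("օր", "աւր")` without a canonizer sees two different strings.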

As long as I was re-tooling my code, I also took the opportunity to try this "test-driven development" thing that seems to be all the best-practices rage at the moment. It certainly works to some extent (I have plenty of tests now, and find it very easy to run them every time I change some code), but as the project gets more complex, I'm finding it harder to have the patience to nail down the design and write the tests before I just plunge into the code.

Finally, as a reward for reading this far, I give you a TEI encoding (with commentary; watch carefully) of Bob Dylan's "Subterranean Homesick Blues". Well worth watching.
