aurum's Journal http://use.perl.org/~aurum/journal/ public thanks http://use.perl.org/~aurum/journal/38499?from=rss <p>It's not every day that the Perl community <a href="http://www.eccentricity.org/2009/02/making_my_gratitude_public.html">gets acknowledged</a> in a humanities thesis. Just thought I would make sure it was seen.</p><p>(The thesis itself will be made available after it's been examined and corrected.)</p> aurum 2009-02-17T13:50:31+00:00 journal the things made possible by TEI http://use.perl.org/~aurum/journal/37592?from=rss <p>It's been a while since I've given any sort of status update on my collation project. I've spent most of the past few weeks writing the "conventional" half of my thesis, in which I have to prove that I can talk intelligently about medieval Armenian literature without hiding behind source code.</p><p>I have made some progress though. As of a week or two ago, I re-tooled my collation engine to work with plain-text input, trivial TEI input, and TEI input in which each word is marked up with the &lt;w&gt; tag. That last is important, because it means I no longer have to assume that words are whitespace-separated. Now, as long as you provide semantic markup to define "what is a word?", and you provide a canonization function for your script if necessary, the collation engine should be able to handle any text in any script at all.</p><p>(The canonization function is meant to, well, canonize the orthographic variants within a script so that the collator will trivially recognize them as the same word. So for Armenian, it means that the letter &#1413; is the same as &#1377;&#1410;, and the ligature &#1415; is the same as the two letters '&#1381;'+'&#1410;', and a few other things. Since I don't want to learn the rules for all human languages, I just leave a place for the user to provide a coderef to do this.)</p>
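<p>(For the curious, a coderef that handles just those two equivalences might look something like the sketch below. Which spelling counts as "canonical" is an arbitrary choice, and this is an illustration rather than the code as it actually stands.)</p><p><code>my $canonizer = sub {<br>&nbsp; &nbsp; my $word = shift;<br>&nbsp; &nbsp; # the late-orthography letter U+0585 stands for the classical digraph U+0561 U+0582<br>&nbsp; &nbsp; $word =~ s/\x{585}/\x{561}\x{582}/g;<br>&nbsp; &nbsp; # the ligature U+0587 stands for its two component letters U+0565 U+0582<br>&nbsp; &nbsp; $word =~ s/\x{587}/\x{565}\x{582}/g;<br>&nbsp; &nbsp; return $word;<br>};</code></p>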
<p>As long as I was re-tooling my code, I also took the opportunity to try this "test-driven development" thing that seems to be all the best-practices rage at the moment. It certainly works to some extent&#8212;I have plenty of tests now, and find it very easy to run them every time I change some code&#8212;but as the project gets more complex, I'm finding it harder to have the patience to nail down the design and write the tests before I just plunge into the code.</p><p>Finally, as a reward for reading this far, I give you <a href="http://www.youtube.com/watch?v=4sHYDfITjHY">a TEI encoding</a> (with commentary; watch carefully) of Bob Dylan's "Subterranean Homesick Blues". Well worth watching.</p> aurum 2008-10-02T16:44:51+00:00 journal slides online http://use.perl.org/~aurum/journal/37429?from=rss <p>I have been in Paris this week, at the conference of the Association Internationale des &#201;tudes Arm&#233;niennes. I gave another version of the talk I gave at YAPC. Since this conference was more serious, my slides are somewhat more useful as standalone information, and so I've <a href="http://www.slideshare.net/tla/101011-manuscripts-approaches-to-the-digitisation-of-the-chronicle-of-matthew-of-edessa-presentation">put them online.</a> Enjoy.</p><p>(Yes, I am caught up in the Eurostar mess. I don't know yet how I'm getting back to the UK. I'll find out tomorrow.)</p> aurum 2008-09-12T23:05:36+00:00 journal cpan module #3 http://use.perl.org/~aurum/journal/37376?from=rss <p>Today I released <a href="http://search.cpan.org/~aurum/Text-TEI-Markup-1.0/lib/Text/TEI/Markup.pm">the first small piece</a> of the Collation Project. (Yes, I have another research proposal I ought to be writing. Yes, I spent hours today writing documentation and formalizing tests. What's your point?)</p><p>This piece addresses the problem that is efficient transcription of manuscripts. It is my weird idea of a markup language for <a href="http://www.tei-c.org/">TEI XML</a>. As an added bonus for people who aren't me, it exports a function to take an existing TEI XML file (well, string), parse it, wrap all the whitespace-separated words in <code>&lt;w/&gt;</code> ("word") tags, and return the new file. Identifying the words is, after all, step one in efficient word collation.</p><p>This also means that my collator should be able to handle pretty much any language or writing system, as long as the basic unit of meaning that ought to be collated is enclosed within a <code>&lt;w/&gt;</code> tag. When it's done, of course.</p><p>This also means that I am going to need a module name for the collator soon. Suggestions?</p> aurum 2008-09-06T22:56:23+00:00 journal XML: there's some good after all http://use.perl.org/~aurum/journal/37372?from=rss <p>Those of you who were at my talk at YAPC might remember my mini-rant against XML. It's annoying to parse; the parsing libraries in Perl are among the more poorly-documented modules I've encountered; it seems in general to be one of those solutions that is over-engineered for any problem I encounter.</p><p>Well, last Thursday I spoke to a few guys from the Oxford Text Archive. The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.</p><p>Just...ponder that for a few minutes.</p><p>Here all this time I've thought of XML as, well, a "Markup Language". It has its uses, but basically I get uncomfortable with XML at the point where it stops being easily human-readable. It was, to say the least, odd to find a set of people who think of data as the basic building blocks of everything, and XML as a way to express these building blocks, and XSLT as a way to manipulate these building blocks in whatever way they need. It's like object orientation taken to its most frightening extreme.</p><p>So it turns out that the XML spec in question&#8212;the <a href="http://www.tei-c.org/">TEI guidelines</a>&#8212;was thought up by a bunch of people who have taken a lot of feedback from scholars who work with texts of all kinds. There are chapters that could use more revision, sure, but basically the TEI XML spec has been informed by a bunch of people who are dealing with the problems I face and a lot more problems besides. As XML goes, it's a spec that's expressive enough for pretty much everything I might hope to encode about the text.</p><p>As it happens, I appreciated that fact already.
I'd noticed that the TEI gave me a bunch of things to think about when transcribing a manuscript (abbreviations? marginal notes? catchwords? smaller-scale word corrections? abbreviation characters that appear in the manuscripts but aren't yet in Unicode? It's all there!) that I otherwise would have glossed over or interpreted without transcribing. But I was still thinking of it as a markup language&#8212;a standardized way of encoding information that might be useful to someone, someday, but not necessarily relevant to reading the words in the text and deriving enough meaning to compare it to other texts. Useful, to some extent, but not useful enough for my immediate problem (comparing the texts, which can reasonably be done word by word, without any meta-information) for me to bother with very deeply.</p><p>Meanwhile, a problem I have talked around in these blog posts but not addressed head on is that of data representation and storage. I have the information available in each manuscript; the problem I have not solved yet is "How do I represent that data? More importantly, how do I represent the decisions I make about the significance of that data?" It turns out that, not only can this be done within the TEI spec, but the spec allows for quite a lot of information (e.g. word divisions, morphological analysis&#8212;the ability to distinguish grammatically significant variations of words) that I've been looking for my own way to encode.</p><p>The upshot is, TEI XML makes it very easy and straightforward (well, for some definitions of "easy" and "straightforward"; I'll come back to this, probably in the next post) to mark and represent words, prefixes, suffixes, sectional divisions, marginal notes, and all sorts of stuff that may or may not prove to be significant. All I have to do is parse this information as it is given, rather than making heuristic guesses about how to derive it. I currently feed plaintext strings to my collator; there's no reason I can't feed regularized words based on the XML transcription.</p><p>Not only does TEI handle manuscript description; it also handles representation of critical editions. As I may have explained before, a critical edition generally presents a base text and an "apparatus", i.e. a specially-formatted block of footnotes, that contains the variations present in all the manuscripts of the text. From a data-representation point of view, the important thing here is that each word can be composed of a "lemma"&#8212;the base word&#8212;and its different "readings". Viewed that way, even the lemma is optional. A word can be composed of nothing but its variant readings.</p><p>And this is the first, easiest, thing my collator gives me. I make each "row" in my collation output into a list of readings, and write it out according to the TEI spec. When I'm ready to start editing, my program can read that file, present the options to me whenever there's more than one reading, and save my editing decisions back into the XML file. Then I can use pre-existing XSLT files to translate that result into LaTeX and printed text. This is particularly good, because as far as I'm concerned the only "good" form of XSLT is "XSLT that someone else has written and tested."</p><p>In short, other people have already thought about this problem, and I can use the fruits of their labor with only a very small abuse of their intentions. 
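</p><p>(To make that concrete: a single "row" of collation output, with one reading chosen as the lemma and the rest as variants, comes out as a TEI &lt;app&gt; element. Here is a sketch of the shape of the thing, built with XML::LibXML; the witness sigla and the English words are invented for illustration, and this is not an excerpt of my actual code.)</p><p><code>use XML::LibXML;<br><br># one collation row: a lemma plus its variant readings<br>my $app = XML::LibXML::Element-&gt;new( 'app' );<br>foreach my $r ( [ 'lem', '#A', 'has' ],<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [ 'rdg', '#B', 'had' ],<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [ 'rdg', '#C', 'got' ] ) {<br>&nbsp; &nbsp; my $reading = XML::LibXML::Element-&gt;new( $r-&gt;[0] );<br>&nbsp; &nbsp; $reading-&gt;setAttribute( 'wit', $r-&gt;[1] );<br>&nbsp; &nbsp; $reading-&gt;appendText( $r-&gt;[2] );<br>&nbsp; &nbsp; $app-&gt;appendChild( $reading );<br>}<br>print $app-&gt;toString, "\n";<br># prints: &lt;app&gt;&lt;lem wit="#A"&gt;has&lt;/lem&gt;&lt;rdg wit="#B"&gt;had&lt;/rdg&gt;&lt;rdg wit="#C"&gt;got&lt;/rdg&gt;&lt;/app&gt;</code></p><p>Leave out the &lt;lem&gt; and, as I said above, the word consists of nothing but its variant readings.</p><p>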
The only real cost is having to bash my head against libxml2.</p> aurum 2008-09-06T12:38:44+00:00 journal Progress, of sorts http://use.perl.org/~aurum/journal/37325?from=rss <p>I now have a script which produces output that looks like this. Each capital letter represents a manuscript. (OK, so in real life the words are lined up in columns, but I can't make use.perl play nicely with Unicode characters inside an &lt;ecode&gt; tag, which is the one that would preserve spacing.)</p><p><code><br>Word variation! Context:<br>&#1396;&#1387;&#1398;&#1401;&#1381;&#1410; &#1409;&#1377;&#1397;&#1405; &#1406;&#1377;&#1397;&#1408;&#1405; &#1378;&#1377;&#1382;&#1396;&#1377;&#1403;&#1377;&#1398; &#1381;&#1410; &#1381;&#1410; &#1377;&#1399;&#1389;&#1377;&#1407;&#1377;&#1410;&#1400;&#1408; &#1412;&#1398;&#1398;&#1400;&#1410;&#1385;&#1381;&#1377;&#1396;&#1378; &#1379;&#1407;&#1381;&#1377;&#1388; &#1379;&#1408;&#1381;&#1409;&#1377;&#1412; &gt;&gt; &#1387; &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1392;&#1377;&#1408;&#1387;&#1410;&#1408;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409;, &#1382;&#1400;&#1408;&#1405; &#1387; &lt;&lt; &#1378;&#1377;&#1382;&#1400;&#1410;&#1396; &#1386;&#1377;&#1396;&#1377;&#1398;&#1377;&#1391;&#1377;&#1409; &#1392;&#1381;&#1407;&#1377; &#1392;&#1381;&#1407;&#1377;&#1412;&#1398;&#1398;&#1381;&#1377;&#1388; &#1392;&#1377;&#1405;&#1400;&#1410; &#1381;&#1394;&#1377;&#1412;&#1417; &#1384;&#1398;&#1380; &#1377;&#1397;&#1398;&#1412;&#1377;&#1398;&#1381;&#1377;&#1409; &#1407;&#1381;&#1405;&#1400;&#1394;&#1377;&#1409;&#1398; &#1381;&#1410;</code></p><p><code>Base &#1387; &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1392;&#1377;&#1408;&#1387;&#1410;&#1408;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409;, &#1382;&#1400;&#1408;&#1405; &#1387;<br>----<br>ABH: &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1392;&#1377;&#1408;&#1387;&#1410;&#1408;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409; &#1382;&#1400;&#1408;&#1405; &#1387;<br>G: &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405;&#1398; &#1392;&#1377;&#1408;&#1387;&#1410;&#1408;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409; &#1382;&#1400;&#1408;&#1405; &#1387;<br>C: &#1387; &#1386;&#1377;&#1396;&#1377;&#1398;&#1377;&#1391;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1392;&#1377;&#1408;&#1387;&#1410;&#1408;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409;&#1398; &#1382;&#1400;&#1408;&#1405;<br>J: &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1395;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409; &#1382;&#1400;&#1408; &#1387;<br>DFI: &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1395;&#1387;&#1409; &#1377;&#1396;&#1377;&#1409; &#1382;&#1400;&#1408;&#1405; &#1387;<br>E: &#1382;&#1399;&#1377;&#1408;&#1377;&#1379;&#1408;&#1377;&#1391;&#1377;&#1398; &#1379;&#1408;&#1381;&#1377;&#1388;&#1405; &#1382;&#1395; &#1377;&#1396;&#1377;&#1409; &#1382;&#1400;&#1408;&#1405; &#1387;<br></code></p><p>Of course it doesn't take any input yet. 
One thing at a time.</p> aurum 2008-09-02T03:35:11+00:00 journal exploitation of the masses http://use.perl.org/~aurum/journal/37297?from=rss <p>Hey there's a thought.</p><p>Maybe I should flesh out some more design of this beast I'm writing, and then organize a hackathon.</p> aurum 2008-08-28T22:30:32+00:00 journal moving on from collation http://use.perl.org/~aurum/journal/37273?from=rss <p>It seems that I would much rather talk about software design issues than write this research proposal. Well, I may as well get something useful done.</p><p>So far, I have described the design of what I have been calling the "MCE", or the "manuscript collation engine". It works pretty well at this point, and when I run it on a bunch of transcribed text files, I get a bunch of arrays back, full of Word objects that are lined up neatly according to similarity and relative placement. Now I just have to use them. This is where I start speculating about what to do next.</p><p>I said at some point that I would talk about the structure of a Word object, but really there is little enough to tell. A Word is an object that will keep track of whatever I tell it to remember about a particular word&#8212;its matches, its variants, its original or normalized or canonicalized spelling, its punctuation, whether it should be considered the "base" word or a "variant" word or an "error".</p><p>Of course, many of the attributes I might want "remembered" can't actually be detected at collation time. Some of them are editing decisions, and others need the judgment of a human (or a set of rules) that understands things about the Armenian language. It's high time I wrote the editing interface.</p><p>(Nomenclature will be the death of this project, incidentally. It's bad enough that the computer world and the critical-text-edition world use the word "collation" differently. Now I want to write a program that, in the terminology of the humanities, ought to be called a "text editor." Great.)</p><p>So. I start with a bunch of arrays of words, and the superficial relationships between them. The end result should be a base text, and an apparatus that records the set of variants that I have judged to be worth recording. In the meantime, I should have had to do as little work as possible. This means several things: </p><ul> <li>I need to remember which word goes with which manuscript.</li><li>I need a way of marking a word as "base", that is, the accepted main reading.</li><li>I need a hierarchical series of categories of "variant", including but not limited to: <ul> <li>Grammatically sensical differences</li><li>Apparent grammatical errors</li><li>Orthography variations</li></ul></li> <li>I need to be able to "smooth" strings of variant words into a single variant.</li><li>I need a means of "teaching" the program about my decisions, so that I am never asked more than once about an orthographic variation.</li><li>I need a way of saving the decisions I've made.</li></ul><p>But those are just the easy things. Two aspects of the problem are particularly tricky. The first is punctuation. The punctuation in Armenian manuscripts is all over the place. Do I mostly disregard it? Do I treat punctuation marks as words in their own right? Do I show it all on a case-by-case basis, and thereby give myself more work?</p><p>The second is the issue of partial-word readings. Remember that a "reading" is a "minimally distinctive unit of sense"; that means that a single word may contain multiple readings. Prefixes can have grammatical effects. 
For example: </p><ul> <li>&#1377;&#1399;&#1389;&#1377;&#1408;&#1392; (ashkharh): "land", but </li><li>&#1377;&#1399;&#1389;&#1377;&#1408;&#1392;&#1398; (ashkharhn): "the land", and </li><li>&#1397;&#1377;&#1399;&#1389;&#1377;&#1408;&#1392;&#1398; (yashkharhn): "into the land".</li></ul><p> The last is especially tricky, as it can either be written as the single word &#1397;&#1377;&#1399;&#1389;&#1377;&#1408;&#1392;&#1398;, or as two separate words &#1387; &#1377;&#1399;&#1389;&#1377;&#1408;&#1392;&#1398;. If I am standardizing the orthography across manuscripts, I should separate the prefix &#1397;, converting it to the preposition &#1387;; I'll have to split the Word object, and align the resulting pair of Words with the Words in my other arrays. The alignment and word matching is a problem I have already solved with the MCE, but this means that the editing program will have to call back into the MCE to re-align the words in question.</p><p>As usual, I've launched into a whole raft of explanation, and not even asked for anyone's opinion on specific questions yet. Maybe tomorrow.</p> aurum 2008-08-25T23:32:19+00:00 journal variations on "variants" http://use.perl.org/~aurum/journal/37265?from=rss <p>Now that I've explained this much of my design, I am going to have to apologize for confusing my readers, because I'm about to overload a term. (This just goes to show how terrible I am at nomenclature when programming.)</p><p>Anyone who has ever looked at a critical edition of any text will see the word <i>variant</i> tossed about casually, alongside the word <i>reading</i>, under the assumption that both of these words are self-explanatory. I have told the manuscript collation engine (MCE) that a "variant" is any word that is not a "match" to its base word. Now I'm going to have to tell the rest of my program something else.</p><p>From the point of view of the reader of a critical edition, a "variant" is a chunk of text (be it one word or several) that does not appear in the base text, but appears in the <i>apparatus</i> below the base text. An apparatus is a specially formatted block of footnotes below the main text of an edition. The footnotes encode information about the "variant readings" found in manuscripts. <a href="http://www.theology.bham.ac.uk/staff/robinson.htm">Peter Robinson</a>, who worked on this sort of thing long before I did, defined a "reading" as a "minimally distinctive unit of sense". <a href="http://llc.oxfordjournals.org/cgi/content/abstract/4/3/174">(ref)</a> A "unit of sense" is usually a word, but could be, say, half of a compound word. On the other hand, when you have several "units of sense" lined up in one sentence in one manuscript, it is most efficient and understandable to present them as a single reading. The "variant" readings, therefore, are the ones that vary from the base text.</p><p>So you may begin to see the problem. I've been talking about a variant as if it were always a single word, always defined in reference to the first text in which I saw any word at all in that position, and always significantly different from the first word encountered. This is great for a first pass of difference detection, but if I published that information unmodified, my edition would be incomprehensible rubbish.</p><p>As an editor, therefore, a variant is defined in relationship to what I've chosen as the "right" word, and it is <i>any difference at all</i> within reason. And then I have to define what my bounds of reason are.
Those bounds are often defined arbitrarily by the editor. One editor may choose, due to space constraints, to publish only those variants which he judges "substantial". Another may choose to publish all variants that aren't simply orthographic variations. Another may choose to publish all variants that make some sort of grammatical sense, and omit the ungrammatical ones. Editions of the New Testament usually include everything, because minute differences have a huge impact upon theological study. There are a few online editions appearing; they tend to include everything, because space constraint is not an issue.</p><p>What's more, Armenian is an inflected language. The same word can have a different grammatical meaning with a different suffix. The MCE will record the two words as a fuzzy match, but in fact I am going to have to review them and decide whether this "match" represents a sensical variant, a nonsensical (ungrammatical, or misspelled) variant, or simply a variation in orthography.</p><p>In fact, the only reason I told the MCE to pay attention to "variants" in the first place is to make my editing job easier in the future. It is useful for me to only have to consider the "similar" words together, and for the computer to reserve the "different" words in the same position for separate consideration. The MCE is only the core of the larger editing program I need, and that editing program must be able to learn from my decisions. That is, if I mark &#1392;&#1377;&#1371;&#1400;&#1409; as an orthographic variation of &#1392;&#1377;&#1397;&#1400;&#1409; in one place, I should not be asked again about that pair of words. This will not only save me a lot of trouble; it will allow me to construct a more consistent, and therefore better, edition.</p> aurum 2008-08-24T23:11:12+00:00 journal more on word relationships http://use.perl.org/~aurum/journal/37261?from=rss <p>It was brought to my attention in a comment on my last post that I didn't do a very good job describing the relationships between words that I create. I'll try to fix this here.</p><p>It's really difficult to construct good examples in English, incidentally; we don't have a lot of prefixes or suffixes or case endings, so pretend for the moment that the samples I give in this post are all grammatically valid. (Don't make me break out the lolcat.) That said, given an example set of texts:</p><blockquote><div><p> <tt>Tara has a lot of books about languages.<br>Tara had alot book to do with languages.<br>Tera got a lot of book to do with languages.</tt></p></div> </blockquote><p>the collator would line them up thus, as I described previously:</p><blockquote><div><p> <tt>&nbsp; &nbsp;0&nbsp; &nbsp; 1&nbsp; &nbsp;2 3&nbsp; &nbsp; 4&nbsp; 5&nbsp; &nbsp; &nbsp;6&nbsp; &nbsp; &nbsp;7&nbsp; 8&nbsp; &nbsp; 9<br>A) Tara has a lot&nbsp; of books about&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;languages.<br>B) Tara had&nbsp; &nbsp;alot&nbsp; &nbsp; book&nbsp; to&nbsp; &nbsp; do with languages.<br>C) Tera got a lot&nbsp; of book&nbsp; too&nbsp; &nbsp;do with languages.</tt></p></div> </blockquote><p>The base text generated from this would then be:</p><blockquote><div><p> <tt>Tara has a lot of books about do with languages.</tt></p></div> </blockquote><p>Since each word in the base text comes from the top, it is this word that contains linkage information for all the other words. 
So for this base text we would have:</p><blockquote><div><p> <tt>Tara<br> -&gt;&nbsp; &nbsp; FUZZYMATCH: Tera<br>has<br> -&gt;&nbsp; &nbsp; FUZZYMATCH: had<br> -&gt;&nbsp; &nbsp; VARIANT: got<br>a<br>lot<br> -&gt;&nbsp; &nbsp; FUZZYMATCH: alot<br>of<br>books<br> -&gt;&nbsp; &nbsp; FUZZYMATCH: book<br>about<br> -&gt;&nbsp; &nbsp; VARIANT: to<br>do<br>with<br>languages</tt></p></div> </blockquote><p>This does not, however, list every unique word that appears in every column of the texts above. For that, I need to also record the relationship between "to" and "too" in column 6. When the collator finds "too", and fails to find a match with "about", it will look through the list of variants attached to about, find "to", and add "too" as a FUZZYMATCH for it. So the relevant snippet of data structure becomes</p><blockquote><div><p> <tt>about<br> -&gt; VARIANT: to<br>(to<br> -&gt; FUZZYMATCH: too)<br>do<br>...</tt></p></div> </blockquote><p>I appear to have been waylaid by a cat, and anyway I've taken up a lot of screen space by drawing out datastructures, so I'll continue tomorrow.</p> aurum 2008-08-23T23:33:23+00:00 journal Text collation engine: design overview http://use.perl.org/~aurum/journal/37243?from=rss <p>Here I will describe the basic design outline, as it currently stands, for my manuscript text collation engine, a.k.a. the "Armenian Difference Engine." (Later I will ask you to argue with each other about a module name, but not now. Call it the MCE for now.) I welcome, indeed I solicit, feedback and opinions as to how I might do things more cleverly.</p><p>This is being posted without external proofreading, so if something isn't clear, please ask!</p><p> <b>So what's the problem?</b> </p><p>The editor of a historical text begins with some number of manuscript copies of that text. The goal of the exercise is to compare each of these manuscript texts against each other, note the variations among them, and choose the best "reading" out of each of the alternatives wherever a variation occurs. Until recently, this was done manually; what I am building is a program that will calculate the variations and only consult me when I need to exert editing control&#8212;that is, when I need to choose the best reading.</p><p> <b>OK, sure. So how does your program work then?</b> </p><p>Each manuscript text is passed in as a filehandle (or as a string), containing a bunch of words. For my purposes, the text I am passing in has punctuation and orthography variation as represented in the manuscript, but I have expanded the abbreviations. (Eventually I will accept TEI XML and parse that into a string of text; doing that will probably make my life easier writing this program in the same measure that it makes my life more difficult in transcribing and handling conversion to XML.)</p><p>The first step is to transform each "text" into words. Words are non-whitespace characters, separated by whitespace. (Yes that means that, right now, I only support scripts that demarcate words with whitespace.) Each word gets a representation as an MCE::Word object, which I will describe in my next post. Now I have several arrays of Words, where each array represents a manuscript text. In theory, I could create an MCE::Text object with information about the manuscript itself and a chain of words all linked together, but I haven't yet concluded that a simple array is fragile enough to justify the overhead of another OO package. 
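</p><p>(In sketch form, that first step amounts to no more than the following; the MCE::Word constructor arguments here are placeholders, since the real object carries rather more state.)</p><p><code>sub read_manuscript_text {<br>&nbsp; &nbsp; my ( $fh ) = @_;&nbsp; # a filehandle; a string can be wrapped in one<br>&nbsp; &nbsp; my @words;<br>&nbsp; &nbsp; while ( my $line = &lt;$fh&gt; ) {<br>&nbsp; &nbsp; &nbsp; &nbsp; push @words, map { MCE::Word-&gt;new( string =&gt; $_ ) }<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; grep { length } split /\s+/, $line;<br>&nbsp; &nbsp; }<br>&nbsp; &nbsp; return \@words;&nbsp; # one array of Word objects per manuscript text<br>}</code></p><p>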
I may later change my mind.</p><p>Now I have two or more arrays, probably all slightly different lengths. I pull out the string representations of each Word from the first two arrays, and pass them to Algorithm::Diff. Diff can return three answers for any given chunk of text:</p><ul> <li>The chunks are the same. Pass them through, and link them as being the same word.</li><li>One of the chunks is zero-length (addition or deletion.) Pass the non-zero chunk through, and pad the text which contains the zero-length chunk with empty Words. (Actually the same empty Word, to save space.)</li><li>The chunks are not the same. Call &amp;align_and_match_words on the two chunks.</li></ul><p>The &amp;align_and_match_words subroutine takes two (generally relatively short) arrays of Word objects, which may be of varying lengths. It compares each word in one array to each word in the second array to find the "best" match. If, for example, you send representations of two strings to this subroutine:</p><p> <code>This is a short sentence.<br> This is an even longer sentence with more words.</code> </p><p>your result is:</p><blockquote><div><p> <tt>0&nbsp; &nbsp; 1&nbsp; 2&nbsp; 3&nbsp; &nbsp; &nbsp;4&nbsp; &nbsp; &nbsp; 5&nbsp; &nbsp; &nbsp; &nbsp; 6&nbsp; &nbsp; 7&nbsp; &nbsp; 8<br>This is a&nbsp; short NUL&nbsp; &nbsp; sentence NUL&nbsp; NUL&nbsp; NUL<br>This is an even&nbsp; longer sentence with more words.</tt></p></div> </blockquote><p>(Note that this is an illustration only; in practice, these two strings would not be sent <i>in toto</i> to the subroutine, because Algorithm::Diff would only report a "changed" chunk for the substrings "a short" and "an even longer.")</p><p>The subroutine will detect a fuzzy match between "a" and "an" in column 2, and add the Word object "a" to the list of "fuzzymatch"es attached to the Word object "an". It will find no similarity between the words "short" and "even" in column 3, so will add the Word object for "even" to the list of variants attached to the Word object "short". It will pad the remaining empty spaces with an empty Word; the empty Word is never linked to anything. All "fuzzymatch" and "variant" linkage relations work from top to bottom; that is, given two texts, the first text always contains the links.</p><p>The top-to-bottom association of links becomes important when more than two texts are compared. To begin the next comparison, I call the &amp;generate_base subroutine on the two texts whose comparison has just finished. This subroutine is fairly simple; it looks for the topmost non-NUL word in all the arrays it has been passed. In our example above, the new base text generated would be</p><p> <code>This is a short longer sentence with more words.</code> </p><p>Semantically useless, but a good way to generate pegs on which to hang words. The word "a" has a fuzzymatch link to "an", and the word "short" has a variant link to "even". All the identical words that occur in the same column are also linked. This newly generated "base" becomes the topmost text in the next comparison.</p><p>At the end of the run, then, we have an array of Words to represent each of the manuscript texts we passed in. The arrays are padded with empty (NUL) words where necessary, so that all the arrays are the same length, and all their same / similar words are aligned in the same row.
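</p><p>(For the algorithmically inclined, the comparison loop itself looks roughly like the sketch below. The Word method names, the shared empty Word, and the return convention of &amp;align_and_match_words are stand-ins rather than my real code; the hunk-by-hunk interface is Algorithm::Diff's own.)</p><p><code>use Algorithm::Diff;<br><br>my $EMPTY = MCE::Word-&gt;new( empty =&gt; 1 );&nbsp; # the single shared empty Word<br><br># $base and $other are arrayrefs of MCE::Word objects for two texts<br>sub align_pair {<br>&nbsp; &nbsp; my ( $base, $other ) = @_;<br>&nbsp; &nbsp; my ( @new_base, @new_other );<br>&nbsp; &nbsp; my $diff = Algorithm::Diff-&gt;new(<br>&nbsp; &nbsp; &nbsp; &nbsp; [ map { $_-&gt;word } @$base ],<br>&nbsp; &nbsp; &nbsp; &nbsp; [ map { $_-&gt;word } @$other ],<br>&nbsp; &nbsp; );<br>&nbsp; &nbsp; while ( $diff-&gt;Next ) {<br>&nbsp; &nbsp; &nbsp; &nbsp; my @b = @{$base}[ $diff-&gt;Range(1) ];<br>&nbsp; &nbsp; &nbsp; &nbsp; my @o = @{$other}[ $diff-&gt;Range(2) ];<br>&nbsp; &nbsp; &nbsp; &nbsp; if ( $diff-&gt;Same ) {<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # identical chunks: link the word pairs and pass them through<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $b[$_]-&gt;add_match( $o[$_] ) foreach ( 0 .. $#b );<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_base, @b;<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_other, @o;<br>&nbsp; &nbsp; &nbsp; &nbsp; } elsif ( !@o ) {&nbsp; # deletion: pad the second text with empty Words<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_base, @b;<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_other, ( $EMPTY ) x scalar @b;<br>&nbsp; &nbsp; &nbsp; &nbsp; } elsif ( !@b ) {&nbsp; # addition: pad the first text instead<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_base, ( $EMPTY ) x scalar @o;<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_other, @o;<br>&nbsp; &nbsp; &nbsp; &nbsp; } else {&nbsp; # changed chunks: let the fuzzy matcher sort them out<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; my ( $mb, $mo ) = align_and_match_words( \@b, \@o );<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_base, @$mb;<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; push @new_other, @$mo;<br>&nbsp; &nbsp; &nbsp; &nbsp; }<br>&nbsp; &nbsp; }<br>&nbsp; &nbsp; return ( \@new_base, \@new_other );<br>}</code></p><p>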
If the user calls &amp;generate_base on the complete set of result arrays, he/she will have an array of non-NUL words, each of which contains the appropriate links to non-NUL words in the manuscripts that come after it. And then the real work can start.</p><p>In the next few posts, I will say more about the concept of a "variant", talk about the structure of Word objects, and discuss the as-yet unsolved problem of punctuation.</p> aurum 2008-08-21T13:49:45+00:00 journal post-YAPC: more on the Armenian Difference Engine http://use.perl.org/~aurum/journal/37229?from=rss <p>(I did promise a more cheerful post, and at the moment, my alternative to writing this is to read more articles about the 15th-century eastern Mediterranean, and try to come up with a research proposal, whose success or failure will determine whether I get to continue to have an academic career. No pressure or anything.)</p><p>So YAPC this year was a blast, just like last year. I spent much of Day 1 vaguely out of sorts as I generally do ("who do I want to say hello to and haven't found yet? What am I missing? What are the cool kids doing that no one has told me about?" etc.) but that had thankfully passed by the end of the day. Day 2 was overshadowed by rising worry about my presentation ("will I remember what I want to say? Will I run embarrassingly over? Will it make any sense?") which subsided for the duration of the conference dinner and came back in force on the morning of day 3. The presentation went well though; I got a lot of compliments and a few good questions, and was mostly unable to eat my lunch due to being kept talking about various things. I'll take that last as a sign of success.</p><p>What I will probably do here in this journal, over the next little while, is go over the data model and algorithm model of my collation engine&#8212;basically all the technical bits of my project that were left out of my presentation for being heavy on explanation and light on humor. Some of you will probably have better ideas than I would about the various data models, and many of you will have opinions about DB design (since I haven't yet implemented data persistence.) Lots of you will be way better at algorithms than I am. With any luck I can even get a few of you to volunteer opinions on user interfaces, which is the next best thing to releasing the core module on CPAN and letting someone else write a UI.</p><p>The "Armenian studies" version of this presentation will be on 11 September in Paris, so I'm hoping to make a little more progress on the code before then, though I'll still need time to make the presentation a little more palatable to a very different audience. (Really sorry, domm, but the yak will have to go. They just wouldn't get it.)</p> aurum 2008-08-19T15:26:35+00:00 journal hello world http://use.perl.org/~aurum/journal/36216?from=rss <p>So it turns out that <a href="/~nicholas">Nicholas</a> keeps posting things here to which I want to comment; thus, I finally had to create an account here. Hello, world.</p><p>I've just come back from a trip that included visits to perl mongers in Vienna and Amsterdam; I blogged things about that trip <a href="http://www.eccentricity.org/news/2008/04/">elsewhere</a>.</p><p>The real reason I'm posting, however, is to talk about (natural) languages, which has become something of a habit for me. In <a href="http://use.perl.org/~nicholas/journal/36195">this post</a>, Nicholas was looking for a Latinate word for the (bad, evil, wrong, no one should do it ever, etc.)
act of killing a kitten. Now many people know that a cat is a Felis domesticus, so cat-killing would be "felicide". Damian pointed out, and I concur, that the obvious word for kitten would be "felis" (or "feles", the proper nominative spelling) with a diminutive suffix, which gives "feliculus". All this would give us "feliculicide", which works, but is kind of an unwieldy mouthful.</p><p>But it turns out that there is another word in Latin that can mean cat: coincidentally enough, that word is "catus" or "cattus". This word (along with its Greek cousin "&#954;&#940;&#964;&#964;&#945;") has a direct (though not necessarily causative&#8212;the source for this word is very hard to pin down) relationship to the German "Katze" and the English "cat"; Cassell's English -&gt; Latin section actually gives "catulus" for "kitten". On the other hand, Lewis &amp; Short think that "cattus" is a word for an unknown sort of animal, and that although "catulus" can be the young of any animal, it especially refers to puppies. But "catulicide" is easier for English-speakers to say than "feliculicide", with its uncomfortable extra syllable.</p><p>Really, I think this is just an example of Latin as an evolving language. Lewis &amp; Short aren't particularly keen on admitting as proper Latin anything that wasn't attested before the end of the Roman Republic, or thereabouts. I think it rather likely, though I couldn't tell everything I needed to from the Thesaurus Linguae Latinae, that "cattus" was the usual medieval word for "cat", and that "catulicide" (the word, not the act itself, mind) will serve just fine.</p> aurum 2008-04-22T13:25:17+00:00 journal