
All the Perl that's Practical to Extract and Report

use Perl

http://www.eccentricity.org/

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Sunday August 24, 2008
06:11 PM

variations on "variants"

[ #37265 ]

Now that I've explained this much of my design, I am going to have to apologize for confusing my readers, because I'm about to overload a term. (This just goes to show how terrible I am at nomenclature when programming.)

Anyone who has ever looked at a critical edition of any text will see the word variant tossed about casually, alongside the word reading, under the assumption that both of these words are self-explanatory. I have told the manuscript collation engine (MCE) that a "variant" is any word that is not a "match" to its base word. Now I'm going to have to tell the rest of my program something else.

From the point of view of the reader of a critical edition, a "variant" is a chunk of text (be it one word or several) that does not appear in the base text, but appears in the apparatus below the base text. An apparatus is a specially formatted block of footnotes below the main text of an edition. The footnotes encode information about the "variant readings" found in manuscripts. Peter Robinson, who worked on this sort of thing long before I did, defined a "reading" as a "minimally distinctive unit of sense". (ref) A "unit of sense" is usually a word, but could be, say, half of a compound word. On the other hand, when you have several "units of sense" lined up in one sentence in one manuscript, it is most efficient and understandable to present them as a single reading. The "variant" readings, therefore, are the ones that vary from the base text.
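The idea of presenting several adjacent "units of sense" as a single reading can be sketched in a few lines of Python. This is only an illustration, not the MCE's actual code: the `Reading` class and `group_readings` function are hypothetical names, and the sketch assumes the base text and the manuscript have already been aligned word for word into lists of equal length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """A run of adjacent words that differ from the base text (hypothetical type)."""
    start: int           # index of the first base word covered by this reading
    base: List[str]      # the words of the base text at those positions
    witness: List[str]   # the manuscript's words at the same positions


def group_readings(base_words, witness_words):
    """Merge runs of adjacent differing words into single readings.

    Assumes the two lists are already aligned position by position,
    which is a simplification made for this illustration.
    """
    if len(base_words) != len(witness_words):
        raise ValueError("word lists must be aligned to the same length")
    readings = []
    current = None
    for i, (b, w) in enumerate(zip(base_words, witness_words)):
        if b == w:
            current = None   # a matching word closes any open run
            continue
        if current is None:
            current = Reading(start=i, base=[], witness=[])
            readings.append(current)
        current.base.append(b)
        current.witness.append(w)
    return readings
```

Two differing words in a row become one reading; a matching word in between splits them into two. Real collation also has to cope with additions and omissions, which this sketch ignores.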

So you may begin to see the problem. I've been talking about a variant as if it were always a single word, always defined in reference to the first text in which I saw any word at all in that position, and always significantly different from the first word encountered. This is great for a first pass of difference detection, but if I published that information unmodified, my edition would be incomprehensible rubbish.

For me as an editor, therefore, a variant is defined in relation to what I've chosen as the "right" word, and it is any difference at all, within reason. And then I have to define what my bounds of reason are. Those bounds are often defined arbitrarily by the editor. One editor may choose, due to space constraints, to publish only those variants which he judges "substantial". Another may choose to publish all variants that aren't simply orthographic variations. Another may choose to publish all variants that make some sort of grammatical sense, and omit the ungrammatical ones. Editions of the New Testament usually include everything, because minute differences have a huge impact upon theological study. There are a few online editions appearing; they tend to include everything, because space constraints are not an issue.
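One way to picture these editorial "bounds of reason" is as interchangeable filters over the same list of variants. The Python sketch below is hypothetical throughout: the `Variant` fields, the policy names, and `select_for_apparatus` are invented for illustration, and it assumes each variant has already been judged by the editor.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    text: str
    substantial: bool        # the editor's own judgment
    grammatical: bool        # makes some sort of grammatical sense
    orthographic_only: bool  # differs from the chosen word in spelling alone


# Each policy is a predicate deciding whether a variant is published.
POLICIES = {
    "substantial_only": lambda v: v.substantial,
    "no_orthographic": lambda v: not v.orthographic_only,
    "grammatical_only": lambda v: v.grammatical,
    "everything": lambda v: True,
}


def select_for_apparatus(variants, policy_name):
    """Return the variants that the named policy would publish."""
    keep = POLICIES[policy_name]
    return [v for v in variants if keep(v)]
```

Keeping the policy separate from the collation data means the same set of recorded differences could feed a compact print apparatus and a fuller online one.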

What's more, Armenian is an inflected language. The same word can have a different grammatical meaning with a different suffix. The MCE will record the two words as a fuzzy match, but in fact I am going to have to review them and decide whether this "match" represents a sensical variant, a nonsensical (ungrammatical, or misspelled) variant, or simply a variation in orthography.

In fact, the only reason I told the MCE to pay attention to "variants" in the first place is to make my editing job easier in the future. It is useful for me to only have to consider the "similar" words together, and for the computer to reserve the "different" words in the same position for separate consideration. The MCE is only the core of the larger editing program I need, and that editing program must be able to learn from my decisions. That is, if I mark հա՛ոց as an orthographic variation of հայոց in one place, I should not be asked again about that pair of words. This will not only save me a lot of trouble; it will allow me to construct a more consistent, and therefore better, edition.
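The "learn from my decisions" requirement amounts to remembering a verdict per pair of words. A minimal Python sketch follows, with hypothetical names (`Verdict`, `DecisionMemory`, and the `ask` callback) and two assumptions: the pair is treated as unordered, and one verdict per pair holds everywhere in the text. A real editing program might need to allow exceptions by context.

```python
from enum import Enum


class Verdict(Enum):
    SENSICAL = "sensical variant"
    NONSENSICAL = "nonsensical variant"
    ORTHOGRAPHIC = "orthographic variation"


class DecisionMemory:
    """Remembers the editor's verdict for each pair of words (hypothetical)."""

    def __init__(self):
        self._decisions = {}

    def classify(self, word_a, word_b, ask):
        """Return the stored verdict, calling ask() only for an unseen pair.

        `ask` is any callable taking the two words and returning a Verdict,
        for example a function that prompts the editor.
        """
        key = frozenset((word_a, word_b))   # unordered pair: an assumption
        if key not in self._decisions:
            self._decisions[key] = ask(word_a, word_b)
        return self._decisions[key]
```

Once հա՛ոց and հայոց have been marked as an orthographic variation, any later call with that pair returns the stored verdict without prompting again. Persisting the dictionary between sessions is left out of the sketch.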
