aurum (8572)

http://www.eccentricity.org/

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Monday September 01, 2008
10:35 PM

Progress, of sorts

[ #37325 ]

I now have a script which produces output that looks like this. Each capital letter represents a manuscript. (OK, so in real life the words are lined up in columns, but I can't make use.perl play nicely with Unicode characters inside an <ecode> tag, which is the one that would preserve spacing.)


Word variation! Context:
մինչեւ ցայս վայրս բազմաջան եւ եւ աշխատաւոր քննութեամբ գտեալ գրեցաք >> ի զշարագրական գրեալս զհարիւրից ամաց, զորս ի << բազում ժամանակաց հետա հետաքննեալ հասու եղաք։ ընդ այնքանեաց տեսողացն եւ

Base ի զշարագրական գրեալս զհարիւրից ամաց, զորս ի
----
ABH: զշարագրական գրեալս զհարիւրից ամաց զորս ի
G: զշարագրական գրեալսն հարիւրից ամաց զորս ի
C: ի ժամանակական գրեալս հարիւրից ամացն զորս
J: զշարագրական գրեալս զճից ամաց զոր ի
DFI: զշարագրական գրեալս զճից ամաց զորս ի
E: զշարագրական գրեալս զճ ամաց զորս ի

Of course it doesn't take any input yet. One thing at a time.
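
As a rough illustration of how lines such as "DFI:" in the sample output could be produced, here is a minimal sketch that groups manuscripts by identical reading. It is not the script described in this entry; the names `%reading_of` and `%witnesses_for` are hypothetical, and the readings are copied from the sample output.

```perl
use strict;
use warnings;
use utf8;                             # the source contains literal Armenian text
binmode STDOUT, ':encoding(UTF-8)';

# Readings copied from the sample output, keyed by manuscript siglum.
my %reading_of = (
    D => 'զշարագրական գրեալս զճից ամաց զորս ի',
    E => 'զշարագրական գրեալս զճ ամաց զորս ի',
    F => 'զշարագրական գրեալս զճից ամաց զորս ի',
    I => 'զշարագրական գրեալս զճից ամաց զորս ի',
    J => 'զշարագրական գրեալս զճից ամաց զոր ի',
);

# Invert the mapping: reading => list of sigla that share it.
my %witnesses_for;
for my $siglum (sort keys %reading_of) {
    push @{ $witnesses_for{ $reading_of{$siglum} } }, $siglum;
}

# keys() returns the readings in no meaningful order (an assumption about
# how one might avoid ranking them), so no reading is listed first on purpose.
for my $reading (keys %witnesses_for) {
    printf "%s: %s\n", join('', @{ $witnesses_for{$reading} }), $reading;
}
```

The sigla inside each group are sorted so that the label is stable; only the order of the groups is left to the hash.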

  • I had a long reply using Text::Unidecode here, but use.perl.org *really* doesn't want to format things the way I want it to (half the time it seems to double-encode my unicode, and never do multiline code or pre tags!), so I'll try using words instead of pictures to explain what I'm trying to talk about. First, the easy question: How are those alternate readings sorted? It doesn't seem to be by first ms with that reading, nor by number of readings -- is it just hash order? Second, the hard question -- wh
    • Q1) Alternate readings are unsorted. That is intentional - I don't want to inadvertently give priority to the reading in ms A, or the reading with the most words, or anything.

      Q2) Alignment can occur in one of two ways. The first is by a small enough edit distance, as you say - it's how I keep the instances of "zsharagrakan" aligned. That was my "fuzzy match." The second is what is called a "negative variant" - the words aren't alike at all, but they coincide in placement. This is why "zhamanakakan" is
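
The first kind of alignment, matching by a small enough edit distance, might be sketched as follows. This is an illustration only, not the code behind the output above: the function names `edit_distance` and `fuzzy_match` are hypothetical, and the threshold of one third of the longer word's length is an assumption. Negative variants, which align by position and not by similarity, are not covered here.

```perl
use strict;
use warnings;
use utf8;                             # the sample words are literal Armenian text
use List::Util qw(min);

# Levenshtein distance by dynamic programming, keeping one row at a time.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar(@t));
    for my $i (1 .. scalar(@s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Assumed rule: two words align if their distance is at most a third
# of the longer word's length. The threshold is purely illustrative.
sub fuzzy_match {
    my ($s, $t) = @_;
    my $longer = length($s) > length($t) ? length($s) : length($t);
    return edit_distance($s, $t) <= $longer / 3;
}

# Two readings of the same word taken from the sample output (mss ABH and G).
print fuzzy_match("գրեալս", "գրեալսն") ? "aligned\n" : "not aligned\n";
```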

    • half the time it seems to double-encode my unicode

      Put posts and comments through “encode 'us-ascii', $your_post, Encode::HTMLCREF”. That will make them come out as intended.
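
A minimal sketch of that suggestion, assuming the text is already a decoded Perl character string. The variable names are hypothetical, and `Encode::FB_HTMLCREF` is used here as the documented fallback constant: it combines `HTMLCREF` with `LEAVE_SRC`, so the original string is not modified.

```perl
use strict;
use warnings;
use utf8;                       # the sample text is literal Armenian
use Encode qw(encode);

# Hypothetical sample text taken from the journal entry's output.
my $post = "գրեալս զճ ամաց";

# Every character outside US-ASCII becomes a decimal reference (&#NNNN;),
# so the result is safe to paste into a form that only accepts ASCII.
my $ascii = encode('us-ascii', $post, Encode::FB_HTMLCREF);

print "$ascii\n";
```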

      and never do multiline code or pre tags!

      That’s on purpose; Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you; 2. it uses <pre>, so very long lines will wrap

      • Ah. Perlmonks does that too, but it calls the new tag <code>. (Or <c>, for the lazy.)
      • Slashcode has its own special <ecode> tag for that purpose (whose distinguishing features are: 1. you can write raw angle brackets and ampersands inside, and Slash will turn them into entities for you;

        This is the part that doesn't play nicely with UTF-8, actually, although the <ecode> tag is almost always what I want - the Armenian characters get converted into entities upon comment submit, and those entities themselves have their ampersands turned into entities upon ecode conversion.
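
The double encoding described here can be reproduced in a few lines. This is a sketch under stated assumptions, not Slashcode's actual code: step 1 stands in for whatever turns the characters into entities on submit, step 2 stands in for the <ecode> escaping, and `%entity` is a hypothetical helper.

```perl
use strict;
use warnings;
use utf8;                       # the sample text is literal Armenian
use Encode qw(encode);

# Step 1 (stand-in for the submit): non-ASCII characters become &#NNNN;.
my $submitted = encode('us-ascii', "զճ", Encode::FB_HTMLCREF);

# Step 2 (stand-in for <ecode>): raw &, < and > are turned into entities,
# including the ampersands that step 1 has just produced.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
(my $displayed = $submitted) =~ s/([&<>])/$entity{$1}/g;

print "$submitted\n";    # &#1382;&#1395;
print "$displayed\n";    # &amp;#1382;&amp;#1395;
```

A browser renders the second string as the literal text of the references, not as the Armenian letters.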

        • The conversion to entities is your browser’s doing, actually. It sees that the form should be submitted in ISO-Latin1, so it turns all the non-Latin1 characters into entities. Slashcode can’t actually know that you didn’t mean to send them that way. There is therefore no way to get around this.

          All you can do is use plain <code> tags with <br> tags for linebreaks, sequences of &nbsp; for tabs, and manual escaping for ampersands and less-thans. It’s a pain to do manuall