aurum (8572)

Ex-Akamaite, ex-Goldmanite. Currently working on Ph.D. in Armenian/Byzantine history at Oxford. Spends more time these days deciphering squiggly characters than spaghetti code. Thinks that UTF-8 is the best thing since sliced bread.

Journal of aurum (8572)

Saturday August 23, 2008
06:33 PM

more on word relationships

[ #37261 ]

It was brought to my attention in a comment on my last post that I didn't do a very good job describing the relationships between words that I create. I'll try to fix this here.

It's really difficult to construct good examples in English, incidentally; we don't have a lot of prefixes or suffixes or case endings, so pretend for the moment that the samples I give in this post are all grammatically valid. (Don't make me break out the lolcat.) That said, given an example set of texts:

Tara has a lot of books about languages.
Tara had alot book to do with languages.
Tera got a lot of book too do with languages.

the collator would line them up thus, as I described previously:

   0    1   2 3    4  5     6     7  8    9
A) Tara has a lot  of books about         languages.
B) Tara had   alot    book  to    do with languages.
C) Tera got a lot  of book  too   do with languages.

The base text generated from this would then be:

Tara has a lot of books about do with languages.
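
As a rough sketch of how that base text falls out of the alignment: the snippet below assumes each text is stored as a list with one slot per column, with None for an empty slot, and that the base word is simply the first non-empty reading from the top. The names ROWS and base_text are made up for this illustration and are not taken from the collator itself.

```python
# Aligned rows from the example above; None marks an empty column.
# (Hypothetical representation, chosen only for this sketch.)
ROWS = {
    "A": ["Tara", "has", "a", "lot", "of", "books", "about", None, None, "languages."],
    "B": ["Tara", "had", None, "alot", None, "book", "to", "do", "with", "languages."],
    "C": ["Tera", "got", "a", "lot", "of", "book", "too", "do", "with", "languages."],
}


def base_text(rows):
    """Take the topmost non-empty word in each column."""
    words = []
    for column in zip(*rows.values()):
        top = next((word for word in column if word is not None), None)
        if top is not None:
            words.append(top)
    return words


print(" ".join(base_text(ROWS)))
# prints: Tara has a lot of books about do with languages.
```

Columns 7 and 8 are empty in text A, so "do" and "with" are picked up from text B, which is why the base text reads a little oddly.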

Since each word in the base text is the topmost word in its column, it is this word that carries the linkage information for all the other words in that column. So for this base text we would have:

Tara  -> FUZZYMATCH: Tera
has   -> FUZZYMATCH: had
      -> VARIANT: got
lot   -> FUZZYMATCH: alot
books -> FUZZYMATCH: book
about -> VARIANT: to

This does not, however, list every unique word that appears in every column of the texts above. For that, I also need to record the relationship between "to" and "too" in column 6. When the collator finds "too" and fails to match it with "about", it will look through the list of variants attached to "about", find "to", and add "too" as a FUZZYMATCH for it. So the relevant snippet of the data structure becomes:

about -> VARIANT: to
                  -> FUZZYMATCH: too
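
To make the lookup order concrete, here is a minimal sketch of that linking step. Everything in it is an assumption made for illustration: the Word class, the function names, and above all the fuzzy test, which uses difflib's similarity ratio with an arbitrary 0.6 cutoff as a stand-in for whatever matching rule the collator really applies.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.6  # assumed cutoff; the post does not give one


@dataclass
class Word:
    text: str
    links: list = field(default_factory=list)  # (relation, Word) pairs


def is_fuzzy(a, b):
    # Stand-in similarity test, not the collator's actual rule.
    return SequenceMatcher(None, a, b).ratio() >= FUZZY_THRESHOLD


def add_link(word, relation, text):
    # Record each distinct reading only once.
    if all(other.text != text for _, other in word.links):
        word.links.append((relation, Word(text)))


def attach(base, text):
    """Link one reading to the base word, or to one of its variants."""
    if text == base.text:
        return
    if is_fuzzy(base.text, text):
        add_link(base, "FUZZYMATCH", text)
        return
    # No match with the base word: try the variants already attached to it.
    for relation, variant in base.links:
        if relation != "VARIANT":
            continue
        if text == variant.text:
            return
        if is_fuzzy(variant.text, text):
            add_link(variant, "FUZZYMATCH", text)
            return
    add_link(base, "VARIANT", text)


def link_column(column):
    """The first non-empty reading is the base word; link the rest to it."""
    readings = [word for word in column if word is not None]
    base = Word(readings[0])
    for text in readings[1:]:
        attach(base, text)
    return base


def render(word, indent=0):
    lines = []
    for relation, other in word.links:
        lines.append(f"{' ' * indent}{word.text} -> {relation}: {other.text}")
        lines.extend(render(other, indent + 4))
    return lines


column_6 = ["about", "to", "too"]  # readings from texts A, B, C
print("\n".join(render(link_column(column_6))))
```

For column 6 this records "to" as a VARIANT of "about" and then hangs "too" off "to" as a FUZZYMATCH, which is the nesting described above. A different similarity measure or cutoff could classify borderline pairs differently.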

I appear to have been waylaid by a cat, and anyway I've taken up a lot of screen space by drawing out data structures, so I'll continue tomorrow.
