Stories
Slash Boxes
Comments
NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
 Full
 Abbreviated
 Hidden
More | Login | Reply
Loading... please wait.
  • What sort of correlations are you looking for? I was focusing on detecting plagiarism [perlmonks.org] at one point and found that breaking things down by sentence was more useful. As I don't know what you're trying to do, I've no idea if that link will prove useful.

    • Detecting plagiarism is much more specific than this problem. I want to be able to analyze a document and suggest a handful of other documents that, from their intertextual context at least, appear to discuss similar things. For example, a tutorial about creating homemade pizza dough is probably not very similar to a journal entry about linguistic analysis, but probably is similar to an article discussing different types of pizza ovens.

      I'm trying to answer the question "Do the relevant topics of these

      • I see. That makes sense. Perhaps a heuristic approach is best as there are few algorithms likely to realize that "July heat wave" and "dog days of summer" might be related, though when the text is long enough idiomatic expressions are likely to come out in the wash.

        My initial thought would be to try to score words in documents. Take the words that appear the most frequently and somehow correlate their frequency in the document by their infrequency in the language. Thus, the least common words which ap

        • From WordNet::Similarity.. there is an algorithm in there by Lesk (from a paper in 1986 which Citeseer and Scholar don't seem to find a reference for); which uses something they call gloss overlap.

          What Lesk basically does is what Ovid suggests, except the calculation is performed for Wordnet glossary definitions and not for entire documents.