NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

  • What sort of correlations are you looking for? I was focusing on detecting plagiarism at one point and found that breaking things down by sentence was more useful. As I don't know what you're trying to do, I've no idea if that approach will prove useful.

    • Detecting plagiarism is much more specific than this problem. I want to be able to analyze a document and suggest a handful of other documents that, from their intertextual context at least, appear to discuss similar things. For example, a tutorial about creating homemade pizza dough is probably not very similar to a journal entry about linguistic analysis, but probably is similar to an article discussing different types of pizza ovens.

      I'm trying to answer the question "Do the relevant topics of these documents overlap?"

      • I see. That makes sense. Perhaps a heuristic approach is best as there are few algorithms likely to realize that "July heat wave" and "dog days of summer" might be related, though when the text is long enough idiomatic expressions are likely to come out in the wash.

        My initial thought would be to try to score words in documents. Take the words that appear the most frequently and somehow correlate their frequency in the document with their infrequency in the language. Thus, the least common words in the language which appear most frequently in a document should be strong markers of its topic.
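        A minimal sketch of that weighting idea (essentially TF-IDF; the toy corpus and the helper name `top_words` are made up here for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy corpus: in real use these would be whole documents.
my %docs = (
    pizza_dough => "knead the pizza dough then let the dough rise",
    pizza_ovens => "a wood fired pizza oven heats the pizza quickly",
    linguistics => "the corpus analysis of the language is difficult",
);

# Document frequency: in how many documents does each word appear?
my %df;
for my $text (values %docs) {
    my %seen = map { $_ => 1 } split ' ', lc $text;
    $df{$_}++ for keys %seen;
}

my $n_docs = scalar keys %docs;

# Score words in one document: term frequency times log(N / df),
# so words common here but rare elsewhere score highest.
sub top_words {
    my ($text, $k) = @_;
    my %tf;
    $tf{$_}++ for split ' ', lc $text;
    my %score = map { $_ => $tf{$_} * log($n_docs / $df{$_}) } keys %tf;
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    return @ranked[0 .. $k - 1];
}

my ($top) = top_words($docs{pizza_dough}, 1);
print "$top\n";    # "the" scores zero (appears everywhere); "dough" wins
```

        Words like "the" appear in every document, so log(N/df) is zero and they drop out automatically, which is why this trick doubles as a soft stop-word filter.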

        • In WordNet::Similarity there is an algorithm by Lesk (from a 1986 paper which CiteSeer and Scholar don't seem to find a reference for) which uses something they call gloss overlap.

          What Lesk basically does is what Ovid suggests, except the calculation is performed over WordNet glossary definitions rather than entire documents.
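          The core of gloss overlap is just counting words that two definitions share. A sketch (the glosses below are invented; the real ones would come from WordNet):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical glosses, standing in for WordNet definitions.
my $gloss_1 = "a large natural stream of water flowing to the sea";
my $gloss_2 = "a natural body of running water flowing toward the sea";

# Lesk-style gloss overlap: count tokens of one gloss that
# also occur in the other. More overlap = more related senses.
sub overlap {
    my ($g1, $g2) = @_;
    my %in_first = map { $_ => 1 } split ' ', lc $g1;
    my $n = grep { $in_first{$_} } split ' ', lc $g2;
    return $n;
}

print overlap($gloss_1, $gloss_2), "\n";    # prints 7
```

          The real Lesk measure in WordNet::Similarity also scores overlapping multi-word phrases more highly than single words, which this sketch skips.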

        • Your initial thought is essentially TF-IDF weighting; the skewed word-frequency distribution it exploits is known as Zipf's law in information retrieval.
      • So, something similar to Amazon's "statistically improbable phrases"?
  • Perhaps Latent Semantic Analysis / Contextual Network Graphs are the hammer you're looking for. I _know_ there was code on CPAN to do this in 2003, but I can't find it.
  • You could use Ted Pedersen's WordNet::Similarity modules. These attach a numerical value to any pair of words and can help you identify which words are related, and how closely. I prefer jcn (the Jiang-Conrath method) myself, but there are 10 different techniques on offer.

    Also, would it not make sense to run a POS (part of speech) tagger before you strip stop words and so on? I can't recommend a Perl-based POS tagger offhand, since most of my work in this area is done in Java... but I'm pretty sure there's something on CPAN.
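    For reference, using the jcn measure looks roughly like this (based on the module's documented synopsis; it needs a local WordNet installation plus the WordNet::QueryData and WordNet::Similarity CPAN modules, and the word senses shown are just examples):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WordNet::QueryData;
use WordNet::Similarity::jcn;

# Requires WordNet installed locally; QueryData finds it via $WNHOME.
my $wn      = WordNet::QueryData->new;
my $measure = WordNet::Similarity::jcn->new($wn);

# Word senses are written as word#part-of-speech#sense-number.
my $value = $measure->getRelatedness('car#n#1', 'bus#n#2');

my ($err, $errstr) = $measure->getError;
die $errstr if $err;

print "car/bus jcn relatedness: $value\n";
```

    The `word#n#1` sense notation is the catch: you have to disambiguate to a specific sense first (or try all sense pairs and take the maximum), which is where a POS tagger earlier in the pipeline pays off.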

  • There are many solutions in information retrieval. You can compute a cosine measure between an interesting document and each document of the corpus, or train a Bayesian classifier to categorize your documents. Some references:

    - Information Retrieval / C. J. van Rijsbergen (specifically the 3rd chapter)
    - Bayesian Analysis For RSS Reading / Simon Cozens, The Perl Journal, March 2004
    - Building a
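    The cosine measure is small enough to sketch directly (bag-of-words vectors; the example texts are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a bag-of-words term-frequency vector from a text.
sub bow {
    my ($text) = @_;
    my %tf;
    $tf{$_}++ for split ' ', lc $text;
    return \%tf;
}

# Cosine similarity: dot product over the product of vector lengths.
# 1.0 means identical word distributions, 0.0 means no shared words.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return 0 unless $nu && $nv;
    return $dot / (sqrt($nu) * sqrt($nv));
}

my $query = bow("pizza oven temperature for pizza dough");
my $doc_a = bow("baking pizza dough in a hot pizza oven");
my $doc_b = bow("a journal of linguistic analysis");

printf "oven article:       %.3f\n", cosine($query, $doc_a);
printf "linguistics journal: %.3f\n", cosine($query, $doc_b);
```

    In practice you would weight the vectors with TF-IDF rather than raw counts before taking the cosine, so that shared stop words don't inflate every score.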