chromatic (983)

Journal of chromatic (983)

Friday September 02, 2005
05:12 PM

Secrets of Contextual Analysis

[ #26579 ]

I'm analyzing the content of some documents in order to find potential correlations between them. Breaking each document into individual words, stemming those words, and throwing out the stopwords gave me some 18,000 unique words from a 600-document corpus, with over 40% of words appearing only once in the corpus and almost 80% of the words appearing fewer than ten times.
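
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the post: the stopword set, the `naive_stem` helper (a crude stand-in for a real stemmer such as Porter's), and the `vocabulary_profile` function are all hypothetical names invented for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list actually used in the post is not shown.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def naive_stem(word):
    """Crude suffix stripper, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into alphabetic words, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


def vocabulary_profile(documents, rare_below=10):
    """Count unique terms and the share that occur once or fewer than rare_below times."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    unique = len(counts)
    if unique == 0:
        return {"unique": 0, "once": 0.0, "rare": 0.0}
    once = sum(1 for c in counts.values() if c == 1)
    rare = sum(1 for c in counts.values() if c < rare_below)
    return {"unique": unique, "once": once / unique, "rare": rare / unique}


docs = ["Baking pizza dough", "Pizza ovens bake pizza"]
profile = vocabulary_profile(docs)
```

On a real corpus, the `once` and `rare` fractions correspond to the "appearing only once" and "fewer than ten times" figures quoted above.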

I knew my existing list of stop words was insufficient, but I really don't want to pick out the top 1000 or 2000 useful words from a list of 18,000, especially because this is a test corpus of perhaps 7% of the actual corpus.

Now I start to wonder if some of the lexical analysis modules would be useful in picking out only the nouns (unstemmed) and verbs (stemmed) from a document, rather than taking all of the words of a document as significant. The correlation algorithm appears sound, but if I can throw out lots of irrelevant data, I can improve the performance and utility of the application.

Any thoughts?

  • What sort of correlations are you looking for? I was focusing on detecting plagiarism [] at one point and found that breaking things down by sentence was more useful. As I don't know what you're trying to do, I've no idea if that link will prove useful.

    • Detecting plagiarism is much more specific than this problem. I want to be able to analyze a document and suggest a handful of other documents that, from their intertextual context at least, appear to discuss similar things. For example, a tutorial about creating homemade pizza dough is probably not very similar to a journal entry about linguistic analysis, but probably is similar to an article discussing different types of pizza ovens.

      I'm trying to answer the question "Do the relevant topics of these

      • I see. That makes sense. Perhaps a heuristic approach is best as there are few algorithms likely to realize that "July heat wave" and "dog days of summer" might be related, though when the text is long enough idiomatic expressions are likely to come out in the wash.

        My initial thought would be to try to score words in documents. Take the words that appear the most frequently and somehow correlate their frequency in the document by their infrequency in the language. Thus, the least common words which ap

        • In WordNet::Similarity there is an algorithm by Lesk (from a 1986 paper that Citeseer and Scholar don't seem to have a reference for), which uses something they call gloss overlap.

          What Lesk basically does is what Ovid suggests, except that the calculation is performed on WordNet glossary definitions rather than on entire documents.

        • Your initial thought is called Zipf's law in information retrieval.
      • So, something similar to Amazon's "statistically improbable phrases"?
  • Perhaps Latent Semantic Analysis / Contextual Network Graphs are the hammer you're looking for: [] [] I _know_ there was code on CPAN to do this in 2003, but I can't find it.
  • You could use Ted Pedersen's WordNet::Similarity modules. These attach a numerical value to any pair of words and can help you identify which words are related, and how closely. I prefer jcn (the Jiang-Conrath method) myself, but there are 10 different techniques on offer.

    Also, would it not make sense to use a POS (part-of-speech) tagger before you filter out stop words and so on? I can't recommend a Perl-based POS tagger offhand, since most of my work in this area is done in Java... but I'm pretty sure the

  • There are many solutions in information retrieval. You can compute a cosine measure between an interesting document and each document of the corpus. You can also train a Bayesian network to categorize your documents. Some references:
    - Information Retrieval / C. J. van Rijsbergen: [] (and specifically the 3rd chapter: [])
    - Bayesian Analysis For RSS Reading / Simon Cozens, in The Perl Journal, March 2004
    - Building a
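
Two of the suggestions in this thread combine naturally: weighting a word's frequency in a document by its rarity elsewhere (a scheme commonly known as TF-IDF) and comparing documents with a cosine measure. The sketch below, in Python, is only an illustration under stated assumptions. It uses rarity within the corpus as a stand-in for "infrequency in the language", does no stemming or stopword removal, and the function names `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic words (no stemming or stopwords here)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # A term found in every document gets weight log(1) = 0.
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, documents, top=3):
    """Return indices of the documents closest to documents[index]."""
    vectors = tfidf_vectors(documents)
    scores = [(cosine(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scores, reverse=True)[:top]]


docs = [
    "homemade pizza dough recipe",
    "pizza ovens for homemade pizza",
    "linguistic analysis of journal entries",
]
vectors = tfidf_vectors(docs)
closest = most_similar(0, docs, top=1)
```

With these three toy documents, the pizza dough text shares weighted terms only with the pizza oven text, so that is the one suggested. Because terms that occur in every document receive zero weight, this weighting also acts as a rough automatic stopword filter.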