Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
a couple of suggestions (Score:1)
You could use Ted Pedersen's WordNet::Similarity modules. They assign a numerical value to any pair of words, which can help you identify which words are related and how closely. I prefer jcn (the Jiang-Conrath measure) myself, but there are 10 different techniques on offer.
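To give a feel for what a measure like jcn computes, here is a minimal sketch in Python. It is not the WordNet::Similarity API: the taxonomy, the counts, and every function name below are made up for illustration. It assumes the usual Jiang-Conrath formulation, where the distance between two concepts is IC(a) + IC(b) - 2 * IC(lcs), IC is the information content -log(p) of a concept, lcs is their most specific shared ancestor, and the relatedness score is the inverse of that distance.

```python
import math

# Hypothetical toy taxonomy (child -> parent). Real measures use WordNet.
PARENT = {
    "car": "vehicle",
    "bus": "vehicle",
    "vehicle": "entity",
    "apple": "fruit",
    "fruit": "entity",
}

# Hypothetical corpus counts. Each count includes the concept's descendants,
# so the root count is the total number of observations.
COUNT = {"entity": 100, "vehicle": 40, "car": 25, "bus": 15, "fruit": 20, "apple": 20}
TOTAL = COUNT["entity"]


def information_content(concept):
    """IC(c) = -log p(c); rarer (more specific) concepts carry more information."""
    return -math.log(COUNT[concept] / TOTAL)


def ancestors(concept):
    """Return the concept followed by its ancestors, up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lcs(a, b):
    """Most specific concept that subsumes both a and b."""
    b_chain = set(ancestors(b))
    for candidate in ancestors(a):  # walks upward from a, so the first hit is the deepest
        if candidate in b_chain:
            return candidate
    raise ValueError("concepts share no ancestor")


def jcn_similarity(a, b):
    """Inverse Jiang-Conrath distance; identical concepts map to infinity in this sketch."""
    distance = (
        information_content(a)
        + information_content(b)
        - 2 * information_content(lcs(a, b))
    )
    return math.inf if distance == 0 else 1.0 / distance


print(jcn_similarity("car", "bus"))
print(jcn_similarity("car", "apple"))
```

With these toy numbers, car and bus meet at vehicle and score higher than car and apple, which only meet at the root.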
Also, would it not make sense to run a POS (part-of-speech) tagger before you strip stop words and so on? I can't recommend a Perl-based POS tagger offhand, since most of my work in this area is done in Java, but I'm pretty sure they must exist. Do a POS tag first (to find the noun, verb, and adjective contexts within each sentence), then do what you're doing now. That way you get both the word and its part-of-speech tag. For example, "like" has at least seven different contexts in which it may be used, ranging from verb to adjective; "fling" could be either a noun or a verb; and so on. It might give you a bit more granularity to work with.
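A rough sketch of that ordering, in Python with a deliberately tiny rule-based tagger. Everything here (the lexicon, the stop list, the one-rule disambiguation, the function names) is a made-up stand-in for a real POS tagger; the only point is that tagging happens while the stop words are still there to provide context.

```python
# Hypothetical toy lexicon and stop list; a real tagger would replace all of this.
LEXICON = {
    "they": "PRON",
    "it": "PRON",
    "was": "VERB",
    "the": "DET",
    "a": "DET",
    "ball": "NOUN",
}
AMBIGUOUS = {"fling", "like"}  # words whose tag depends on context
STOP_WORDS = {"they", "it", "was", "the", "a"}


def tag(tokens):
    """Tag each token, using the previous tag to resolve ambiguous words."""
    tagged = []
    for word in tokens:
        if word in AMBIGUOUS:
            prev_tag = tagged[-1][1] if tagged else None
            # Toy rule: after a determiner it is a noun, otherwise a verb.
            pos = "NOUN" if prev_tag == "DET" else "VERB"
        else:
            pos = LEXICON.get(word, "NOUN")  # unknown words default to NOUN
        tagged.append((word, pos))
    return tagged


def tag_then_filter(sentence):
    """Tag the full sentence first, then drop stop words from the tagged pairs."""
    tokens = sentence.lower().split()
    return [(word, pos) for word, pos in tag(tokens) if word not in STOP_WORDS]


print(tag_then_filter("They fling the ball"))
print(tag_then_filter("It was a fling"))
```

If the stop words were removed first, the "a" in front of "fling" would already be gone, and this toy tagger would have nothing left to tell the noun from the verb.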