I'm analyzing the content of some documents in order to find potential correlations between them. Breaking each document into individual words, stemming those words, and throwing out the stopwords gave me some 18,000 unique words from a 600-document corpus, with over 40% of those words appearing only once and almost 80% appearing fewer than ten times.
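For concreteness, here's a minimal sketch of that pipeline in Python using NLTK; the choice of NLTK, and the `preprocess` and `vocabulary_stats` names, are illustrative assumptions, not the actual code behind these numbers:

```python
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# NLTK needs its data packages fetched once:
# nltk.download("punkt"); nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, drop stopwords and non-words, then stem."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens
            if t.isalpha() and t not in STOPWORDS]

def vocabulary_stats(documents):
    """Count stem frequencies across the corpus and summarize the rare tail."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    total = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    rare = sum(1 for c in counts.values() if c < 10)
    print(f"{total} unique stems; {once / total:.0%} appear once; "
          f"{rare / total:.0%} appear fewer than ten times")
    return counts
```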
I knew my existing stopword list was insufficient, but I really don't want to pick out the top 1,000 or 2,000 useful words from a list of 18,000 by hand, especially because this test corpus is only perhaps 7% of the actual corpus.
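One alternative to hand-picking, sketched here under the assumption that corpus frequency is a decent proxy for usefulness, is to prune the vocabulary automatically from the counts gathered above, dropping rare stems and capping the result:

```python
def prune_vocabulary(counts, min_count=10, top_n=2000):
    """Keep stems seen at least min_count times, capped at the top_n most common."""
    frequent = [stem for stem, c in counts.most_common() if c >= min_count]
    return set(frequent[:top_n])
```

The catch, given the 7% sample, is that a threshold tuned on the test corpus may not hold once the full corpus arrives.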
Now I'm starting to wonder if some of the lexical analysis modules could pick out only the nouns (unstemmed) and verbs (stemmed) from a document, rather than treating every word in a document as significant. The correlation algorithm appears sound, but if I can throw out the bulk of the irrelevant data, I can improve both the performance and the utility of the application.
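Here's what that filter might look like with NLTK's part-of-speech tagger (again an assumption about tooling, not a module I've committed to); the Penn Treebank tags NN* and VB* cover the nouns and verbs:

```python
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def significant_terms(text):
    """Return unstemmed nouns and stemmed verbs, discarding everything else."""
    # pos_tag needs nltk.download("averaged_perceptron_tagger") once
    terms = []
    for word, tag in pos_tag(word_tokenize(text)):
        if tag.startswith("NN"):    # NN, NNS, NNP, NNPS: keep nouns unstemmed
            terms.append(word.lower())
        elif tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ: stem verbs
            terms.append(stemmer.stem(word.lower()))
    return terms
```

Everything that isn't a noun or a verb simply never enters the vocabulary, which should shrink that 18,000-word list considerably before the correlation step runs.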