Thursday July 29, 2010
Stupid Lucene Tricks: Document Frequencies and NOT
- You can get the document frequency of a term (i.e. how many documents contain that term) through Lucene.Net.Index.IndexReader.DocFreq(t As Term) As Integer.
- You can get the IndexReader behind a Lucene.Net.Search.IndexSearcher through IndexSearcher.GetIndexReader().
- If you want to display the document frequencies for the individual keywords of a search, and one of them is a negated (NOT) term (like -antibiotic in the query antimicrobial -antibiotic), you cannot use DocFreq() directly: it counts the documents that contain the term, whereas you want the ones that do not. In that case, the document frequency can be computed as:
DOCFREQ = count of all documents - DocFreq(TERM_NO_NOT)
DOCFREQ = 60227 - DocFreq(New Term("all", "antibiotic"))
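where 60227 is the total number of documents in the index (IndexReader.NumDocs() will give you this), the NOT piece was -antibiotic, and all is the Lucene document field being searched.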
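To make the arithmetic concrete, here is a minimal sketch in Python against a toy in-memory index. It is not Lucene: ToyIndex, num_docs, doc_freq and keyword_doc_freq are made-up names standing in for the IndexReader, its total document count and DocFreq(), and the five documents are invented for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Tiny stand-in for an index reader (hypothetical, not the Lucene API)."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, word) -> ids of docs containing it
        self._doc_ids = set()

    def add(self, doc_id, field, text):
        self._doc_ids.add(doc_id)
        for word in text.lower().split():
            self._postings[(field, word)].add(doc_id)

    def num_docs(self):
        # Total number of documents in the index.
        return len(self._doc_ids)

    def doc_freq(self, field, word):
        # Number of documents that contain the word in the given field.
        return len(self._postings.get((field, word), ()))


def keyword_doc_freq(index, field, keyword):
    """Document frequency for one search keyword; a leading '-' marks a NOT term."""
    if keyword.startswith("-"):
        # NOT term: every document except those that contain the term.
        return index.num_docs() - index.doc_freq(field, keyword[1:])
    return index.doc_freq(field, keyword)


index = ToyIndex()
index.add(1, "all", "antimicrobial antibiotic resistance")
index.add(2, "all", "antimicrobial peptide")
index.add(3, "all", "antibiotic dosage")
index.add(4, "all", "soil chemistry")
index.add(5, "all", "water treatment")

freqs = {kw: keyword_doc_freq(index, "all", kw)
         for kw in "antimicrobial -antibiotic".split()}
print(freqs)  # {'antimicrobial': 2, '-antibiotic': 3}
```

Two of the five toy documents contain antibiotic, so the NOT piece comes out as 5 - 2 = 3. The same subtraction is all the formula above is doing with 60227.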
(Ob. Perl: Although PLucene is now 5 years out of date, Perlesque should eventually let you get at Lucene.NET via a strongly-typed Perl 6.)