How can I get the list of unique terms from a specific field in Lucene?
You're looking for term vectors (the set of all the words that occurred in the field and the number of times each word was used, excluding stop words). Call IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with the terms. Note that this only works if the field was indexed with term vectors stored; otherwise getTermFreqVector returns null.
The alternative would be to use terms() and pick only terms for the field you're interested in:
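IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms(); // enumerates the terms of every field
Set<String> uniqueTerms = new HashSet<String>();
while (terms.next()) {
    final Term term = terms.term();
    // keep only the terms that belong to the field of interest
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
terms.close();
reader.close();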
This is not the optimal solution, since you read and then discard the terms of all the other fields. In Lucene 4 there is a class, Fields, whose terms(field) method returns the terms of a single field only.
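To make the difference between the two approaches concrete, here is a toy sketch in plain Java. It does not use Lucene at all: ALL_TERMS, byScanning, groupByField and byLookup are hypothetical names invented for illustration, and the map of sorted sets only loosely mimics a per-field lookup. It shows that filtering a full enumeration and looking a field up directly give the same terms, while the lookup never touches the other fields.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class PerFieldTermsSketch {
    // Hypothetical stand-in for an index: {field, text} pairs. Not a Lucene type.
    static final String[][] ALL_TERMS = {
        {"body", "lucene"}, {"body", "search"},
        {"title", "index"}, {"title", "lucene"}
    };

    // Approach 1: visit every term of every field and keep the matching ones.
    static Set<String> byScanning(String field) {
        Set<String> out = new TreeSet<String>();
        for (String[] t : ALL_TERMS) {
            if (t[0].equals(field)) {
                out.add(t[1]);
            }
        }
        return out;
    }

    // Group the terms by field once, so a field can be looked up directly.
    static Map<String, SortedSet<String>> groupByField() {
        Map<String, SortedSet<String>> fields = new HashMap<String, SortedSet<String>>();
        for (String[] t : ALL_TERMS) {
            SortedSet<String> terms = fields.get(t[0]);
            if (terms == null) {
                terms = new TreeSet<String>();
                fields.put(t[0], terms);
            }
            terms.add(t[1]);
        }
        return fields;
    }

    // Approach 2: direct per-field lookup; an unknown field yields an empty set.
    static Set<String> byLookup(Map<String, SortedSet<String>> fields, String field) {
        SortedSet<String> terms = fields.get(field);
        return terms == null ? Collections.<String>emptySet() : terms;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<String>> fields = groupByField();
        System.out.println(byScanning("title"));
        System.out.println(byLookup(fields, "title"));
        System.out.println(byLookup(fields, "missing"));
    }
}
```

In the sketch a missing field is mapped to an empty set rather than null, which is a design choice of the toy model and not a statement about what Lucene returns.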
If you are using the Lucene 4.0 API, you need to get the Fields from the index reader. Fields then gives you access to the terms of each field in the index. Here is an example of how to do that:
Fields fields = MultiFields.getFields(indexReader);
Terms terms = fields.terms("field"); // null if the field does not exist
TermsEnum iterator = terms.iterator(null);
BytesRef byteRef = null;
// next() returns null once all terms of the field have been visited
while ((byteRef = iterator.next()) != null) {
    String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
}
In newer versions of Lucene, you can get the string from the BytesRef by calling:
byteRef.utf8ToString();
instead of
new String(byteRef.bytes, byteRef.offset, byteRef.length);
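One reason to prefer an explicit UTF-8 decode is that the three-argument String constructor uses the platform's default charset, which is not guaranteed to be UTF-8. The following plain-Java sketch illustrates decoding a slice of a larger byte buffer as UTF-8. It does not depend on Lucene; Utf8SliceSketch and decodeUtf8 are hypothetical names, and the buffer layout is only an assumption meant to resemble a bytes/offset/length triple.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SliceSketch {
    // Decodes bytes[offset .. offset + length) as UTF-8,
    // independently of the platform default charset.
    static String decodeUtf8(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" needs 5 bytes in UTF-8 because the last character takes two.
        byte[] term = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Place the term in the middle of a larger buffer, so that
        // offset and length matter (hypothetical layout).
        byte[] buffer = new byte[term.length + 4];
        System.arraycopy(term, 0, buffer, 2, term.length);

        System.out.println(decodeUtf8(buffer, 2, term.length));
    }
}
```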
If you want the document frequency of the current term, you can call:
int docFreq = iterator.docFreq();
The same result, just a little cleaner, can be had with the LuceneDictionary in the lucene-suggest package. It handles a field that does not contain any terms by returning BytesRefIterator.EMPTY. That will save you an NPE :)
LuceneDictionary ld = new LuceneDictionary( indexReader, "field" );
BytesRefIterator iterator = ld.getWordsIterator();
BytesRef byteRef = null;
while ( ( byteRef = iterator.next() ) != null )
{
    String term = byteRef.utf8ToString();
}