Elasticsearch - How to get popular words list of documents
Doing a simple term aggregation search will meet your needs:
(where mydata
is the name of your field)
curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
"query": {
"match_all" : {}
},
"aggs" : {
"mydata_agg" : {
"terms": {"field" : "mydata"}
}
}
}'
will return:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"mydata_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "aaa",
"doc_count" : 3
}, {
"key" : "fff",
"doc_count" : 3
}, {
"key" : "bbb",
"doc_count" : 2
}, {
"key" : "ccc",
"doc_count" : 1
}, {
"key" : "ddd",
"doc_count" : 1
}, {
"key" : "eee",
"doc_count" : 1
}, {
"key" : "hhh",
"doc_count" : 1
}, {
"key" : "mmm",
"doc_count" : 1
}, {
"key" : "xxx",
"doc_count" : 1
} ]
}
}
}
It might be because this question and the accepted answer are some years old, but now there is a better way.
The accepted answer does not take into account the fact that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on.
This is usually the case for fields that contain data of type text
and not keyword
.
This is why ElasticSearch actually has an aggregation specifically for this purpose called Significant Text Aggregation.
From the docs:
- It is specifically designed for use on type
text
fields - It does not require field data or doc-values
- It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.
It can, however, take longer than other kinds of queries, so it is suggested to use this after filtering the data with a query.match, or with a previous aggregation of type sampler.
So, in your case you would send a query like this (leaving out the filtering/sampling):
{
"aggs": {
"keywords": {
"significant_text": {
"field": "myfield"
}
}
}
}