Elasticsearch - How to get popular words list of documents

A simple terms aggregation will do this (here mydata is the name of your field):

curl -XGET 'http://localhost:9200/test/data/_search?search_type=count&pretty' -d '{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "mydata_agg": {
      "terms": { "field": "mydata" }
    }
  }
}'

This returns one bucket per term, ordered by document count:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "mydata_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "aaa",
        "doc_count" : 3
      }, {
        "key" : "fff",
        "doc_count" : 3
      }, {
        "key" : "bbb",
        "doc_count" : 2
      }, {
        "key" : "ccc",
        "doc_count" : 1
      }, {
        "key" : "ddd",
        "doc_count" : 1
      }, {
        "key" : "eee",
        "doc_count" : 1
      }, {
        "key" : "hhh",
        "doc_count" : 1
      }, {
        "key" : "mmm",
        "doc_count" : 1
      }, {
        "key" : "xxx",
        "doc_count" : 1
      } ]
    }
  }
}
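
A note for newer clusters: as far as I know, search_type=count was deprecated in Elasticsearch 2.0 and removed in 5.0, and the usual replacement is to set "size" to 0 in the request body so that no hits are returned. A minimal sketch of an equivalent body, assuming the same mydata field (the inner "size" of 10 is simply the default number of buckets, written out so it is easy to change):

```json
{
  "size": 0,
  "aggs": {
    "mydata_agg": {
      "terms": { "field": "mydata", "size": 10 }
    }
  }
}
```

The top-level "size" controls how many documents come back, while the "size" inside "terms" controls how many buckets (words) are returned.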

This question and the accepted answer are some years old, and there is now a better way.

The accepted answer does not take into account the fact that the most common words are usually uninteresting, e.g. stopwords such as "the", "a", "in", "for" and so on.

This is usually the case for fields of type text rather than keyword.

This is why Elasticsearch has an aggregation designed specifically for this purpose, the Significant Text aggregation.
From the docs:

It is specifically designed for use on type text fields
It does not require field data or doc-values
It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

It can, however, take longer than other kinds of queries, so it is suggested to first narrow the data with a query (for example a match query) or to nest it inside a sampler aggregation.

So, in your case you would send a query like this (leaving out the filtering/sampling):

{
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "myfield"
            }
        }
    }
}
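
For completeness, here is a rough sketch of the same request with the filtering and sampling put back in. The match text "some search terms" and the aggregation name "sample" are placeholders I made up, and the shard_size of 100 is only an illustrative value; adjust all three for your data:

```json
{
    "query": {
        "match": { "myfield": "some search terms" }
    },
    "aggs": {
        "sample": {
            "sampler": { "shard_size": 100 },
            "aggs": {
                "keywords": {
                    "significant_text": {
                        "field": "myfield"
                    }
                }
            }
        }
    }
}
```

The match query restricts which documents are considered, and the sampler limits the significant_text aggregation to the top-scoring documents on each shard, which should keep the on-the-fly re-analysis cheaper.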
