FreqDist in NLTK not sorting output

From NLTK's GitHub:

FreqDist in NLTK3 is a wrapper for collections.Counter; Counter provides most_common() method to return items in order. FreqDist.keys() method is provided by standard library; it is not overridden. I think it is good we're becoming more compatible with stdlib.

docs at googlecode are very old, they are from 2011. More up-to-date docs can be found on http://nltk.org website.

So for NLTK version 3, instead of fdist1.keys()[:50], use fdist1.most_common(50).

The tutorial has also been updated:

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
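
A minimal sketch of why keys() no longer gives frequency order. It uses a plain Counter as a stand-in for FreqDist (an assumption, made so the example runs without NLTK or its corpora), and the token list is made up:

```python
from collections import Counter

# Counter stands in for FreqDist here; the token list is made up.
tokens = ["whale", "the", "sea", "the", "whale", "the"]
counts = Counter(tokens)

# keys() follows insertion order (Python 3.7+), not frequency.
keys_in_order = list(counts.keys())

# most_common(n) returns (sample, count) pairs, most frequent first.
top_pairs = counts.most_common(2)

# To get only the words, as the old keys()[:n] did, drop the counts.
top_words = [word for word, _ in top_pairs]

print(keys_in_order)
print(top_pairs)
print(top_words)
```

The last line is probably the closest replacement for code that relied on the old keys()[:n] behaviour and only needs the words.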

As an alternative to using FreqDist, you can simply use `Counter` from `collections`; see also https://stackoverflow.com/questions/22952069/how-to-get-the-rank-of-a-word-from-a-dictionary-with-word-frequencies-python/22953416#22953416 :

>>> from collections import Counter
>>> text = """foo foo bar bar foo bar hello bar hello world  hello world hello world hello world  hello world hello hello hello"""
>>> dictionary = Counter(text.split())
>>> dictionary
Counter({'hello': 9, 'world': 5, 'bar': 4, 'foo': 3})
>>> dictionary.most_common()
[('hello', 9), ('world', 5), ('bar', 4), ('foo', 3)]
>>> [i[0] for i in dictionary.most_common()]
['hello', 'world', 'bar', 'foo']
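
If what you need is the rank of a single word, as in the linked question, one possible sketch is below. The helper name word_rank is hypothetical and not part of collections or NLTK; words with equal counts are ranked in whatever order most_common() returns them.

```python
from collections import Counter

def word_rank(counts, word):
    """Return the 1-based frequency rank of word, or None if it is absent."""
    for rank, (sample, _) in enumerate(counts.most_common(), start=1):
        if sample == word:
            return rank
    return None

# Made-up sample text.
text = "foo foo bar bar foo bar hello bar hello world hello"
counts = Counter(text.split())

print(word_rank(counts, "bar"))      # the most frequent word
print(word_rank(counts, "missing"))  # a word that never occurs
```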

This answer is old and applies to NLTK 2. For NLTK 3, use the most_common() answer above instead.

In order to troubleshoot this issue, I would recommend taking the following steps:

1. Check which version of nltk you are using:

>>> import nltk
>>> print(nltk.__version__)
2.0.4  # preferably 2.0 or higher

Older versions of nltk do not have a sortable FreqDist.keys method.

2. Verify that you have not inadvertently modified text1 or vocabulary1:

Open a new shell and start the process over again from the beginning:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> from nltk import FreqDist
>>> fdist1 = FreqDist(text1)
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

Note that vocabulary1 should not contain the string u'succour' (the first unicode string in the output of your original post):

>>> vocabulary1.count(u'succour')  # vocabulary1 does **not** contain the string u'succour'
0

3. If you are still having trouble, inspect your source code and text lists to make sure they match what you see below:

>>> import inspect
>>> print(inspect.getsource(FreqDist.keys))  # make sure your source code matches the source code below
    def keys(self):
        """
        Return the samples sorted in decreasing order of frequency.

        :rtype: list(any)
        """
        self._sort_keys_by_value()
        return map(itemgetter(0), self._item_cache)

>>> print(inspect.getsource(FreqDist._sort_keys_by_value))  # and matches this source code
    def _sort_keys_by_value(self):
        if not self._item_cache:
            self._item_cache = sorted(dict.items(self), key=lambda x:(-x[1], x[0]))  # <= check this line especially

>>> text1[:40]  # does the first part of your text list match this one?
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',', 'heart', ',', 'body', ',', 'and', 'brain', ';', 'I', 'see', 'him']

>>> text1[-40:]  # and what about the end of your text list?
['second', 'day', ',', 'a', 'sail', 'drew', 'near', ',', 'nearer', ',', 'and', 'picked', 'me', 'up', 'at', 'last', '.', 'It', 'was', 'the', 'devious', '-', 'cruising', 'Rachel', ',', 'that', 'in', 'her', 'retracing', 'search', 'after', 'her', 'missing', 'children', ',', 'only', 'found', 'another', 'orphan', '.']

If your source code or text lists do not match the above exactly, consider re-installing nltk with the most recent stable version.
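
For reference, the sort key on the line flagged in _sort_keys_by_value orders samples by decreasing count and breaks ties by the sample itself. A small sketch of that key with a plain dict and made-up counts (an illustration only, not the NLTK source):

```python
# Made-up counts; 'a' and 'b' tie on frequency.
counts = {"b": 2, "c": 5, "a": 2}

# Negating the count sorts by decreasing frequency; the sample itself
# breaks ties in ascending (alphabetical) order.
items = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
keys = [sample for sample, _ in items]

print(keys)
```

If that line were missing or altered in your installed copy, keys() would not come back in frequency order.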


import nltk
fdist1 = nltk.FreqDist(text)

fdist1 maps each word (the key) to its frequency count (the value).

Iterating over fdist1 does not yield the words in frequency order, so slicing it will not give the top 50 results. Sort the words by their counts first:

sorted_fdist1 = sorted(fdist1, key=fdist1.__getitem__, reverse=True)
sorted_fdist1[0:50]

This gives the 50 most frequent words.
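
As a quick check, the manual sort and most_common() should agree on the top words when there are no ties among them. The sketch below assumes a plain Counter in place of FreqDist and uses a made-up sentence:

```python
from collections import Counter

# Counter stands in for FreqDist; the sentence is made up.
counts = Counter("the whale and the sea and the ship".split())

# Manual sort: words ordered by their counts, highest first.
by_sorted = sorted(counts, key=counts.__getitem__, reverse=True)[:2]

# Built-in equivalent: take the words from the (word, count) pairs.
by_most_common = [word for word, _ in counts.most_common(2)]

print(by_sorted)
print(by_most_common)
```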
