Python program that finds the most frequent word in a .txt file and prints the word and its count

If you need to count the words in a passage, a regular expression is a convenient way to split the text into words.

Let's start with a simple example:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string)  # finds every run of word characters in the string

Result:

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']

Note that "Is" and "is" are two different words. My guess is that you want to count them as the same word, so we can just convert all the words to uppercase and then count them.

from collections import Counter

cap_words = [word.upper() for word in words]  # converts every word to uppercase

word_counts = Counter(cap_words)  # counts how many times each word appears

Result:

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})

Are you good up to here?

Now we need to do exactly the same thing as above, except that this time we read the text from a file.

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)
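The question also asks for the most frequent word and its count to be printed, which the code above stops short of. Here is a minimal sketch of that last step using Counter.most_common(1). The function names most_frequent_in_text and most_frequent_in_file are made up for illustration, and reading the file as UTF-8 is an assumption. If several words tie for first place, only one of them is returned.

```python
import re
from collections import Counter


def most_frequent_in_text(passage):
    """Return (word, count) for the most common word, or None if there are no words."""
    words = re.findall(r"\w+", passage)
    # Uppercase each word so that "Is" and "is" are counted together.
    word_counts = Counter(word.upper() for word in words)
    if not word_counts:
        return None
    # most_common(1) returns a one-element list of (word, count) pairs.
    return word_counts.most_common(1)[0]


def most_frequent_in_file(path):
    """Hypothetical helper: same as above, but reads the passage from a file."""
    with open(path, encoding="utf-8") as f:  # UTF-8 is an assumption
        return most_frequent_in_text(f.read())


result = most_frequent_in_text("the cat and THE hat")
if result is not None:
    word, count = result
    print(word, count)  # prints: THE 2
```

For a real file you would call most_frequent_in_file('your_file.txt') and print the pair in the same way.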

This program is actually a 4-liner, if you use the powerful tools at your disposal:

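import re
import collections

with open(yourfile) as f:   # yourfile: the path to your .txt file
    text = f.read()

words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE (already the default for str patterns in Python 3)
counts = collections.Counter(words)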

The regular expression will find all words, regardless of the punctuation adjacent to them (but counting apostrophes as part of the word).

A Counter acts almost exactly like a dictionary, but it also lets you do things like counts.most_common(10), add two counters together, etc. See help(collections.Counter).
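As a quick illustration of those Counter features, here is a small sketch on made-up sample data (the variable names are only for this example):

```python
from collections import Counter

counts = Counter(["test", "is", "a", "test"])
more = Counter(["a", "test"])

# The n most frequent items, as a list of (item, count) pairs.
top = counts.most_common(1)

# Adding two counters sums the counts for each key.
combined = counts + more

# Unlike a plain dict, a missing key gives 0 instead of raising KeyError.
missing = counts["absent"]

print(top)       # prints: [('test', 2)]
print(missing)   # prints: 0
```

Here combined holds 3 for 'test', 2 for 'a' and 1 for 'is'.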

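I would also suggest that you not write functions named printBy... that both compute and print the result, since only functions without side effects are easy to reuse. Have one function return the data and a separate one print it: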

def countsSortedAlphabetically(counter, **kw):
    return sorted(counter.items(), **kw)

#def countsSortedNumerically(counter, **kw):
#    return sorted(counter.items(), key=lambda x:x[1], **kw)
#### use counter.most_common(n) instead

# `from pprint import pprint as pp` is also useful
def printByLine(tuples):
    print( '\n'.join(' '.join(map(str,t)) for t in tuples) )

Demo:

>>> from collections import Counter
>>> words = Counter(['test','is','a','test'])
>>> printByLine( countsSortedAlphabetically(words, reverse=True) )
test 2
is 1
a 1

Edit, to address Mateusz Konieczny's comment: I replaced [a-zA-Z'] with [\w']. According to the Python docs, the character class \w "Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." Note that \w does not match an apostrophe, which is why the apostrophe is listed separately in the character class. However, \w includes _ and 0-9, so if you don't want those and you aren't working with Unicode, you can use [a-zA-Z']. If you are working with Unicode, you'd need a negative assertion or something similar to subtract [0-9_] from the \w character class.
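One possible way to do that subtraction, offered as a sketch rather than the only approach, is a negated character class instead of a lookahead assertion: [^\W\d_] matches any character that is a word character but is neither a digit nor an underscore. The pattern name and sample sentence below are made up for illustration. One caveat: a few numeric characters that \d does not cover (such as superscript digits) may still get through.

```python
import re

# Word characters minus digits and underscore, plus the apostrophe.
LETTERS_AND_APOSTROPHES = re.compile(r"(?:[^\W\d_]|')+")

text = "Don't count user_42 or 2024, but do count Straße."
words = LETTERS_AND_APOSTROPHES.findall(text)

print(words)
# prints: ["Don't", 'count', 'user', 'or', 'but', 'do', 'count', 'Straße']
```

Note that "user_42" contributes only "user", because the underscore and the digits no longer count as part of a word.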


You have a simple typo: you wrote words where you want word.

Edit: You appear to have edited the source. Please use copy and paste to get it right the first time.

Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.
