Count letter frequency in word list, excluding duplicates in the same word

A variation on @Primusa's answer that does not use update:

from collections import Counter

words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())

Output

Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})

The idea is to convert each word to a set, which drops repeated letters within that word, and then count the letters across all of the sets.
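To make the effect of the set conversion concrete, here is a small sketch that counts the same words with and without it. The names raw and per_word are only illustrative and are not part of the answer above.

```python
from collections import Counter

words = ["tree", "bone", "indigo", "developer"]

# Without set(): every occurrence is counted, so "tree" adds 2 to 'e'
# and "developer" adds 3.
raw = Counter(c for word in words for c in word.lower())

# With set(): each word contributes at most 1 per distinct letter.
per_word = Counter(c for word in words for c in set(word.lower()))

print(raw["e"])       # 6
print(per_word["e"])  # 3
```

The second count answers "how many words contain this letter", which is what the question asks for.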

A version without Counter:

words = ["tree", "bone", "indigo", "developer"]
d = {}
for word in words:                 # iterate over the words
    for ch in set(word.lower()):   # set() removes repeated letters within a word
        d[ch] = d.get(ch, 0) + 1   # start at 0 the first time a letter is seen

Output

{'b': 1,
 'd': 2,
 'e': 3,
 'g': 1,
 'i': 1,
 'l': 1,
 'n': 2,
 'o': 3,
 'p': 1,
 'r': 2,
 't': 1,
 'v': 1}

Create a counter object and then update it with sets for each word:

from collections import Counter

wordlist = ["tree","bone","indigo","developer"]

c = Counter()
for word in wordlist:
    c.update(set(word.lower()))

print(c)

Output:

Counter({'e': 3, 'o': 3, 'r': 2, 'n': 2, 'd': 2, 't': 1, 'b': 1, 'i': 1, 'g': 1, 'v': 1, 'p': 1, 'l': 1})

Note that letters that do not appear in wordlist are not stored in the Counter. This is fine, because a Counter returns 0 for a missing key instead of raising a KeyError. Unlike a defaultdict(int), it does not insert the key when you look it up.
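A short sketch of that missing-key behaviour, using the same word list; the letter "z" is just an arbitrary example of a letter that never occurs.

```python
from collections import Counter

c = Counter()
for word in ["tree", "bone", "indigo", "developer"]:
    c.update(set(word.lower()))

# Looking up a letter that never occurred returns 0 rather than raising.
print(c["z"])    # 0

# The lookup does not add the key to the Counter.
print("z" in c)  # False
```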

Comparing speed of the solutions presented so far:

from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(
        c
        for word in words
        for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d
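Before timing, it may be worth checking that the three functions really produce the same counts. The sketch below restates them compactly so it runs on its own; the results and all_equal names and the mixed-case test word are assumptions added for the check, not part of the original answers.

```python
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(c for word in words for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for ch in set(word.lower()):
            d[ch] = d.get(ch, 0) + 1
    return d

# "Tree" is capitalised on purpose to exercise the lower() call.
words = ["Tree", "bone", "indigo", "developer"]

# Convert everything to plain dicts so Counter and dict results compare directly.
results = [dict(f(words)) for f in (f1, f2, f3)]
all_equal = results[0] == results[1] == results[2]
print(all_equal)  # True
```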

My timing function (using different sizes for the list of words):

import timeit

word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
]

for exp in range(5):
    words = word_list * 10**exp

    result_list = []
    for i in range(1, 4):
        t = timeit.timeit(
            'f(words)',
            'from __main__ import words,  f{} as f'.format(i),
            number=100)
        result_list.append((i, t))

    print('{:10,d} words | {}'.format(
        len(words),
        ' | '.join(
            'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

The results:

        10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
       100 words | f1   0.0245 sec | f2   0.0082 sec | f3   0.0113 sec
     1,000 words | f1   0.2450 sec | f2   0.0812 sec | f3   0.1134 sec
    10,000 words | f1   2.4601 sec | f2   0.8113 sec | f3   1.1335 sec
   100,000 words | f1  24.4195 sec | f2   8.1828 sec | f3  11.2167 sec

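The Counter built from a generator expression (f2()) is the fastest in these runs, except at the smallest size, where the plain dict (f3()) is marginally ahead. Calling Counter.update() once per word (f1()) is the slowest, at roughly three times the run time of f2().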
