python dictionary count of unique values

Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.

To count unique values per key, exactly, you'd have to collect those values into sets first:

values_per_key = {}
for d in iterable_of_dicts:
    for k, v in d.items():
        values_per_key.setdefault(k, set()).add(v)
counts = {k: len(v) for k, v in values_per_key.items()}

which for your input, produces:

>>> values_per_key = {}
>>> for d in iterable_of_dicts:
...     for k, v in d.items():
...         values_per_key.setdefault(k, set()).add(v)
...
>>> counts = {k: len(v) for k, v in values_per_key.items()}
>>> counts
{'abc': 3, 'xyz': 1, 'pqr': 4}

We can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers, see below:

>>> from collections import Counter
>>> Counter(counts)
Counter({'pqr': 4, 'abc': 3, 'xyz': 1})

The downside is that if your input iterable is very large the above approach can require a lot of memory. In case you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a hyperloglog structure or other algorithms that 'sketch out' a count for the stream.

This approach requires you install a 3rd-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's a HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement to the HLL approach):

from collections import defaultdict
from datasketch import HyperLogLogPlusPlus

counts = defaultdict(HyperLogLogPlusPlus)

for d in iterable_of_dicts:
    for k, v in d.items():
        counts[k].update(v.encode('utf8'))

In a distributed setup, you could use Redis to manage the HLL counts.

My original answer:

Use a collections.Counter() instance, together with some chaining:

from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(e.keys() for e in d))

This ensures that dictionaries with more than one key in your input list are counted correctly.

Demo:

>>> from collections import Counter
>>> from itertools import chain
>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

or with multiple keys in the input dictionaries:

>>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

A Counter() has additional, helpful functionality, such as the .most_common() method that lists elements and their counts in reverse sorted order:

for key, count in counts.most_common():
    print '{}: {}'.format(key, count)

# prints
# 5: pqr
# 3: abc
# 1: xyz

What you're describing--a list with multiple values for each key--would be better visualized by something like this:

{'abc': ['movies', 'sports', 'music'],
 'xyz': ['music'],
 'pqr': ['music', 'movies', 'sports', 'news']
}

In that case, you have to do a bit more work to insert:

Lookup key to see if it already exists
- If doesn't exist, create new key with value [] (empty list)
Retrieve value (the list associated with the key)
Use if value in to see if the value being checked exists in the list
If the new value isn't in, .append() it

This also leads to an easy way to count the total number of elements stored:

# Pseudo-code
for myKey in myDict.keys():
    print "{0}: {1}".format(myKey, len(myDict[myKey])

No need of using counter. You can achieve in this way:

# input dictionary
d=[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

# fetch keys
b=[j[0] for i in d for j in i.items()]

# print output
for k in list(set(b)):
    print "{0}: {1}".format(k, b.count(k))

python dictionary count of unique values

Tags:

Python

Dictionary

Related

Recent Posts