python dictionary count of unique values
Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.
To count unique values per key, exactly, you'd have to collect those values into sets first:
values_per_key = {}
for d in iterable_of_dicts:
    for k, v in d.items():
        values_per_key.setdefault(k, set()).add(v)
counts = {k: len(v) for k, v in values_per_key.items()}
which for your input, produces:
>>> values_per_key = {}
>>> for d in iterable_of_dicts:
...     for k, v in d.items():
...         values_per_key.setdefault(k, set()).add(v)
...
>>> counts = {k: len(v) for k, v in values_per_key.items()}
>>> counts
{'abc': 3, 'xyz': 1, 'pqr': 4}
We can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers (see below):
>>> from collections import Counter
>>> Counter(counts)
Counter({'pqr': 4, 'abc': 3, 'xyz': 1})
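For reference, here is a self-contained sketch of the exact count, run against the sample list from the question. The function name count_unique_values is only illustrative, and the sketch assumes every value is hashable (strings, in this case).

```python
from collections import defaultdict


def count_unique_values(iterable_of_dicts):
    """Return {key: number of distinct values seen for that key}."""
    values_per_key = defaultdict(set)
    for d in iterable_of_dicts:
        for k, v in d.items():
            # A set silently ignores values it has already seen.
            values_per_key[k].add(v)
    return {k: len(v) for k, v in values_per_key.items()}


sample = [
    {"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
    {"xyz": "music"},
    {"pqr": "music"}, {"pqr": "movies"}, {"pqr": "sports"},
    {"pqr": "news"}, {"pqr": "sports"},
]
unique_counts = count_unique_values(sample)
print(unique_counts)
```

Using defaultdict(set) instead of setdefault() is purely a matter of taste; both build the same per-key sets.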
The downside is that if your input iterable is very large, the above approach can require a lot of memory. In case you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a HyperLogLog structure or other algorithms that 'sketch out' a count for the stream.
This approach requires you to install a 3rd-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's an HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement to the HLL approach):
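from collections import defaultdict
from datasketch import HyperLogLogPlusPlus

counts = defaultdict(HyperLogLogPlusPlus)
for d in iterable_of_dicts:
    for k, v in d.items():
        # update() takes bytes, so this assumes the values are strings
        counts[k].update(v.encode('utf8'))
# counts[k].count() then gives the estimated number of distinct values for k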
In a distributed setup, you could use Redis to manage the HLL counts.
My original answer:
Use a collections.Counter() instance, together with some chaining:
from collections import Counter
from itertools import chain
counts = Counter(chain.from_iterable(e.keys() for e in d))
This ensures that dictionaries with more than one key in your input list are counted correctly.
Demo:
>>> from collections import Counter
>>> from itertools import chain
>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
or with multiple keys in the input dictionaries:
>>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
A Counter() has additional, helpful functionality, such as the .most_common() method, which lists elements and their counts from most common to least common:
for key, count in counts.most_common():
    print('{}: {}'.format(key, count))
# prints
# pqr: 5
# abc: 3
# xyz: 1
What you're describing--a list with multiple values for each key--would be better visualized by something like this:
{'abc': ['movies', 'sports', 'music'],
'xyz': ['music'],
'pqr': ['music', 'movies', 'sports', 'news']
}
In that case, you have to do a bit more work to insert:
- Look up the key to see if it already exists
  - If it doesn't exist, create a new key with value [] (empty list)
- Retrieve the value (the list associated with the key)
- Use if value in to see if the value being checked exists in the list
- If the new value isn't in the list, .append() it
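The steps above can be sketched roughly as follows. The helper name add_value and the sample pairs are made up for illustration, and setdefault() stands in for the explicit "look up, create if missing" step.

```python
def add_value(my_dict, key, value):
    """Append value to the list stored under key, skipping duplicates."""
    # Look up the key; create it with an empty list if it is missing.
    values = my_dict.setdefault(key, [])
    # Only append when the value is not already in the list.
    if value not in values:
        values.append(value)


my_dict = {}
pairs = [("abc", "movies"), ("abc", "sports"), ("pqr", "sports"),
         ("pqr", "news"), ("pqr", "sports")]
for key, value in pairs:
    add_value(my_dict, key, value)
print(my_dict)
```

The membership test on a list is linear in its length, so if the per-key lists grow large, a set is usually the better container.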
This also leads to an easy way to count the total number of elements stored:
# Print how many values are stored under each key
for myKey in myDict.keys():
    print("{0}: {1}".format(myKey, len(myDict[myKey])))
There's no need to use Counter. You can achieve the same result this way:
# input: a list of dictionaries
d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
# collect every key, once per occurrence
b = [k for i in d for k in i]
# print each distinct key with its number of occurrences
for k in set(b):
    print("{0}: {1}".format(k, b.count(k)))
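Note that b.count(k) rescans the whole list once per distinct key. If that matters for larger inputs, a single pass with a plain dict gives the same key counts, still without Counter. This is only a sketch, and the name key_counts is not from the answer above:

```python
d = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
     {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
     {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

key_counts = {}
for item in d:
    for k in item:
        # dict.get() supplies 0 the first time a key is seen
        key_counts[k] = key_counts.get(k, 0) + 1

for k, n in key_counts.items():
    print("{0}: {1}".format(k, n))
```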