Python: collections.Counter vs defaultdict(int)
Both Counter and defaultdict(int) can work fine here, but there are a few differences between them:
Counter supports most of the operations you can do on a multiset. So, if you want to use those operations, go for Counter.
Counter won't add new keys to the dict when you query for missing keys. So, if your queries include keys that may not be present in the dict, it is better to use Counter.
Example:
>>> from collections import Counter, defaultdict
>>> c = Counter()
>>> d = defaultdict(int)
>>> c[0], d[1]
(0, 0)
>>> c
Counter()
>>> d
defaultdict(<class 'int'>, {1: 0})
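The multiset-style operations that Counter supports can be sketched roughly as follows. The two counters and their contents are made up for illustration; the operators themselves (+, -, &, |) are part of collections.Counter.

```python
from collections import Counter

# Hypothetical sample data, chosen only to illustrate the operators.
a = Counter({'x': 3, 'y': 1})
b = Counter({'x': 1, 'y': 2, 'z': 4})

total = a + b    # add counts key by key
diff = a - b     # subtract counts, dropping results that are zero or negative
common = a & b   # intersection: minimum of the two counts
union = a | b    # union: maximum of the two counts

print(total, diff, common, union, sep='\n')
```

A plain defaultdict(int) has none of these operators, so the same results would need explicit loops.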
Counter also has a method called most_common that allows you to sort items by their count. To get the same thing with a defaultdict you'll have to use sorted.
Example:
>>> c = Counter('aaaaaaaaabbbbbbbcc')
>>> c.most_common()
[('a', 9), ('b', 7), ('c', 2)]
>>> c.most_common(2) #return 2 most common items and their counts
[('a', 9), ('b', 7)]
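For comparison, here is a minimal sketch of getting the same ordering from a defaultdict(int) with sorted. The variable names by_count and top_two are only illustrative.

```python
from collections import defaultdict
from operator import itemgetter

# Build the counts by hand, as you would with defaultdict(int).
d = defaultdict(int)
for ch in 'aaaaaaaaabbbbbbbcc':
    d[ch] += 1

# Sort the (key, count) pairs by count, highest first.
by_count = sorted(d.items(), key=itemgetter(1), reverse=True)

# Slice to get the equivalent of most_common(2).
top_two = by_count[:2]

print(by_count)
print(top_two)
```

This works, but it is more to type and easier to get wrong than calling most_common directly.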
Counter also allows you to create a list of elements from the Counter object.
Example:
>>> c = Counter({'a':5, 'b':3})
>>> list(c.elements())
['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b']
So, depending on what you want to do with the resulting dict, you can choose between Counter and defaultdict(int).
I would use defaultdict(int) for summing counts, such as in this case, and Counter() for counting list elements. In your case, the following would be the cleanest solution:
from collections import defaultdict

name_count = [
    ("Lucy", 1),
    ("Bob", 5),
    ("Jim", 40),
    ("Susan", 6),
    ("Lucy", 2),
    ("Bob", 30),
    ("Harold", 6),
]

# Sum the counts per name; missing names start at 0.
aggregate_counts = defaultdict(int)
for name, count in name_count:
    aggregate_counts[name] += count
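As a quick check, the sketch below runs the same aggregation with both containers on the data above and compares the totals. The name counter_totals is hypothetical and only there for the comparison.

```python
from collections import Counter, defaultdict

name_count = [
    ("Lucy", 1), ("Bob", 5), ("Jim", 40), ("Susan", 6),
    ("Lucy", 2), ("Bob", 30), ("Harold", 6),
]

# Summing with defaultdict(int), as suggested above.
aggregate_counts = defaultdict(int)
for name, count in name_count:
    aggregate_counts[name] += count

# The same loop with Counter gives the same totals.
counter_totals = Counter()
for name, count in name_count:
    counter_totals[name] += count

print(dict(aggregate_counts))
print(dict(counter_totals) == dict(aggregate_counts))
```

Either container gives the same numbers here, so the choice comes down to which extra behaviour you want afterwards.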
defaultdict(int) seems to be faster:
In [1]: from collections import Counter, defaultdict

In [2]: def test_counter():
   ...:     c = Counter()
   ...:     for i in range(10000):
   ...:         c[i] += 1
   ...:

In [3]: def test_defaultdict():
   ...:     d = defaultdict(int)
   ...:     for i in range(10000):
   ...:         d[i] += 1
   ...:

In [4]: %timeit test_counter()
5.28 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit test_defaultdict()
2.31 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
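To reproduce this outside IPython, a rough equivalent using the standard timeit module might look like the sketch below. The n parameter and the return values are additions for convenience, not part of the session above, and the timings will differ by machine and Python version.

```python
import timeit
from collections import Counter, defaultdict


def test_counter(n=10000):
    # Increment n distinct keys in a Counter.
    c = Counter()
    for i in range(n):
        c[i] += 1
    return c


def test_defaultdict(n=10000):
    # Increment n distinct keys in a defaultdict(int).
    d = defaultdict(int)
    for i in range(n):
        d[i] += 1
    return d


if __name__ == "__main__":
    runs = 100
    for fn in (test_counter, test_defaultdict):
        seconds = timeit.timeit(fn, number=runs)
        print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```

Note that every key in this benchmark is new, so it mostly measures the missing-key path; a workload that increments existing keys may show a different gap.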