What is itertools.groupby() used for?
To start with, you may read the official documentation for itertools.groupby.
I will place what I consider to be the most important point first. I hope the reason will become clear after the examples.
ALWAYS SORT THE ITEMS BY THE SAME KEY YOU WILL USE FOR GROUPING, SO AS TO AVOID UNEXPECTED RESULTS
itertools.groupby(iterable, key=None or some func)
takes a single iterable and groups its consecutive items based on a specified key. The key is a function applied to each individual item, and its result is used as the heading for that item's group; adjacent items which end up having the same 'key' value end up in the same group. If no key is given, each item is its own key.
The return value is an iterator of (key, group) pairs, where each group is itself an iterator over the items in that group. It resembles a dictionary's key : value items only loosely, because the same key can appear more than once when equal-key items are not adjacent.
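To make the shape of the return value concrete, here is a minimal sketch. The list `words` is made-up sample data, chosen so that items with the same first letter are already next to each other.

```python
from itertools import groupby

# Hypothetical sample data, already ordered by first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# groupby yields (key, group) pairs. Each group is a lazy iterator,
# so it is turned into a list here to make its contents visible.
pairs = [(key, list(group)) for key, group in groupby(words, key=lambda w: w[0])]
print(pairs)
# [('a', ['apple', 'avocado']), ('b', ['banana', 'blueberry']), ('c', ['cherry'])]
```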
Example 1
from itertools import groupby

# note here that the tuple counts as one item in this list. I did not
# specify any key, so each item in the list is a key on its own.
c = groupby(['goat', 'dog', 'cow', 1, 1, 2, 3, 11, 10, ('persons', 'man', 'woman')])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
{1: [1, 1],
'goat': ['goat'],
3: [3],
'cow': ['cow'],
('persons', 'man', 'woman'): [('persons', 'man', 'woman')],
10: [10],
11: [11],
2: [2],
'dog': ['dog']}
Example 2
# notice here that mulato, cow and cat don't show up. Only the last group with a
# given key survives, because assigning dic[k] again replaces the earlier result:
# the last group for 'c' wipes out ['cow', 'cat'], and the last group for 'm' wipes out ['mulato'].
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
c = groupby(list_things, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
{'c': ['camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}
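To see what groupby actually produced before the dictionary overwrote anything, one option is to collect the pairs into a list instead. This is only a sketch using the same data as Example 2; the name `groups` is mine.

```python
from itertools import groupby

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']

# Keep every (key, items) pair in order; a list does not merge or replace repeated keys.
groups = [(k, list(v)) for k, v in groupby(list_things, key=lambda x: x[0])]
for k, v in groups:
    print(k, v)
# g ['goat']
# d ['dog', 'donkey']
# m ['mulato']
# c ['cow', 'cat']
# persons [('persons', 'man', 'woman')]
# w ['wombat']
# m ['mongoose', 'malloo']
# c ['camel']
```

The keys 'm' and 'c' each appear twice, which is why the dictionary version lost items.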
Now for the sorted version
# but observe the sorted version, where I have the data sorted first on the same key I used for grouping
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
sorted_list = sorted(list_things, key=lambda x: x[0])
print(sorted_list)
print()
c = groupby(sorted_list, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic
results in
['cow', 'cat', 'camel', 'dog', 'donkey', 'goat', 'mulato', 'mongoose', 'malloo', ('persons', 'man', 'woman'), 'wombat']
{'c': ['cow', 'cat', 'camel'],
'd': ['dog', 'donkey'],
'g': ['goat'],
'm': ['mulato', 'mongoose', 'malloo'],
'persons': [('persons', 'man', 'woman')],
'w': ['wombat']}
Example 3
things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "harley"),
          ("vehicle", "speed boat"), ("vehicle", "school bus")]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic
results in
{'animal': [('animal', 'bear'), ('animal', 'duck')],
'plant': [('plant', 'cactus')],
'vehicle': [('vehicle', 'harley'),
('vehicle', 'speed boat'),
('vehicle', 'school bus')]}
Now with input that is not already in key order, so the sort really matters. I changed the tuples to lists here. Same results either way.
things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"],
          ["vehicle", "speed boat"], ["vehicle", "school bus"]]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic
results in
{'animal': [['animal', 'bear'], ['animal', 'duck']],
'plant': [['plant', 'cactus']],
'vehicle': [['vehicle', 'harley'],
['vehicle', 'speed boat'],
['vehicle', 'school bus']]}
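The sort-then-group pattern from the examples can be wrapped in a small helper so the same key is always used for both steps. This is a sketch only; `group_by_key` is a hypothetical name, not something from itertools.

```python
from itertools import groupby

def group_by_key(items, key):
    """Sort by `key`, then group by the same `key`, and return a plain dict."""
    # Using one key function for both sorted() and groupby() guarantees that
    # equal-key items are adjacent, so no group can overwrite another.
    return {k: list(g) for k, g in groupby(sorted(items, key=key), key=key)}

things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"],
          ["vehicle", "speed boat"], ["vehicle", "school bus"]]
result = group_by_key(things, key=lambda x: x[0])
print(result["vehicle"])
# [['vehicle', 'harley'], ['vehicle', 'speed boat'], ['vehicle', 'school bus']]
```

Because sorted() is stable, items keep their original relative order inside each group.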
As always, the documentation of the function should be the first place to check. However, itertools.groupby is certainly one of the trickiest itertools because it has some possible pitfalls:
- It only groups the items if their key-result is the same for successive items:
    from itertools import groupby
    for key, group in groupby([1,1,1,1,5,1,1,1,1,4]):
        print(key, list(group))
    # 1 [1, 1, 1, 1]
    # 5 [5]
    # 1 [1, 1, 1, 1]
    # 4 [4]
  One could use sorted beforehand, if one wants to do an overall groupby.
- It yields two items, and the second one is an iterator (so one needs to iterate over the second item!). I explicitly needed to cast these to a list in the previous example.
- The second yielded element is discarded if one advances the groupby-iterator:
    it = groupby([1,1,1,1,5,1,1,1,1,4])
    key1, group1 = next(it)
    key2, group2 = next(it)
    print(key1, list(group1))
    # 1 []
  Even though group1 wasn't empty when it was yielded!
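One way around the last pitfall is to turn each group into a list before asking the groupby-iterator for the next pair. A minimal sketch with the same input as above:

```python
from itertools import groupby

it = groupby([1, 1, 1, 1, 5, 1, 1, 1, 1, 4])

key1, group1 = next(it)
group1 = list(group1)  # consume the group before advancing the groupby-iterator
key2, group2 = next(it)
group2 = list(group2)

print(key1, group1)  # 1 [1, 1, 1, 1]
print(key2, group2)  # 5 [5]
```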
As already mentioned, one can use sorted to do an overall groupby operation, but that is inefficient: sorting has to load the whole input into memory, which throws away the memory-efficiency if you want to use groupby on generators. There are better alternatives available if you can't guarantee that the input is sorted (which also don't require the O(n log(n)) sorting time overhead):
- collections.defaultdict
- iteration_utilities.groupedby
- probably more.
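As an illustration of the collections.defaultdict alternative, here is a minimal sketch that groups unsorted input in a single pass. The function name `group_unsorted` is hypothetical, and the data is borrowed from the first answer.

```python
from collections import defaultdict

def group_unsorted(iterable, key):
    """Group all items by key(item) in one pass, without sorting first."""
    groups = defaultdict(list)
    for item in iterable:
        # Items with the same key land in the same list, adjacent or not.
        groups[key(item)].append(item)
    return dict(groups)

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
result = group_unsorted(list_things, key=lambda x: x[0])
print(result['c'])  # ['cow', 'cat', 'camel']
print(result['m'])  # ['mulato', 'mongoose', 'malloo']
```

Unlike groupby, this needs hashable keys and builds every group in memory, but it works on a generator without sorting it.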
However, groupby is great for checking local properties, that is, properties of runs of adjacent items. There are two recipes in the itertools-recipes section:
def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    g = groupby(iterable)
    return next(g, True) and not next(g, False)
and:
from operator import itemgetter

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return map(next, map(itemgetter(1), groupby(iterable, key)))
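A short usage sketch for the two recipes. They are repeated here as quoted above so the block runs on its own; the inputs are just illustrative.

```python
from itertools import groupby
from operator import itemgetter

def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    g = groupby(iterable)
    # True if there is no group at all, or exactly one group.
    return next(g, True) and not next(g, False)

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # Take the first item of every run of equal-key items.
    return map(next, map(itemgetter(1), groupby(iterable, key)))

print(all_equal([1, 1, 1]))  # True
print(all_equal([1, 2, 1]))  # False
print(all_equal([]))         # True
print(list(unique_justseen('AAAABBBCCDAABBB')))     # ['A', 'B', 'C', 'D', 'A', 'B']
print(list(unique_justseen('ABBCcAD', str.lower)))  # ['A', 'B', 'C', 'A', 'D']
```

Neither recipe needs sorted input, because both only ask about neighbouring items.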