How to maintain a dictionary in a heap in Python?
Using a heap is a good solution, with time complexity O(n log k), where n is the number of entries in the dictionary and k is 10 here. The trick for dictionary keys is to wrap each key/value pair in a small comparison class and define the magic methods __lt__() and __gt__(), which override the < and > operators:
import heapq

class CompareWord:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):  # overrides the < operator
        return self.value < other.value

    def __gt__(self, other):  # overrides the > operator
        return self.value > other.value

    def getWord(self):
        return self.word

def findKGreaterValues(compare_dict, k):
    min_heap = []
    for word in compare_dict:
        heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    answer = []
    for compare_word_obj in min_heap:
        answer.append(compare_word_obj.getWord())
    return answer
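Assembled into a self-contained, runnable whole with some made-up sample data (the word counts below are purely illustrative):

```python
import heapq

class CompareWord:
    """Wraps a (word, value) pair so the heap orders by value."""
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):  # overrides the < operator
        return self.value < other.value

    def __gt__(self, other):  # overrides the > operator
        return self.value > other.value

    def getWord(self):
        return self.word

def findKGreaterValues(compare_dict, k):
    """Returns the k keys with the largest values, in heap (not sorted) order."""
    min_heap = []
    for word in compare_dict:
        heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
        if len(min_heap) > k:
            heapq.heappop(min_heap)  # evict the current smallest value
    return [obj.getWord() for obj in min_heap]

# Invented sample data for illustration
word_counts = {"apple": 5, "banana": 2, "cherry": 9, "date": 1, "fig": 7}
print(sorted(findKGreaterValues(word_counts, 3)))  # ['apple', 'cherry', 'fig']
```

Note that the returned words come out in heap order, not sorted by value; sort the result afterwards if you need a ranking.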
Using heapq, you probably want to do something like this:
heap = [(-value, key) for key, value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]
Note that since heapq implements only a min-heap, it's better to invert the values, so that bigger values become smaller.
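Alternatively, heapq also provides nlargest, which takes a key function and avoids the sign-flipping entirely; a minimal sketch with invented data:

```python
import heapq

the_dict = {"a": 3, "b": 10, "c": 7, "d": 1}  # illustrative sample data

# nlargest orders by the key function, so no negation is needed
top2 = heapq.nlargest(2, the_dict.items(), key=lambda item: item[1])
print(top2)  # [('b', 10), ('c', 7)]
```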
For small dictionaries this solution is only marginally faster than simply sorting; for example:
>>> import random
>>> import itertools as it
>>> def key_generator():
... characters = [chr(random.randint(65, 90)) for x in range(100)]
... for i in it.count():
... yield ''.join(random.sample(characters, 3))
...
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
... items = [(-value, key) for key, value in the_dict.items()]
... smallest = heapq.nsmallest(10, items)
... return [-value for value, key in smallest]
...
>>> def with_sorted(the_dict):
... return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
...
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902
With 3000 values it's just slightly faster than the sorted version, which is O(n log n) instead of O(n + m log n). If we increase the size of the dict to 10000, the heapq version becomes even faster:
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549
The timings probably also depend on the machine you are running on, so you should profile which solution works best in your case. If efficiency is not critical, I'd suggest using the sorted version because it's simpler.
For getting the top 10 elements, assuming that the number is in the second position of each item:
from operator import itemgetter
topten = sorted(mydict.items(), key=itemgetter(1), reverse = True)[0:10]
If you want to sort by value and then by key, just change it to key=itemgetter(1, 0).
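For instance, with a made-up dictionary (the names and numbers are purely illustrative):

```python
from operator import itemgetter

mydict = {"x": 4, "y": 9, "z": 9, "w": 1}  # invented sample data

# Sort items by value, descending; slicing caps the result at ten entries
topten = sorted(mydict.items(), key=itemgetter(1), reverse=True)[0:10]
print(topten)  # [('y', 9), ('z', 9), ('x', 4), ('w', 1)]
```

Because Python's sort is stable, ties (here 'y' and 'z') keep their original dictionary order.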
As for a data structure, a heap sounds like what you would want. Just keep the entries as (number, key) tuples, so that comparisons fall back on the number term first.
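A minimal sketch of the tuple approach (the counts below are made up): pushing (count, word) tuples means the heap compares counts first, with no wrapper class needed.

```python
import heapq

counts = {"red": 4, "green": 11, "blue": 7}  # illustrative sample data
k = 2
heap = []
for word, count in counts.items():
    heapq.heappush(heap, (count, word))  # tuples compare by count first
    if len(heap) > k:
        heapq.heappop(heap)  # evict the entry with the smallest count

print(sorted(heap, reverse=True))  # [(11, 'green'), (7, 'blue')]
```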