Efficient Dictionary Searching?
You can also simply use
if key in data_dict:
instead of
if key in data_dict.keys():
As mentioned, the first is a direct hash lookup: the key's hash is used to compute the table slot directly, which is then checked, so it is roughly O(1). The check against keys() is a linear search, which is O(n).
In [258]: data_dict = dict([(x, x) for x in range(100000)])
In [259]: %timeit 999999 in data_dict.keys()
100 loops, best of 3: 3.47 ms per loop
In [260]: %timeit 999999 in data_dict
10000000 loops, best of 3: 49.3 ns per loop
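(A side note, not part of the timing session above: in Python 3, keys() returns a lightweight dict_keys view rather than a list, and membership tests on that view are also hash-based, so the dramatic gap shown here is specific to Python 2. key in data_dict is still the idiomatic form, since it avoids creating the view object at all.) A minimal Python 3 sketch:
# Python 3 sketch (not from the original timings): keys() returns a view
# whose membership test is also a hash lookup.
data_dict = {x: x for x in range(100000)}
print(999999 in data_dict)          # hash lookup on the dict itself
print(999999 in data_dict.keys())   # hash lookup on the dict_keys view
print(type(data_dict.keys()))       # <class 'dict_keys'>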
data_dict.keys() returns an unsorted list of keys in the dictionary. Thus each time you check to see if a given key is in the dictionary, you're doing a linear search across the list of keys (an O(n) operation). The longer your list, the longer it takes to search for a given key.
Contrast this to data_dict[keyStr]. This performs a hash lookup, which is an O(1) operation. It doesn't (directly) depend on the number of keys in the dictionary; even as you add more keys, the time to check if a given key is in the dictionary stays constant.
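To make the scaling concrete, here is a rough sketch (runnable on Python 3, with list(d) standing in for Python 2's list-returning keys()); the dict membership test stays roughly flat while the list search grows with the number of keys:
import timeit

# -1 is never a key, so both tests are worst-case misses.
for n in (10**3, 10**4, 10**5):
    d = {x: x for x in range(n)}
    t_dict = timeit.timeit(lambda: -1 in d, number=1000)        # hash lookup
    t_list = timeit.timeit(lambda: -1 in list(d), number=1000)  # build a list, then scan it
    print(n, round(t_dict, 4), round(t_list, 4))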
As several others have noted, the problem lies in the fact that key in data_dict.keys() uses the unordered list returned by the keys() method (in Python 2.x), which takes linear time, O(n), to search through. That means the running time increases linearly with the dictionary's size, and generating the list of keys itself also takes longer and longer as the size goes up.
On the other hand, key in data_dict only requires constant time, O(1), on average, regardless of the size of the dictionary, because internally it does a hash-table look-up. In addition, this hash table already exists, since it's part of the internal representation of dictionaries, and therefore doesn't have to be generated before being used.
Python doesn't do this automatically because the in operator only knows the type of its two operands, not their sources, so it can't automatically optimize the first case, where all it sees is the key and a list.
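A small sketch of that point: in simply calls the right-hand operand's __contains__ method, so the cost is determined entirely by the container type it is handed:
# `in` dispatches to the container's __contains__; the dict and a list built
# from its keys behave very differently even though they hold the same keys.
data_dict = {x: x for x in range(10)}
print(data_dict.__contains__(5))        # hash-based membership test, O(1)
print(list(data_dict).__contains__(5))  # list scan, O(n)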
However, in this case the issue of search speed can probably be avoided altogether by storing the data in a specialized version of a dictionary called a defaultdict, found in the built-in collections module. Here's how your code might look if you used one:
from collections import defaultdict
input_data = defaultdict(float) # (guessing factory type)
...
data_dict[keyStr] = input_data[keyStr] + load_val
When there's no preexisting entry for input_data[keyStr], one will be generated automatically with the default value (0.0 for float in this example). As you can see, the code is shorter and very likely faster, all without the need for any if tests or exception handling.
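As a self-contained illustration (the readings pairs below are made up, not from the original question), the same accumulation pattern with a defaultdict needs no membership test at all:
from collections import defaultdict

# Hypothetical input pairs standing in for whatever the original loop reads.
readings = [("a", 1.5), ("b", 2.0), ("a", 0.5)]

totals = defaultdict(float)          # missing keys start at 0.0
for keyStr, load_val in readings:
    totals[keyStr] += load_val       # no `if keyStr in totals` check needed

print(dict(totals))                  # {'a': 2.0, 'b': 2.0}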
The problem is that for every test you're generating a new list of keys with .keys(). As the list of keys gets longer, the time required goes up. Also, as noted by dckrooney, the search for the key becomes linear instead of taking advantage of the hash-table structure of the dictionary.
Replace with:
if key in data_dict:
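A minimal sketch of that replacement (keyStr and load_val are hypothetical stand-ins for whatever the surrounding loop provides):
data_dict = {}
keyStr, load_val = "sensor_1", 3.5   # hypothetical values for illustration

# instead of: if keyStr in data_dict.keys():
if keyStr in data_dict:              # direct hash-table membership test
    data_dict[keyStr] = data_dict[keyStr] + load_val
else:
    data_dict[keyStr] = load_val

print(data_dict)                     # {'sensor_1': 3.5}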