list comprehension to build a nested dictionary from a list of tuples
Two Tuple Keys
I would suggest abandoning the idea of nesting dictionaries and simply use two tuples as the keys directly. Like so:
d = { (user_id, analysis_type_id): count for count, analysis_type_id, user_id in counts}
The dictionary is a hash table. In python, each two tuple has a single hash value (not two hash values) and thus each two tuple is looked up based on its (relatively) unique hash. Therefore this is faster (2x faster, most of the time) than looking up the hash of TWO separate keys (first the user_id
, then the analysis_type_id
).
However, beware of premature optimization. Unless you're doing millions of lookups, the increase in performance of the flat dict
is unlikely to matter. The real reason to favor the use of the two tuple here is that the syntax and readability of a two tuple solution is far superior than other solutions- that is, assuming the vast majority of the time you will be wanting to access items based on a pair of values and not groups of items based on a single value.
Consider Using a namedtuple
It may be convenient to create a named tuple for storing those keys. Do that this way:
from collections import namedtuple
IdPair = namedtuple("IdPair", "user_id, analysis_type_id")
Then use it in your dictionary comprehension:
d = { IdPair(user_id, analysis_type_id): count for count, analysis_type_id, user_id in counts}
And access a count you're interested in like this:
somepair = IdPair(user_id = 4, analysis_type_id = 1)
d[somepair]
The reason this is sometimes useful is you can do things like this:
user_id = somepair.user_id # very nice syntax
Some Other Useful Options
One downside of the above solution is the case in which your lookup fails. In that case, you will only get a traceback like the following:
>>> d[IdPair(0,0)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: IdPair(user_id=0, analysis_type_id=0)
This isn't very helpful; was it the user_id
that was unmatched, or the analysis_type_id
, or both?
You can create a better tool for yourself by creating your own dict
type that gives you a nice traceback with more information. It might look something like this:
class CountsDict(dict):
"""A dict for storing IdPair keys and count values as integers.
Provides more detailed traceback information than a regular dict.
"""
def __getitem__(self, k):
try:
return super().__getitem__(k)
except KeyError as exc:
raise self._handle_bad_key(k, exc) from exc
def _handle_bad_key(self, k, exc):
"""Provides a custom exception when a bad key is given."""
try:
user_id, analysis_type_id = k
except:
return exc
has_u_id = next((True for u_id, _ in self if u_id==user_id), False)
has_at_id = next((True for _, at_id in self if at_id==analysis_type_id), False)
exc_lookup = {(False, False):KeyError(f"CountsDict missing pair: {k}"),
(True, False):KeyError(f"CountsDict missing analysis_type_id: "
f"{analysis_type_id}"),
(False, True):KeyError(f"CountsDict missing user_id: {user_id}")}
return exc_lookup[(user_id, analysis_type_id)]
Use it just like a regular dict
.
However, it may make MORE sense to simply add new pairs to your dict
(with a count of zero) when you try to access a missing pair. If this is the case, I'd use a defaultdict
and have it set the count to zero (using the default value of int
as the factory function) when a missing key is accessed. Like so:
from collections import defaultdict
my_dict = defaultdict(default_factory=int,
((user_id, analysis_type_id), count) for count, analysis_type_id, user_id in counts))
Now if you attempt to access a key that is missing, the count will be set to zero. However, one problem with this method is that ALL keys will be set to zero:
value = my_dict['I'm not a two tuple, sucka!!!!'] # <-- will be added to my_dict
To prevent this, we go back to the idea of making a CountsDict
, except in this case, your special dict
will be a subclass of defaultdict
. However, unlike a regular defaultdict
, it will check to make sure the key is a valid kind before it is added. And as a bonus, we can make sure ANY two tuple that is added as a key becomes an IdPair
.
from collections import defaultdict
class CountsDict(defaultdict):
"""A dict for storing IdPair keys and count values as integers.
Missing two-tuple keys are converted to an IdPair. Invalid keys raise a KeyError.
"""
def __getitem__(self, k):
try:
user_id, analysis_type_id = k
except:
raise KeyError(f"The provided key {k!r} is not a valid key.")
else:
# convert two tuple to an IdPair if it was not already
k = IdPair(user_id, analysis_type_id)
return super().__getitem__(k)
Use it just like the regular defaultdict
:
my_dict = CountsDict(default_factory=int,
((user_id, analysis_type_id), count) for count, analysis_type_id, user_id in counts))
NOTE: In the above I have not made it so that two tuple keys are converted to IdPair
s upon instance creation (because __setitem__
is not utilized during instance creation). To create this functionality, we would also need to implement an override of the __init__
method.
Wrap Up
Out of all of these, the more useful option depends entirely on your use case.
The most readable solution utilizes a defaultdict
which saves you nested loops and bumpy checking if keys already exist:
from collections import defaultdict
dct = defaultdict(dict) # do not shadow the built-in 'dict'
for x, y, z in counts:
dct[z][y] = x
dct
# defaultdict(dict, {4: {1: 4, 5: 3, 10: 2}, 5: {10: 2}})
If you really want a one-liner comprehension you can use itertools.groupby
and this clunkiness:
from itertools import groupby
dct = {k: {y: x for x, y, _ in g} for k, g in groupby(sorted(counts, key=lambda c: c[2]), key=lambda c: c[2])}
If your initial data is already sorted by user_id, you can save yourself the sorting.