pandas: Combining Multiple Categories into One
I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing to read for some (although it is a fairly standard approach in Python).
If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.
import numpy as np
number = [ df.numbers.isin([3,4,5]),
df.numbers.isin([1,6,7]),
df.numbers.isin([2,8,9,10]),
df.numbers.isin([11]) ]
color = [ "red", "green", "blue", "purple" ]
df.numbers = np.select( number, color )
Output (note this is a string or object column, but of course you can easily convert it to a category with astype('category')):
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
It's basically the same thing, but you could also do this with np.where:
df['numbers2'] = ''
df.numbers2 = np.where( df.numbers.isin([3,4,5]), "red", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([1,6,7]), "green", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([2,8,9,10]), "blue", df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([11]), "purple", df.numbers2 )
That's not going to be as efficient as np.select, which is probably the most efficient way to do this (although I didn't time it), but it is arguably more readable in that you can put each key/value pair on the same line.
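One detail worth knowing about np.select: rows that match none of the conditions get its default value, which is 0 unless you pass one. Here is a minimal sketch, assuming a small example frame with one extra value (12) that no condition covers; the "unknown" label and the colors column name are just placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Example data; 12 is deliberately not covered by any condition
df = pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]})

conditions = [
    df["numbers"].isin([3, 4, 5]),
    df["numbers"].isin([1, 6, 7]),
    df["numbers"].isin([2, 8, 9, 10]),
    df["numbers"].isin([11]),
]
choices = ["red", "green", "blue", "purple"]

# A string default keeps the result all-strings for unmatched rows
df["colors"] = np.select(conditions, choices, default="unknown")
print(df)
```

Writing to a new column rather than overwriting df.numbers also makes it easy to eyeball the mapping.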
OK, this is slightly simpler, and hopefully it will stimulate further conversation.
OP's example input:
>>> import pandas as pd
>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])
This yields ValueError: Categorical categories must be unique, as OP states.
My solution:
# write out a dict with the mapping of old to new
>>> remap_cat_dict = {
1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue' }
>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]
This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then pretty straightforward: Series.map looks up each value of the column in remap_cat_dict and substitutes the corresponding result. Then convert the result to category and overwrite the column.
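One thing to watch with a hand-written dict: Series.map returns NaN for any value that has no key in the dict, so an incomplete mapping fails silently. A minimal sketch of checking for that, assuming a plain integer Series and a deliberately incomplete dict (both made up for illustration):

```python
import pandas as pd

# Deliberately incomplete mapping: there is no entry for 4
remap = {1: "green", 2: "blue", 3: "red"}
s = pd.Series([1, 2, 3, 4])

mapped = s.map(remap)

# Values that had no key in the dict come back as NaN
missing = s[mapped.isna()].unique()
print(mapped)
print("unmapped values:", list(missing))
```

Running a check like this before calling astype('category') makes it obvious when the dict needs another entry.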
I encountered almost this exact problem, where I wanted to create a new column with fewer categories converted from an old column. That works just as easily here (and has the benefit of not overwriting an existing column):
>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
numbers colors
0 1 green
1 2 blue
2 3 red
3 4 red
4 5 red
5 6 green
6 7 green
7 8 blue
8 9 blue
9 10 blue
>>> df.colors
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]
EDIT 5/2/20: Further simplified by replacing df.numbers.apply(lambda x: remap_cat_dict[x]) with df.numbers.map(remap_cat_dict) (thanks @JohnE)
Not sure about elegance, but if you make a dict of the old to new categories, something like (note the added 'purple'):
>>> m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10], "purple": [11]}
>>> m2 = {v: k for k,vv in m.items() for v in vv}
>>> m2
{1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red', 6: 'green',
7: 'green', 8: 'blue', 9: 'blue', 10: 'blue', 11: 'purple'}
You can use this to build a new categorical Series:
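>>> # newer pandas versions no longer accept astype("category", categories=...); use a CategoricalDtype
>>> df.cat.map(m2).astype(pd.CategoricalDtype(categories=list(set(m2.values()))))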
0 green
1 blue
2 red
3 red
4 red
5 green
6 green
7 blue
8 blue
9 blue
Name: cat, dtype: category
Categories (4, object): [green, purple, red, blue]
You don't need the explicit categories (built from set(m2.values()), or an ordered equivalent if you care about the categorical ordering) if you're sure that all categorical values will be seen in the column. But here, if we didn't pass them, we wouldn't have seen purple in the resulting Categorical, because it would be built only from the categories it actually saw.
Of course, if you already have your list ['green','blue','red', etc.] built, it's equally easy just to use it to make a new categorical column directly and bypass this mapping entirely.
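For reference, here is a self-contained sketch of the same idea using pd.CategoricalDtype, which is how current pandas accepts an explicit category list. It assumes the column is named cat and holds 1 through 10, as in the output above; sorting the categories is my own choice to make the order deterministic.

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10], "purple": [11]}
# Invert the mapping: old value -> new category
m2 = {v: k for k, vv in m.items() for v in vv}

# Assumed example data: 11 never appears, so "purple" is never seen
df = pd.DataFrame({"cat": range(1, 11)})

# Declare every category up front so unseen ones are kept
dtype = pd.CategoricalDtype(categories=sorted(set(m2.values())))
result = df["cat"].map(m2).astype(dtype)
print(result.cat.categories)
```

The resulting Series has four categories even though only three of them occur in the data.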
It seems Series.explode, released with pandas 0.25.0 (July 18, 2019), would fit right in here and avoid any explicit looping:
# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
In [151]: pd.Series(m).explode().sort_values()
Out[151]:
green 1
blue 2
red 3
red 4
red 5
green 6
green 7
blue 8
blue 9
blue 10
dtype: object
So, the result is a pandas Series that holds all the required mappings, with the new categories as the index and the old values as the values. Depending on requirements, we might use it directly or, if needed, in a different format such as a dict or a Series with index and values swapped. Let's explore those too.
# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()
1) Output as dict :
In [153]: dict(zip(s.values, s.index))
Out[153]:
{1: 'green',
2: 'blue',
3: 'red',
4: 'red',
5: 'red',
6: 'green',
7: 'green',
8: 'blue',
9: 'blue',
10: 'blue'}
2) Output as series :
In [154]: pd.Series(s.index, s.values)
Out[154]:
1 green
2 blue
3 red
4 red
5 red
6 green
7 green
8 blue
9 blue
10 blue
dtype: object
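To close the loop, the swapped Series can be passed straight to Series.map to recode the original column. A minimal sketch, assuming the OP's numbers column; the astype(int) cast and the lookup and colors names are my additions, the cast being there so the lookup index has the same integer dtype as the column.

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# explode() yields an object Series; cast so the values are plain integers
s = pd.Series(m).explode().astype(int)

# Swap index and values: old number -> new category
lookup = pd.Series(s.index, index=s.to_numpy())

df = pd.DataFrame({"numbers": range(1, 11)})
df["colors"] = df["numbers"].map(lookup).astype("category")
print(df)
```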