Dummy variables when not all categories are present

I did ask this on the pandas github. Turns out it is really easy to get around it when you define the column as a Categorical where you define all the possible categories.

df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])

get_dummies() will do the rest then as expected.

Try this:

In[1]: import pandas as pd
       cats = ["a", "b", "c"]

In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})

In[3]: pd.concat((pd.get_dummies(df.cat, columns=cats), pd.DataFrame(columns=cats))).fillna(0)
Out[3]: 
     a    b    c
0  1.0  0.0  0
1  0.0  1.0  0
2  1.0  0.0  0

TL;DR:

pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))

Older pandas: pd.get_dummies(cat.astype('category', categories=categories))

is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:

In [1]: import pandas as pd

In [2]: possible_categories = list('abc')

In [3]: dtype = pd.CategoricalDtype(categories=possible_categories)

In [4]: cat = pd.Series(list('aba'), dtype=dtype)
In [5]: cat
Out[5]: 
0    a
1    b
2    a
dtype: category
Categories (3, object): [a, b, c]

Then, get_dummies will do exactly what you want!

In [6]: pd.get_dummies(cat)
Out[6]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0

There are a bunch of other ways to create a categorical Series or DataFrame, this is just the one I find most convenient. You can read about all of them in the pandas documentation.

EDIT:

I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).

For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.

It looks like pandas 0.21.0 added a CategoricalDType, and creating categoricals which explicitly include the categories as in the original answer was deprecated, I'm not quite sure when.

Using transpose and reindex

import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)

print dummies

    a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0

Dummy variables when not all categories are present

Tags:

Python

Pandas

Machine Learning

Dummy Variable

Related

Recent Posts