How can I get pandas get_dummies to emit N-1 variables to avoid collinearity?

There are a number of ways of doing so.

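Possibly the simplest is to replace one of the values with None before calling get_dummies. Missing values get no dummy column by default (dummy_na=False), so the replaced category is left out. Say you have: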

import pandas as pd
import numpy as np
s = pd.Series(list('babca'))
>> s
0    b
1    a
2    b
3    c
4    a

Then use:

>> pd.get_dummies(np.where(s == s.unique()[0], None, s))
    a   c
0   0   0
1   1   0
2   0   0
3   0   1
4   1   0

to drop b.

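(Of course, this only works if your category column does not already contain None or NaN. Rows with existing missing values would also come out all zeros, indistinguishable from the dropped category.)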
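A minimal sketch of that caveat as a guard, assuming the same example Series s as above. The variable names baseline and dummies are only illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series(list("babca"))

# The trick uses None to mark the dropped category, so it is only safe
# when the column has no missing values to begin with.
if s.isna().any():
    raise ValueError("column already contains missing values")

# The first value in order of appearance ('b' here) becomes the baseline.
baseline = s.unique()[0]
dummies = pd.get_dummies(np.where(s == baseline, None, s))

print(dummies.columns.tolist())
```

Any other category could serve as the baseline; picking the first unique value just mirrors the example above.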


Another way is to use the prefix argument to get_dummies:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False)

prefix: string, list of strings, or dict of strings, default None. String to prepend to the DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

This prepends the prefix to each of the resulting column names, so you can then drop one of the columns carrying that prefix (just make sure the prefix is unique).
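As a rough sketch of the prefix approach, assuming a hypothetical DataFrame with a single categorical column named color (the column name and the choice of baseline are illustrative, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"color": list("babca")})

# With prefix="color" the dummy columns are named color_a, color_b, color_c.
dummies = pd.get_dummies(df["color"], prefix="color")

# Drop one level to leave N-1 columns; here the first value seen in the data.
baseline = "color_" + df["color"].iloc[0]
reduced = dummies.drop(columns=baseline)

print(reduced.columns.tolist())
```

Because the dropped column is selected by its full prefixed name, the same pattern can be repeated per categorical column without name clashes, as long as each prefix is unique.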


pandas 0.18.0 added exactly what you're looking for: the drop_first option, which drops the first level of each encoded variable. Here's an example:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: u'0.18.1'

In [3]: s = pd.Series(list('abcbacb'))

In [4]: pd.get_dummies(s, drop_first=True)
Out[4]: 
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  1.0  0.0
4  0.0  0.0
5  0.0  1.0
6  1.0  0.0
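drop_first also works when encoding selected columns of a DataFrame. A small sketch, assuming a made-up frame with one categorical column letter and one numeric column value:

```python
import pandas as pd

df = pd.DataFrame({"letter": list("abcbacb"), "value": range(7)})

# Only "letter" is encoded; its first level ('a') is dropped,
# leaving letter_b and letter_c next to the untouched "value" column.
encoded = pd.get_dummies(df, columns=["letter"], drop_first=True)

print(encoded.columns.tolist())
```

Note that the dtype of the dummy columns depends on the pandas version: the output above shows floats under 0.18.1, while recent releases return booleans by default, so pass dtype explicitly if you need a particular numeric type.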