Closest equivalent of a factor variable in Python Pandas
If you're looking to map a categorical variable to integer codes as R does, pandas has a function that does just that, pd.factorize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html
import pandas as pd
df = pd.read_csv('path_to_your_file')
df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)
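This function returns a tuple: an array of integer codes (one per row) and the unique values those codes refer to. If you only want the codes for a column assignment, throw the second element away as above. Note that by default missing values are coded as -1 rather than given a category of their own.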
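To make the two return values concrete, here is a minimal sketch on a toy column. The data and the variable names (colors, codes, uniques, decoded) are made up for illustration and stand in for df['old_categorical'].

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['old_categorical']; the values are made up.
colors = pd.Series(["red", "blue", np.nan, "red", "green"])

# sort=True assigns the codes in sorted order of the unique values.
codes, uniques = pd.factorize(colors, sort=True)

print(codes)    # one integer code per row; the missing value is coded as -1
print(uniques)  # the unique non-missing values, in code order

# Codes can be mapped back to the original labels through `uniques`.
decoded = [uniques[c] if c >= 0 else None for c in codes]
print(decoded)
```

Keeping uniques around instead of discarding it is what lets you translate the codes back into labels later.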
If you want a homegrown solution, you can use a combination of a set and a dictionary within a function. This method is a bit easier to apply over multiple columns, but note that None, NaN, etc. will be included as a category in this method, and that the codes come out in arbitrary order because a set is unordered:
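def factor(var):
    # Map each distinct value in the column to an integer code
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]
# factor() expects a whole column, so call it directly on a single Series
# (Series.apply would pass it one value at a time)
df['new_factor1'] = factor(df['old_categorical1'])
# DataFrame.apply passes each column to factor() in turn
df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)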
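As a quick check of the set-and-dictionary approach, here is a self-contained sketch on a made-up DataFrame. The frame toy and its column names are hypothetical, and wrapping the result in a pd.Series is my own choice to keep the index aligned, not part of the original function.

```python
import pandas as pd

def factor(var):
    # One integer code per distinct value; the order of the codes is arbitrary.
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]

# Hypothetical data; the column names are made up for the example.
toy = pd.DataFrame({
    "size": ["S", "M", "L", "S"],
    "color": ["red", "red", "blue", "green"],
})

# Encode every column; the Series wrapper keeps the original row index.
encoded = toy.apply(lambda col: pd.Series(factor(col), index=col.index))
print(encoded)
```

Equal values always share a code within a column, but which value gets which code can change between runs, so don't rely on specific numbers.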
This question seems to be from a year back, but since it is still open, here's an update. pandas has introduced a categorical dtype, and it operates very similarly to factors in R. Please see this link for more information:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
Reproducing a snippet from the link above showing how to create a "factor" variable in pandas.
In [1]: s = pd.Series(["a","b","c","a"], dtype="category")
In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]
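For anyone coming from R, the pieces that correspond to levels and integer codes live under the .cat accessor. A minimal sketch, with made-up data and variable names (s, cat, level_type, ordered):

```python
import pandas as pd

s = pd.Series(["low", "high", "medium", "low"])

# Unordered categorical: categories are inferred from the data, in sorted order.
cat = s.astype("category")
print(list(cat.cat.categories))  # the category labels, roughly levels() in R
print(list(cat.cat.codes))       # the integer code behind each value

# Ordered categorical with an explicit level order.
level_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ordered = s.astype(level_type)
print(list(ordered.cat.codes))
print((ordered < "high").tolist())  # comparisons respect the declared order
```

Declaring the order explicitly matters when the natural order of the labels is not alphabetical, as with low/medium/high here.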
If you're looking to do modeling etc., the patsy library has lots of goodies for working with factors in model formulas. I will admit to having struggled with this myself. I found these slides helpful. Wish I could give a better example, but this is as far as I've gotten myself.
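For what it's worth, here is a small sketch of how a factor can be declared in a patsy formula, assuming patsy is installed. The data and the variable name color are made up; a plain dict is used to keep the example small, and patsy also accepts a pandas DataFrame as the data argument.

```python
from patsy import dmatrix

# Made-up data: one categorical variable.
data = {"color": ["red", "blue", "green", "blue"]}

# C(...) marks a term as categorical. With the default treatment coding,
# patsy builds an intercept plus one dummy column per non-reference level.
design = dmatrix("C(color)", data)

print(design.design_info.column_names)
print(design)
```

The first level in sorted order (blue here) becomes the reference level, so it gets no column of its own.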