multi-column factorize in pandas
You need to create an ndarray of tuples first. pandas.lib.fast_zip
can do this very quickly in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
# fast_zip builds an object ndarray of (x, y) tuples; factorize then codes each distinct tuple
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
The output is:
[0 1 2 2 1 0]
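pandas.lib is not available in recent pandas releases, so the snippet above only runs on older versions. As a rough sketch of one alternative (a different technique, not the fast_zip approach itself), grouping on both columns with sort=False and calling ngroup() numbers each distinct (x, y) pair in order of first appearance. This assumes a pandas version that provides GroupBy.ngroup and Series.to_numpy:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers the groups in order of first appearance,
# which matches the ordering pd.factorize uses for its codes.
codes = df.groupby(['x', 'y'], sort=False).ngroup().to_numpy()
print(codes)
```

For this frame it should give the same codes as above, [0 1 2 2 1 0].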
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # holds the unique rows of the dataframe, in order of first appearance
for i in range(len(df)):  # loop by position, since iloc indexes by position
    row = list(df.iloc[i])
    if row not in arr:
        arr.append(row)
Printing arr would give you:
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind list:
ind = []
for i in range(len(df)):
    # the position of this row in arr is its code
    ind.append(arr.index(list(df.iloc[i])))
Printing ind would give:
>>> print(ind)
[0, 1, 2, 2, 1, 0]
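On the efficiency question: both `in arr` and `arr.index(...)` scan the list, so the loops above slow down as the number of unique rows grows. A minimal sketch of a single-pass variant follows, using a dict from row tuple to code. The helper name factorize_rows is hypothetical, not a pandas function, and the sketch assumes Python 3.7+ so that the dict keeps insertion order:

```python
import pandas as pd

def factorize_rows(frame):
    """Return (codes, uniques) for the rows of frame, in order of first appearance.

    Hypothetical helper for illustration; not part of pandas.
    """
    seen = {}   # maps each row tuple to its integer code
    codes = []
    for row in frame.itertuples(index=False, name=None):
        # setdefault stores the next free code the first time a row is seen
        # and returns the stored code on every later occurrence.
        codes.append(seen.setdefault(row, len(seen)))
    return codes, list(seen)

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
ind, arr = factorize_rows(df)
print(ind)
print(arr)
```

Here ind plays the same role as the ind list above, and arr holds the unique rows as tuples rather than lists, because tuples can be used as dict keys.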