Scikit-learn balanced subsampling
I found the best solutions here
And this is the one I think it's the simplest.
dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
then you can use X_rus, y_rus data
For versions 0.4<=:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus= rus.fit_sample(X, y)
Then, indices of the samples randomly selected can be reached by sample_indices_
attribute.
There now exists a full-blown python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn
Here is my first version that seems to be working fine, feel free to copy or make suggestions on how it could be more efficient (I have quite a long experience with programming in general but not that long with python or numpy)
This function creates single random balanced subsample.
edit: The subsample size now samples down minority classes, this should probably be changed.
def balanced_subsample(x,y,subsample_size=1.0):
class_xs = []
min_elems = None
for yi in np.unique(y):
elems = x[(y == yi)]
class_xs.append((yi, elems))
if min_elems == None or elems.shape[0] < min_elems:
min_elems = elems.shape[0]
use_elems = min_elems
if subsample_size < 1:
use_elems = int(min_elems*subsample_size)
xs = []
ys = []
for ci,this_xs in class_xs:
if len(this_xs) > use_elems:
np.random.shuffle(this_xs)
x_ = this_xs[:use_elems]
y_ = np.empty(use_elems)
y_.fill(ci)
xs.append(x_)
ys.append(y_)
xs = np.concatenate(xs)
ys = np.concatenate(ys)
return xs,ys
For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:
Replace the
np.random.shuffle
line withthis_xs = this_xs.reindex(np.random.permutation(this_xs.index))
Replace the
np.concatenate
lines withxs = pd.concat(xs) ys = pd.Series(data=np.concatenate(ys),name='target')