Unbalanced classification using RandomForestClassifier in sklearn
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
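For example, here is a minimal sketch of passing such weights to fit (the data shapes and weight values are made up for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(100, 4)               # 100 observations, 4 features
    y = np.array([0] * 20 + [1] * 80)        # imbalanced labels
    weights = np.where(y == 0, 4.0, 1.0)     # up-weight the minority class
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y, sample_weight=weights)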
In older versions there was a preprocessing.balance_weights
method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
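Just for illustration, here is a rough sketch of what such a balancing helper computes: weights inversely proportional to class frequencies, so that each class contributes the same total weight. This is my own re-implementation, not the deprecated function itself; recent versions also ship sklearn.utils.class_weight.compute_sample_weight('balanced', y) for the same purpose.

    import numpy as np

    def balanced_sample_weights(y):
        # Inverse class-frequency weights: each class ends up with the same
        # total weight, roughly what preprocessing.balance_weights produced.
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        per_class = len(y) / (len(classes) * counts.astype(float))
        lookup = dict(zip(classes, per_class))
        return np.array([lookup[label] for label in y])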
Update
Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the sample_weight 1-d array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could use a simple

    sample_weight = np.array([5 if i == 0 else 1 for i in y])

assigning a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the link above for a slightly craftier balance_weights weight evaluation function.
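A quick sanity check of that weighting (the counts here are assumed for illustration: 10 samples of class 0 vs. 50 of class 1):

    import numpy as np

    y = np.array([0] * 10 + [1] * 50)                  # class 1 occurs 5x as often as class 0
    sample_weight = np.array([5 if i == 0 else 1 for i in y])
    print(sample_weight[y == 0].sum())                 # 50 -> total weight of class 0
    print(sample_weight[y == 1].sum())                 # 50 -> total weight of class 1, now equal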
It is really a shame that sklearn's fit method does not allow specifying a performance measure to be optimized. No one around seems to understand, question, or be interested in what actually goes on when one calls the fit method on a data sample when solving a classification task.
We (users of the scikit-learn package) are silently left with the suggestion to indirectly use cross-validated grid search with a specific scoring method suitable for unbalanced datasets, in the hope of stumbling upon a parameter/metaparameter set which produces an appropriate AUC or F1 score.
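For completeness, this is the kind of workaround being described: a cross-validated grid search scored with F1 instead of accuracy. The data and parameter grid are made up for illustration, and in versions before 0.18 the import is sklearn.grid_search rather than sklearn.model_selection.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older versions

    X = np.random.rand(200, 4)
    y = np.array([0] * 40 + [1] * 160)                 # imbalanced labels

    param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
    search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='f1', cv=5)
    search.fit(X, y)                                   # candidates are ranked by F1, not accuracy
    print(search.best_params_, search.best_score_)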
But think about it: it looks like the fit method called under the hood always optimizes accuracy. So in effect, if we aim to maximize the F1 score, GridSearchCV gives us the "model with the best F1 among all models with the best accuracy". Isn't that silly? Wouldn't it be better to directly optimize the model's parameters for the maximal F1 score? Remember the good old Matlab ANNs package, where you can set the desired performance metric to RMSE, MAE, or whatever you want, given that the gradient-calculating algorithm is defined. Why is the choice of performance metric silently omitted from sklearn?
At least, why is there no simple option to assign class instance weights automatically to remedy unbalanced dataset issues? Why do we have to calculate the weights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome, if not the best source of information on the topic. No, really? Why is the unbalanced datasets problem (which is obviously of utter importance to data scientists) not even covered anywhere in the docs then? I address these questions to the contributors of sklearn, should they read this. Anyone who knows the reasons is also welcome to comment and clear things up.
UPDATE
Since scikit-learn 0.17, there is a class_weight='balanced' option which you can pass to at least some classifiers:
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
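A minimal sketch of using that option, together with what the quoted formula computes (the data here is made up for illustration):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(120, 4)
    y = np.array([0] * 20 + [1] * 100)                 # imbalanced labels

    # The weights the 'balanced' mode derives internally:
    # n_samples / (n_classes * np.bincount(y))
    print(len(y) / (2 * np.bincount(y)))               # -> [3.0, 0.6]

    clf = RandomForestClassifier(n_estimators=100, class_weight='balanced')
    clf.fit(X, y)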