How to penalize False Negatives more than False Positives
As @Maxim mentioned, there are two stages at which you can apply this kind of tuning: the model-training stage (e.g. custom class weights) and the prediction stage (e.g. lowering the decision threshold).
Another option for the model-training stage is to use a recall scorer. You can use it in your grid-search cross-validation (GridSearchCV) to tune your classifier's hyper-parameters towards high recall.
The GridSearchCV scoring parameter accepts either the 'recall' string or the recall_score function.
Since you're doing binary classification, both options work out of the box and call recall_score with its default values, which suit a binary classification:
- average: 'binary' (i.e. one simple recall value)
- pos_label: 1 (like numpy's True value)
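For instance, here is a minimal sketch of tuning directly for recall with the built-in 'recall' scorer (the estimator, parameter grid, and training-data names below are placeholders, not from the original question):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical estimator and hyper-parameter grid; replace with your own.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="recall",  # built-in scorer: recall of the positive class (label 1)
    cv=5,
)
grid.fit(X_train, y_train)  # X_train / y_train assumed to be defined elsewhere
print(grid.best_params_, grid.best_score_)
```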
Should you need to customize it, you can wrap an existing metric, or a custom one, with make_scorer and pass it to the scoring parameter.
For example:
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Recall of the positive class when the labels are strings like 'yes'/'no'
# instead of the default 1/0.
recall_custom_scorer = make_scorer(
    lambda y, y_pred, **kwargs: recall_score(y, y_pred, pos_label='yes', **kwargs)
)

GridSearchCV(estimator=est, param_grid=param_grid, scoring=recall_custom_scorer, ...)
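As a side note (this simpler spelling is my assumption, not part of the original answer): make_scorer forwards extra keyword arguments to the metric function, so the same custom scorer can be written without a lambda:

```python
from sklearn.metrics import recall_score, make_scorer

# Equivalent scorer: pos_label is forwarded to recall_score on every call.
recall_yes_scorer = make_scorer(recall_score, pos_label='yes')
```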
There are several options for you:
As suggested in the comments, class_weight should boost the loss function towards the preferred class. This option is supported by various estimators, including sklearn.linear_model.LogisticRegression, sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
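For instance, a minimal sketch of passing strongly asymmetric class weights (the estimator choice, the 1:100 ratio, and the data names are illustrative assumptions):

```python
from sklearn.linear_model import LogisticRegression

# Errors on the positive class (label 1) cost 100 times more than errors
# on the negative class; raise the ratio further if recall is still too low.
clf = LogisticRegression(class_weight={0: 1, 1: 100}, max_iter=1000)
clf.fit(x_train, y_train)  # x_train / y_train assumed to be defined elsewhere
```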
You can also select the decision threshold very low during the cross-validation to pick the model that gives the highest recall (though possibly low precision). A recall close to 1.0 effectively means false_negatives close to 0.0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(classifier, x_train, y_train, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
If you plot the precisions and recalls against the thresholds, you can see how both metrics vary as the threshold changes (the original answer illustrates this with a plot, not reproduced here). After picking the best threshold, you can use the raw scores from the classifier.decision_function() method for your final classification.
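A minimal sketch of how such a threshold could then be picked and applied (the 0.95 recall target, the x_test name, and the assumption that classifier has been refit on the training data are mine, not part of the original answer):

```python
# precision_recall_curve returns len(thresholds) + 1 recall values,
# so drop the last one to align recalls with thresholds.
target_recall = 0.95
candidates = thresholds[recalls[:-1] >= target_recall]

# Highest threshold that still achieves the target recall (best precision
# among them); fall back to the lowest threshold if none qualifies.
chosen_threshold = candidates.max() if candidates.size else thresholds.min()

# Final classification on new data: compare raw decision scores to the threshold.
# (classifier is assumed to have been fit on x_train / y_train by this point.)
y_pred = (classifier.decision_function(x_test) >= chosen_threshold).astype(int)
```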
Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier that always predicts the positive class (it never misses a positive, but it is useless).