scikit-learn classification on soft labels
According to the docs,
The ‘log’ loss gives logistic regression, a probabilistic classifier.
In general a loss function is of the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, prediction is a value on (0, 1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
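For reference, the 'log' loss the docs refer to is the binary cross-entropy, Loss(p, y) = -[ y·log(p) + (1 - y)·log(1 - p) ], where p is the prediction in (0, 1) and y is the 0/1 target.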
So, in answer to your question, it depends on whether you are referring to the prediction or the target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for the prediction and by the data on hand for the target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
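For example, scikit-learn's LogisticRegression exposes both forms of output (a minimal sketch; X and y_hard are stand-ins for your feature matrix and 0/1 labels):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y_hard)                       # y_hard holds 0/1 ("hard") labels
soft_preds = clf.predict_proba(X)[:, 1]  # class-1 probabilities in (0, 1), i.e. "soft" outputs
hard_preds = clf.predict(X)              # thresholded at 0.5, i.e. "hard" outputs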
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (e.g., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0, 1), and this would have to be handled.
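To make the two options concrete, here is a rough sketch (soft_y, X, and the 0.5 threshold are illustrative assumptions, not part of the original question):
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Option 1: threshold the soft labels, then fit an ordinary classifier
hard_y = (soft_y >= 0.5).astype(int)
clf = LogisticRegression().fit(X, hard_y)

# Option 2: regress on the soft targets and clip predictions back into [0, 1]
reg = LinearRegression().fit(X, soft_y)
soft_preds = np.clip(reg.predict(X), 0.0, 1.0)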
I recently had this problem and came up with a nice fix that seems to work.
Basically, transform your targets to log-odds-ratio space using the inverse sigmoid (logit) function, fit a linear regression on the transformed targets, and then, to do inference, take the sigmoid of the predictions from the linear regression model.
So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to, say, [1e-8, 1 - 1e-8] to avoid instability issues when we take logs). We take the inverse sigmoid, then we fit a linear regression (assuming the predictor variables are in matrix X):
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.clip(y, 1e-8, 1 - 1e-8)   # clamp for numerical stability
inv_sig_y = np.log(y / (1 - y))  # transform to log-odds-ratio (logit) space

lr = LinearRegression()
lr.fit(X, inv_sig_y)
Then to make predictions:
def sigmoid(x):
    ex = np.exp(x)
    return ex / (1 + ex)

preds = sigmoid(lr.predict(X_new))
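(If you'd rather not define the sigmoid yourself, scipy.special.expit computes the same logistic function.)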
This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.
Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
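For instance, swapping in RandomForestRegressor looks like this (a rough sketch, reusing the X, inv_sig_y, X_new, and sigmoid from above):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X, inv_sig_y)                   # fit on the logit-transformed targets, as before
rf_preds = sigmoid(rf.predict(X_new))  # map predictions back to (0, 1)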