scikit-learn classification on soft labels

According to the docs,

The ‘log’ loss gives logistic regression, a probabilistic classifier.

In general, a loss function has the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, prediction is a value in (0, 1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
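
For a concrete (purely illustrative) example of that form, here is the log loss computed for a few hard targets and soft predictions; the numbers are made up:

import numpy as np

target = np.array([1, 0, 1])             # hard labels (0 or 1)
prediction = np.array([0.9, 0.2, 0.6])   # soft predictions in (0, 1)

# binary cross-entropy (log loss), averaged over the samples
loss = -np.mean(target * np.log(prediction) + (1 - target) * np.log(1 - prediction))
print(loss)  # about 0.28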

So, in answer to your question, it depends on whether you are referring to the prediction or the target. Generally speaking, the form of the labels ("hard" or "soft") is determined by the chosen algorithm in the case of the prediction, and by the data on hand in the case of the target.

If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.

If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1), and this would have to be handled.


I recently had this problem and came up with a nice fix that seems to work.

Basically, transform your targets into log-odds (logit) space using the inverse of the sigmoid function, fit a linear regression on the transformed targets, and then, at inference time, take the sigmoid of the linear regression model's predictions.

So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to, say, [1e-8, 1 - 1e-8] to avoid numerical instability when we take logs).

We take the inverse sigmoid (logit) of the targets and fit a linear regression (assuming the predictor variables are in a matrix X):

import numpy as np
from sklearn.linear_model import LinearRegression

y = np.clip(y, 1e-8, 1 - 1e-8)   # clamp for numerical stability
inv_sig_y = np.log(y / (1 - y))  # inverse sigmoid (logit): log-odds space

lr = LinearRegression()
lr.fit(X, inv_sig_y)

Then to make predictions:

def sigmoid(x):
    # map log-odds predictions back to (0, 1)
    ex = np.exp(x)
    return ex / (1 + ex)

preds = sigmoid(lr.predict(X_new))  # X_new: new predictor matrix

This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.

Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
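
For instance, a sketch with a random forest (reusing the X, inv_sig_y, sigmoid, and X_new from above; the hyperparameters are arbitrary):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, inv_sig_y)                    # fit on the logit-transformed targets
preds = sigmoid(rf.predict(X_new))      # map back to (0, 1)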
