Using the predict_proba() function of RandomForestClassifier in the safe and right way
A RandomForestClassifier is a collection of DecisionTreeClassifiers. With scikit-learn's default settings each tree is grown until its leaves are pure, so no matter how big your training set is, a single tree effectively returns a decision: one class has probability 1, the other classes have probability 0. (If you limit tree growth, e.g. with max_depth or min_samples_leaf, leaves can be impure and a tree returns the class fractions in the leaf instead.)
The forest averages the outputs of its trees, which in the default case amounts to voting. predict_proba() then returns the number of votes for each class (each tree in the forest makes its own decision and chooses exactly one class), divided by the number of trees in the forest. Hence, the granularity of your probabilities is exactly 1/n_estimators. Want more "precision"? Add more estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators, and often not that many.
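A quick way to check the 1/n_estimators granularity is to fit a small forest and look at the values predict_proba() returns. The sketch below assumes scikit-learn and NumPy are installed; the synthetic dataset, the sizes, and the seeds are arbitrary choices for illustration, and it relies on the default tree settings (no depth limit), under which each tree typically puts all of its probability on one class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem; sizes and seeds are arbitrary choices for the demo.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

n_estimators = 10
forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
forest.fit(X[:200], y[:200])

# Class probabilities for the 100 held-out rows, one column per class.
proba = forest.predict_proba(X[200:])

# With fully grown trees every tree outputs 0 or 1 per class, so the averaged
# probabilities should be whole multiples of 1/n_estimators.
votes = proba * n_estimators
is_multiple = np.allclose(votes, np.round(votes))

print("distinct probabilities for class 1:", np.unique(np.round(proba[:, 1], 6)))
print("all multiples of 1/n_estimators:", is_multiple)
```

Raising n_estimators makes the steps finer; setting max_depth or min_samples_leaf should also produce intermediate values, because individual trees then stop returning pure 0/1 outputs.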
I get more than one digit in my results. Are you sure it is not due to your dataset? (For example, a very small dataset would yield simple decision trees and hence "simple" probabilities.) Otherwise it may just be the display that shows only one digit; try printing predictions[0,0].
I am not sure I understand what you mean by "the probabilities aren't affected by the size of my data". If your concern is that you don't want to predict, e.g., too many spams, what is usually done is to use a threshold t such that you predict 1 if proba(label==1) > t. This way you can use the threshold to balance your predictions, for example to limit the overall proportion of messages predicted as spam. And if you want to analyse your model globally, the usual approach is to compute the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (see the Wikipedia article on the ROC curve). Basically, the ROC curve describes your predictions as a function of the threshold t.
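To make the threshold idea concrete, here is a minimal pure-Python sketch. The labels and probabilities are made-up values standing in for the spam column of predict_proba(), and the helper names predict_with_threshold and roc_point are hypothetical, not part of any library. It applies the rule "predict 1 if proba(label==1) > t" and then sweeps t to trace the ROC curve and approximate its AUC with the trapezoidal rule.

```python
# Hypothetical labels (1 = spam) and predicted spam probabilities,
# e.g. the second column of predict_proba(); the values are made up.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
p_spam = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]


def predict_with_threshold(probs, t):
    """Predict 1 when proba(label == 1) > t, else 0."""
    return [1 if p > t else 0 for p in probs]


def roc_point(labels, probs, t):
    """Return (false positive rate, true positive rate) at threshold t."""
    preds = predict_with_threshold(probs, t)
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    return fp / labels.count(0), tp / labels.count(1)


# A higher threshold flags fewer messages as spam.
for t in (0.5, 0.75):
    preds = predict_with_threshold(p_spam, t)
    print(f"t={t}: {sum(preds)} of {len(preds)} predicted as spam")

# Sweep t from high to low; 1.0 and 0.0 give the curve's two end points.
thresholds = sorted(set(p_spam) | {0.0, 1.0}, reverse=True)
roc = [roc_point(y_true, p_spam, t) for t in thresholds]

# Trapezoidal area under the (FPR, TPR) points.
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
print("AUC:", auc)
```

On real data you would probably not hand-roll this: sklearn.metrics provides roc_curve and roc_auc_score for this kind of computation.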
Hope it helps!