Size of sample in Random Forest Regression
I agree with you that it is odd that we cannot specify the subsample/bootstrap size in RandomForestRegressor. A possible workaround is to use BaggingRegressor instead: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

RandomForestRegressor is essentially a special case of BaggingRegressor (both use bootstrap samples to reduce the variance of a set of low-bias, high-variance estimators). In RandomForestRegressor the base estimator is fixed to a decision tree, whereas in BaggingRegressor you are free to choose the base_estimator. More importantly, you can set a custom subsample size: for example, max_samples=0.5 draws random subsamples half the size of the training set. You can also train each estimator on a subset of the features by setting max_features and bootstrap_features.
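As a minimal sketch of that workaround (the data here is synthetic and the specific parameter values are illustrative, not prescriptive): BaggingRegressor with its default decision-tree base estimator, where max_samples=0.5 makes each tree train on a bootstrap subsample of half the training set.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# The default base estimator is a decision tree, so this behaves much
# like a random forest, but with a controllable subsample size.
model = BaggingRegressor(
    n_estimators=50,
    max_samples=0.5,        # each tree sees half of the training set
    max_features=0.75,      # each tree sees 75% of the features (example value)
    bootstrap_features=False,
    random_state=0,
)
model.fit(X, y)
print(model.predict(X[:3]))
```

Passing a float to max_samples/max_features means a fraction; passing an int means an absolute count of samples/features.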
The sample size for bootstrap is always the number of samples.
You are not missing anything; the same question was asked on the mailing list for RandomForestClassifier:

The bootstrap sample size is always the same as the input sample size. If you feel up to it, a pull request updating the documentation would probably be quite welcome.
In version 0.22 of scikit-learn, the max_samples option was added, which does what you asked: see the documentation of the class.
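For completeness, a short sketch of that newer option (requires scikit-learn >= 0.22; the data and parameter values are illustrative): max_samples on RandomForestRegressor itself controls the bootstrap sample size per tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

# max_samples=0.5: each tree's bootstrap sample is half the training set.
# An int would instead give an absolute sample count.
rf = RandomForestRegressor(n_estimators=50, max_samples=0.5, random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))
```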