Random Forest with bootstrap = False in scikit-learn python

According to the scikit-learn documentation [1]:

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

Note: the sub-sample size is always the same as the original input sample size.

What changes is how the samples are drawn. So:

bootstrap=True (default): samples are drawn with replacement.
bootstrap=False: samples are drawn without replacement.

[2] In sampling without replacement, each sample unit of the population has only one chance to be selected in the sample. For example, if one draws a simple random sample such that no unit occurs more than one time in the sample, the sample is drawn without replacement.

Visually you could imagine that from a bag of balls (samples), you pick M.

[Image: a bag of balls]

That constitutes your subset number 1, with M balls.

Now, if each ball is thrown back into the bag right after it is picked, the same ball can be picked again, even within one subset. That is drawing with replacement (bootstrap=True).

But if each picked ball is kept out of the bag until the subset is complete, no ball can appear twice in that subset. That is drawing without replacement (bootstrap=False). Note that in scikit-learn M is the size of the whole bag, so without replacement every subset ends up containing every ball exactly once: each tree is built on the full dataset [1].
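
To make the two sampling schemes concrete, here is a minimal sketch using only Python's standard random module. It is not scikit-learn's internal code; the ball labels and the seed are made up for illustration. It draws a sample of the same size as the bag, once with and once without replacement.

```python
import random

rng = random.Random(0)          # fixed seed, chosen arbitrarily for reproducibility
bag = list(range(10))           # hypothetical "balls": 10 observations labelled 0..9
m = len(bag)                    # sub-sample size equals the original sample size

# With replacement: each ball goes back into the bag after it is drawn,
# so the same ball can be picked more than once.
with_replacement = rng.choices(bag, k=m)

# Without replacement: a drawn ball stays out of the bag,
# so each ball appears at most once.
without_replacement = rng.sample(bag, k=m)

print("with replacement:   ", sorted(with_replacement))
print("without replacement:", sorted(without_replacement))
```

Because m equals the size of the bag, the draw without replacement always contains every ball exactly once (the full dataset in a different order), whereas the draw with replacement will usually repeat some balls and miss others.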

[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[2] http://methods.sagepub.com/Reference//encyclopedia-of-survey-research-methods/n516.xml


It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.

The benefit of random forests comes from creating a large variety of trees by sampling both observations and features. The bootstrap parameter controls whether observations are sampled with or without replacement: with bootstrap=False it still samples, just without replacement (and since the sample has the same size as the dataset, each tree then sees every observation once).

You tell it how many features to sample by setting max_features, either to a fraction of the features (a float) or to an integer count. This is something you would typically tune.
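
As a rough sketch of how these two settings look in code (the synthetic dataset, the number of trees, and the parameter values below are arbitrary choices for illustration, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,
    bootstrap=False,      # no bootstrap resampling of the observations
    max_features=0.3,     # float: fraction of features considered at each split
    random_state=0,
)
forest.fit(X, y)

# max_features also accepts an int (an absolute number of features).
forest_int = RandomForestClassifier(
    n_estimators=50, bootstrap=False, max_features=3, random_state=0
).fit(X, y)

print(len(forest.estimators_))   # number of fitted trees
print(forest.score(X, y))        # accuracy on the training data
```

Note that scikit-learn applies max_features at every split rather than once per tree, so the trees still differ from one another even when bootstrap=False.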

It's fine that you won't have every day available when building each tree; that's where the value of RF comes from. Each individual tree will be a pretty bad predictor, but when you average together the predictions from hundreds or thousands of trees you'll (probably) end up with a good model.
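
The averaging argument can be illustrated with a toy simulation. This is not a random forest: it assumes hypothetical binary "trees" that are each right 60% of the time and err independently of one another, which real trees only approximate (they are correlated, and sampling observations and features is what reduces that correlation).

```python
import random

rng = random.Random(42)      # arbitrary seed for reproducibility
P_CORRECT = 0.6              # assumed accuracy of one weak "tree"
N_TREES = 101                # odd, so a majority vote cannot tie
N_CASES = 2000               # number of simulated predictions

single_hits = 0
ensemble_hits = 0
for _ in range(N_CASES):
    # True means this tree predicted the case correctly.
    votes = [rng.random() < P_CORRECT for _ in range(N_TREES)]
    single_hits += votes[0]                       # first tree on its own
    ensemble_hits += sum(votes) > N_TREES // 2    # majority vote of all trees

single_acc = single_hits / N_CASES
ensemble_acc = ensemble_hits / N_CASES
print(f"single tree: {single_acc:.3f}, majority vote: {ensemble_acc:.3f}")
```

Under these assumptions the majority vote should come out far more accurate than any single voter, which is the intuition behind combining many weak trees.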


I don't have the reputation to comment, so I will just post my opinion here. The scikit-learn documentation says the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default). So if bootstrap=False, I think every sub-sample is exactly the same as the original input sample.