How to speed up nested cross validation in python?
Dask-ML has scalable implementations of `GridSearchCV` and `RandomizedSearchCV` that are, I believe, drop-in replacements for the Scikit-Learn versions. They were developed alongside the Scikit-Learn developers.
- https://ml.dask.org/hyper-parameter-search.html
They can be faster for two reasons:
- They avoid repeating shared work between different stages of a Pipeline
- They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
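As a minimal sketch (assuming `dask-ml` is installed; the pipeline, parameter grid, and synthetic data below are placeholders, not taken from the question), the swap is essentially just the import:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Drop-in replacement for sklearn.model_selection.GridSearchCV
from dask_ml.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Using a Pipeline lets Dask-ML cache and reuse the shared early steps
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

If you have a cluster, creating a `dask.distributed.Client` connected to it before calling `fit` should make the search run there instead of on the local threads.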
Two things:
- Instead of `GridSearch`, try using HyperOpt - it's a Python library for serial and parallel optimization (a minimal sketch follows right after this list).
- I would reduce the dimensionality by using UMAP or PCA. Probably UMAP is the better choice.
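For the first point, here is a minimal HyperOpt sketch (the `SVC` estimator, the search space, and the synthetic data are my own illustrative assumptions, not taken from the question):

```python
import numpy as np
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# Search space: log-uniform priors instead of a fixed grid (illustrative ranges)
space = {
    "C": hp.loguniform("C", np.log(1e-2), np.log(1e2)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e1)),
}

def objective(params):
    # hyperopt minimizes, so return the negative cross-validated accuracy
    return -cross_val_score(SVC(**params), X, y, cv=5, n_jobs=-1).mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```

Because TPE samples the space adaptively, it usually needs far fewer fits than an exhaustive grid, and the evaluations can also be parallelized (e.g. with `SparkTrials` or `MongoTrials`).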
For the second point, after you apply SMOTE:
```python
import umap

# min_dist, neighbours and smote_output come from your own setup
# (smote_output is the resampled feature matrix returned by SMOTE)
dim_reduced = umap.UMAP(
    min_dist=min_dist,
    n_neighbors=neighbours,
    random_state=1234,
).fit_transform(smote_output)
```
And then you can use `dim_reduced` for the train/test split.
Reducing the dimensionality will help to remove noise from the data, and instead of dealing with 25 features you'll bring them down to 2 (using UMAP) or to the number of components you choose (using PCA), which should have a significant impact on the performance.
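If you go with PCA instead, a sketch along the same lines (assuming `imbalanced-learn` for SMOTE; the synthetic data and variable names are placeholders for your own):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data; in your case this is your own 25-feature dataset
X, y = make_classification(n_samples=1000, n_features=25, weights=[0.9, 0.1], random_state=0)
smote_output, smote_labels = SMOTE(random_state=1234).fit_resample(X, y)

# n_components as a float keeps enough components to explain that share of the variance
dim_reduced = PCA(n_components=0.95, random_state=1234).fit_transform(smote_output)

X_train, X_test, y_train, y_test = train_test_split(
    dim_reduced, smote_labels, test_size=0.2, random_state=1234, stratify=smote_labels
)
```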
There is an easy win in your situation and that is ... start using parallel processing :). `dask` will help you if you have a cluster (it will work on a single machine, but the improvement compared to the default scheduling in `sklearn` is not significant), but if you plan to run it on a single machine (and have several cores/threads and "enough" memory) then you can run nested CV in parallel. The only trick is that `sklearn` will not allow you to run the outer CV loop in multiple processes. However, it will allow you to run the inner loop in multiple threads.
At the moment you have `n_jobs=None` in the outer CV loop (that's the default in `cross_val_score`), which means `n_jobs=1`, and that's the only option that you can use with `sklearn` in the nested CV.
However, you can achieve an easy gain by setting `n_jobs=some_reasonable_number` in all the `GridSearchCV` objects that you use. `some_reasonable_number` does not have to be `-1` (but it is a good starting point). Some algorithms either plateau at `n_jobs=n_cores` instead of `n_threads` (for example, `xgboost`) or already have built-in multi-processing (`RandomForestClassifier`, for example), and there might be clashes if you spawn too many processes.
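A minimal sketch of that setup (the estimator, grid, and synthetic data are placeholders; the point is simply where `n_jobs` goes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

# Inner loop: the hyper-parameter search runs its fits in parallel
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1234)
search = GridSearchCV(
    RandomForestClassifier(random_state=1234),  # estimator's own n_jobs left at the default
    param_grid,
    cv=inner_cv,
    n_jobs=4,  # "some_reasonable_number"; -1 is a good starting point
)

# Outer loop: cross_val_score keeps its default n_jobs=None (sequential)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1234)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())
```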