GridSearchCV extremely slow on small dataset in scikit-learn
As noted already, for SVM-based classifiers (with y == np.int*), preprocessing is a must, otherwise the ML-estimator's prediction capability is lost right away, due to the influence of skewed features on the decision function.
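To see why this matters in practice, here is a minimal sketch (not from the original post) that fits the same RBF-kernel SVC twice on the wine dataset, whose features span wildly different scales, once on raw features and once after StandardScaler:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# wine dataset: feature magnitudes differ by orders of magnitude (e.g. proline ~1000s)
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# un-scaled: large-magnitude features dominate the RBF decision function
acc_raw = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# scaled: every feature contributes on a comparable footing
scaler = StandardScaler().fit(X_tr)
acc_scaled = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr) \
                              .score(scaler.transform(X_te), y_te)

print(f"raw: {acc_raw:.3f}  scaled: {acc_scaled:.3f}")
```

The scaled model should score markedly better; exact numbers depend on the split, but the gap illustrates the "skewed features" effect above.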
As for the objected processing times:
- try to get a better view of what your AI/ML-model's overfit/generalisation landscape over [C, gamma] looks like
- try to add verbosity into the initial AI/ML-process tuning
- try to add n_jobs into the number crunching
- try to add a Grid-Computing move into your computation approach, if scale requires it
aGrid = aML_GS.GridSearchCV( aClassifierOBJECT, param_grid = aGrid_of_parameters, cv = cv, n_jobs = n_JobsOnMultiCpuCores, verbose = 5 )
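A concrete, runnable version of that call might look like the following sketch (dataset, grid values, and pipeline are illustrative choices, not from the original post):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# a deliberately small [C, gamma] grid; scaling is folded into the pipeline
# so each CV fold is scaled on its own training split
param_grid = {"svc__C": [1.0, 10.0, 100.0],
              "svc__gamma": [1e-3, 1e-2]}

grid = GridSearchCV(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                    param_grid=param_grid,
                    cv=StratifiedKFold(n_splits=3),
                    n_jobs=-1,          # spread the fits over all CPU cores
                    verbose=2)          # log each fit and its wall-clock time
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The verbose output is what produces per-fit timing lines like the log quoted below, which makes it easy to spot which [C, gamma] regions burn the most time.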
Sometimes the GridSearchCV() can indeed take a huge amount of CPU-time / CPU-pool-of-resources, even after all the above-mentioned tips are used. So, keep calm and do not panic, if you are sure the feature engineering, data-sanity checks & feature-domain preprocessing were done correctly.
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 -62.7min
[GridSearchCV] C=16777216.0, gamma=0.5 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 -64.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 -97.3min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 -66.3min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 -13.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 - 4.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 -14.7min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 - 25.4s
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 - 44.9s
[Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished
As was asked above about "... a regular svm.SVC().fit", kindly notice that it uses default [C, gamma] values and thus has no relevance to the behaviour of your model / problem-domain.
Re: Update
Oh yes indeed, regularisation/scaling of SVM inputs is a mandatory task for this AI/ML tool. scikit-learn has good instrumentation to produce and re-use aScalerOBJECT, for both a-priori scaling (before aDataSET goes into .fit()) & ex-post ad-hoc scaling, once you need to re-scale a new example and send it to the predictor to work its magic, via a request to anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) ).
(Yes, aNewExampleX may be a matrix, so asking for a "vectorised" processing of several answers.)
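That fit-once / transform-everywhere pattern can be sketched as follows (iris and the sample values are illustrative stand-ins for aDataSET and aNewExampleX):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# a-priori scaling: fit the scaler ONCE, on the training data only
scaler = StandardScaler().fit(X)
clf = SVC(kernel="rbf").fit(scaler.transform(X), y)

# ex-post ad-hoc scaling: re-use the SAME scaler for any new example(s);
# a (k, n_features) matrix gives "vectorised" answers for k examples at once
a_new_example = np.array([[5.1, 3.5, 1.4, 0.2]])
pred = clf.predict(scaler.transform(a_new_example))
print(pred)
```

Fitting a second scaler on the new examples would silently shift them into a different feature space than the one the classifier was trained in, which is why the same fitted scaler object must be kept around.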
Performance relief of the O( M² · N¹ ) computational complexity
In contrast to the below-posted guess that the problem "width", measured as N == the number of SVM features in matrix X, is to be blamed for the overall computing time, the SVM classifier with an RBF kernel is by design an O( M² · N¹ ) problem.
So there is a quadratic dependence on the overall number of observations (examples) moved into the training ( .fit() ) or cross-validation phase, and one can hardly state that the supervised-learning classifier will get any better predictive power if one "reduces" the (only linear) "width" of the features, which per se carry the inputs into the constructed predictive power of the SVM classifier, don't they?
Support Vector Machines are sensitive to scaling. It is most likely that your SVC is taking a long time to build each individual model. GridSearch is basically a brute-force method which runs the base models with different parameters. So, if your GridSearchCV is taking a long time to run, it is more likely due to:
- a large number of parameter combinations (which is not the case here), or
- an individual model that takes a lot of time to fit.