scikit grid search over multiple classifiers

Although the solution from dubek is more straight forward, it does not help with interactions between parameters of pipeline elements that come before the classfier. Therefore, I have written a helper class to deal with it, and can be included in the default Pipeline setting of scikit. A minimal example:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from pipelinehelper import PipelineHelper

iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
pipe = Pipeline([
    ('scaler', PipelineHelper([
        ('std', StandardScaler()),
        ('max', MaxAbsScaler()),
    ])),
    ('classifier', PipelineHelper([
        ('svm', LinearSVC()),
        ('rf', RandomForestClassifier()),
    ])),
])

params = {
    'scaler__selected_model': pipe.named_steps['scaler'].generate({
        'std__with_mean': [True, False],
        'std__with_std': [True, False],
        'max__copy': [True],  # just for displaying
    }),
    'classifier__selected_model': pipe.named_steps['classifier'].generate({
        'svm__C': [0.1, 1.0],
        'rf__n_estimators': [100, 20],
    })
}
grid = GridSearchCV(pipe, params, scoring='accuracy', verbose=1)
grid.fit(X_iris, y_iris)
print(grid.best_params_)
print(grid.best_score_)

It can also be used for other elements of the pipeline, not just the classifier. Code is on github if anyone wants to check it out.

Edit: I have published this on PyPI if anyone is interested, just install ti using pip install pipelinehelper.

Instead of using Grid Search for hyperparameter selection, you can use the 'hyperopt' library.

Please have a look at section 2.2 of this page. In the above case, you can use an hp.choice expression to select among the various pipelines and then define the parameter expressions for each one separately.

In your objective function, you need to have a check depending on the pipeline chosen and return the CV score for the selected pipeline and parameters (possibly via cross_cal_score).

The trials object at the end of the execution, will indicate the best pipeline and parameters overall.

This is how I did it without a wrapper function. You can evaluate any number of classifiers. Each one can have multiple parameters for hyperparameter optimization.

The one with best score will be saved to disk using pickle

from sklearn.svm import SVC
from operator import itemgetter
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
import operator

#pipeline parameters
    parameters = \
        [ \
            {
                'clf': [MultinomialNB()],
                'tf-idf__stop_words': ['english', None],
                'clf__alpha': [0.001, 0.1, 1, 10, 100]
            },

            {
                'clf': [SVC()],
                'tf-idf__stop_words': ['english', None],
                'clf__C': [0.001, 0.1, 1, 10, 100, 10e5],
                'clf__kernel': ['linear', 'rbf'],
                'clf__class_weight': ['balanced'],
                'clf__probability': [True]
            },

            {
                'clf': [DecisionTreeClassifier()],
                'tf-idf__stop_words': ['english', None],
                'clf__criterion': ['gini','entropy'],
                'clf__splitter': ['best','random'],
                'clf__class_weight':['balanced', None]
            }
        ]

    #evaluating multiple classifiers
    #based on pipeline parameters
    #-------------------------------
    result=[]

    for params in parameters:

        #classifier
        clf = params['clf'][0]

        #getting arguments by
        #popping out classifier
        params.pop('clf')

        #pipeline
        steps = [('tf-idf', TfidfVectorizer()), ('clf',clf)]

        #cross validation using
        #Grid Search
        grid = GridSearchCV(Pipeline(steps), param_grid=params, cv=3)
        grid.fit(features, labels)

        #storing result
        result.append\
        (
            {
                'grid': grid,
                'classifier': grid.best_estimator_,
                'best score': grid.best_score_,
                'best params': grid.best_params_,
                'cv': grid.cv
            }
        )

    #sorting result by best score
    result = sorted(result, key=operator.itemgetter('best score'),reverse=True)

    #saving best classifier
    grid = result[0]['grid']
    joblib.dump(grid, 'classifier.pickle')

scikit grid search over multiple classifiers

Tags:

Python

Scikit Learn

Related

Recent Posts