How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?
Yes, this is the right way to do this but there is a small mistake in your code. Let me break this down for you.
When you use the StandardScaler as a step inside a Pipeline, scikit-learn will internally do the job for you.
What happens can be described as follows:
- Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
- Step 1: the scaler is fitted on the TRAINING data
- Step 2: the scaler transforms the TRAINING data
- Step 3: the models are fitted/trained using the transformed TRAINING data
- Step 4: the scaler is used to transform the TEST data
- Step 5: the trained models predict using the transformed TEST data
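To make the steps above concrete, here is a rough manual sketch of what happens for each cross-validation split. It is an illustration only, not scikit-learn's actual internals: the synthetic dataset from make_classification and the fixed SVC settings are assumptions made for the example, and StratifiedKFold is used because that is what an integer cv resolves to for a classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption, not from the question).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

fold_scores = []
# Step 0: split the data into TRAINING and TEST parts, once per fold.
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    scaler = StandardScaler()
    scaler.fit(X[train_idx])                          # Step 1: fit scaler on TRAINING data only
    X_train_scaled = scaler.transform(X[train_idx])   # Step 2: transform TRAINING data
    clf = SVC(kernel='linear', C=1)
    clf.fit(X_train_scaled, y[train_idx])             # Step 3: train on transformed TRAINING data
    X_test_scaled = scaler.transform(X[test_idx])     # Step 4: transform TEST data with the same scaler
    fold_scores.append(clf.score(X_test_scaled, y[test_idx]))  # Step 5: predict and score on TEST data

print(np.mean(fold_scores))
```

The key point is that the scaler never sees the TEST part of a fold while it is being fitted, which is what putting it inside the Pipeline guarantees.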
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train) because the GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# X and y are your full feature matrix and labels.
# Scaling and PCA are re-fitted on the training part of every CV split.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

# Parameter names follow the pattern <step name>__<parameter name>.
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search on the fitted grid object (grid.fit() returns that same object). The best_score_ attribute gives the best mean cross-validated score observed during the search, and best_params_ describes the combination of parameters that achieved it.
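As a small sketch of how the fitted object can be inspected, the example below runs a reduced grid and reads back the results. The synthetic data and the smaller parameter grid are assumptions chosen to keep the example quick; best_params_, best_score_, best_index_, cv_results_ and best_estimator_ are standard GridSearchCV attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data and a reduced grid, used only for illustration (assumptions).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
param_grid = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['rbf', 'linear']}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)   # parameter combination with the best mean CV score
print(grid.best_score_)    # that best mean CV accuracy

# cv_results_ is a dict of arrays with one entry per parameter combination.
results = grid.cv_results_
print(results['mean_test_score'][grid.best_index_])  # matches best_score_

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of X, y using best_params_.
best_model = grid.best_estimator_
print(best_model.predict(X[:5]))
```

Because best_estimator_ is the complete pipeline, calling predict on it applies the fitted scaler before the classifier, so new data should be passed in unscaled.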
IMPORTANT EDIT 1: if you want to hold out a validation set from the original dataset, use this:
from sklearn.model_selection import train_test_split
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
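The held-out part can then be used for a final check after the search. The following is a minimal sketch under assumptions: the data are synthetic, the grid is reduced to keep it fast, and validation_accuracy is just an illustrative variable name. grid.score evaluates the refitted best pipeline with the scoring given to GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 15% of the data; the grid search never sees it.
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)

pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
param_grid = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['rbf', 'linear']}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)

# Accuracy of the best pipeline on data that played no part in the search.
validation_accuracy = grid.score(X_future_validation, y_future_validation)
print(validation_accuracy)
```

The scaler inside the best pipeline was fitted only on X_for_gridsearch, so the validation data are transformed with statistics they did not contribute to.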