Normalize data before or after split of training and testing data?
In the specific setting of a train/test split, we need to distinguish between two transformations:
- transformations that change the value of an observation (row) according to information about a feature (column) and
- transformations that change the value of an observation according to information about that observation alone.
Two common examples of (1) are mean-centering (subtracting the mean of the feature) and scaling to unit variance (dividing by the standard deviation). In sklearn, this combined transformation is implemented in sklearn.preprocessing.StandardScaler. Importantly, this is not the same as Normalizer. See below for exhaustive detail.
An example of (2) is transforming a feature by taking the logarithm, or raising each value to a power (e.g. squaring).
Transformations of the first type are best fit on the training data, with the centering and scaling values retained and applied to the test data afterwards. This is because using information about the test set to train the model may bias model comparison metrics to be overly optimistic, which can result in over-fitting and selection of a bogus model.
Transformations of the second type can be applied without regard to train/test splits, because the modified value of each observation depends only on the data about the observation itself, and not on any other data or observation(s).
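For instance, here is a minimal sketch of the recommended pattern (the data and variable names are illustrative, not from the question): learn the centering/scaling values from the training data only and reuse them on the test data, while a type-(2) transform such as log2 can be applied anywhere:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Type (1): fit on the training data only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # ...then reuse the training statistics.

# Type (2): each value depends only on itself, so the split is irrelevant.
X_train_logged = np.log2(X_train)
X_test_logged = np.log2(X_test)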
This question has garnered some misleading answers. The rest of this answer is dedicated to showing how and why they are misleading.
The term "normalization" is ambiguous, and different authors and disciplines will use the term "normalization" in different ways. In the absence of a specific articulation of what "normalization" means, I think it's best to approach the question in the most general sense possible.
In this view, the question is not about sklearn.preprocessing.Normalizer specifically. Indeed, the Normalizer class is not mentioned in the question. For that matter, no software, programming language or library is mentioned, either. Moreover, even if the intent is to ask about Normalizer, the answers are still misleading because they mischaracterize what Normalizer does.
Even within the same library, the terminology can be inconsistent. For example, PyTorch provides both torchvision.transforms.Normalize and torch.nn.functional.normalize. The former can be used to create output tensors with mean 0 and standard deviation 1, while the latter creates outputs that have a norm of 1.
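A quick illustration of the difference (both functions are real PyTorch APIs; the tensor is made up):

import torch
import torch.nn.functional as F

x = torch.tensor([[3.0, 4.0]])

# torch.nn.functional.normalize rescales each row to unit (L2) norm:
print(F.normalize(x, p=2.0, dim=1))  # tensor([[0.6000, 0.8000]])

# torchvision.transforms.Normalize, by contrast, computes (input - mean) / std
# channel-wise on image tensors, e.g. transforms.Normalize(mean=[0.5], std=[0.5]).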
What the Normalizer Class Does
The Normalizer class is an example of (2) because it rescales each observation (row) individually so that its sum-of-squares is 1. (In the corner case that a row has a sum-of-squares equal to 0, no rescaling is done.) The first sentence of the documentation for Normalizer says:
Normalize samples individually to unit norm.
This simple test code validates this understanding:
import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
normalizer = preprocessing.Normalizer()
normalized_all_X = normalizer.transform(X)
sum_of_squares = np.square(normalized_all_X).sum(1)
print(np.allclose(sum_of_squares, np.ones_like(sum_of_squares)))
This prints True because sum_of_squares is an array of 1s, as described in the documentation.
The normalizer implements fit, transform and fit_transform methods even though some of these are just "pass-through" methods. This is so that there is a consistent interface across preprocessing methods, not because the methods' behavior needs to distinguish between different data partitions.
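A quick illustrative check of this statelessness (not from the original answer): "fitting" Normalizer on two disjoint subsets yields objects that transform identically.

import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))

# fit learns nothing for this transformer, so these two behave identically.
n1 = preprocessing.Normalizer().fit(X[:2])
n2 = preprocessing.Normalizer().fit(X[3:])
print(np.allclose(n1.transform(X), n2.transform(X)))  # True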
Misleading Presentation 1
The Normalizer class does not subtract the column means
Another answer writes:
Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance.
Ok, so let's try this out. Using the code snippet from the answer, we have
X = np.arange(10).reshape((5, 2))
X_train = X[:3]
X_test = X[3:]

normalizer = preprocessing.Normalizer()
normalized_train_X = normalizer.fit_transform(X_train)
column_means_train_X = normalized_train_X.mean(0)
print(column_means_train_X)
This is the value of column_means_train_X, and it is not zero!
[0.42516214 0.84670847]
If the column means had been subtracted from the columns, then the centered column means would be 0.0. (This is simple to prove: call the sum of the n numbers x = [x1, x2, x3, ..., xn] S, so that their mean is S/n. Then sum(x - S/n) = S - n * (S/n) = 0.)
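A one-line numeric check of that identity, with made-up numbers:

import numpy as np

x = np.array([1.0, 5.0, 9.0])
print(np.sum(x - x.mean()))  # 0.0, up to floating-point rounding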
We can write similar code, sketched below, to show that the columns have not been divided by the variance. (Neither have the columns been divided by the standard deviation, which would be the more usual choice.)
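Continuing with the variables from the snippet above, one such check could look like this (the variable name is illustrative):

# The per-column standard deviations of the output are not 1, so the
# columns were not divided by their standard deviations (or variances).
column_stds_train_X = normalized_train_X.std(0)
print(column_stds_train_X)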
Misleading Presentation 2
Applying the Normalizer class to the whole data set does not change the result
If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).
This claim is true as far as it goes, but it has absolutely no bearing on the Normalizer class. Indeed, Giorgos Myrianthous's chosen example is actually immune to the effect that they are describing.
If the Normalizer class did involve the means of the features, then we would expect the normalized results to change depending on which of our data are included in the training set. For example, the sample mean is a weighted sum of every observation in the sample. If we were computing column means and subtracting them, the results of applying this to all of the data would differ from applying it to only the training data subset. But we've already demonstrated that Normalizer doesn't subtract column means.
Furthermore, these tests show that applying Normalizer to all of the data or to just some of the data makes no difference to the results.
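Here is a sketch of the code (reusing the X_train/X_test split defined above) that produces the outputs that follow:

normalizer = preprocessing.Normalizer()

# Separately: fit_transform the training rows, then transform the test rows.
print(normalizer.fit_transform(X_train))
print(normalizer.transform(X_test))

# Together: transform the whole array at once.
print(preprocessing.Normalizer().fit_transform(X))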
If we apply this method separately, we have
[[0. 1. ]
[0.5547002 0.83205029]
[0.62469505 0.78086881]]
[[0.65079137 0.7592566 ]
[0.66436384 0.74740932]]
And if we apply it together, we have
[[0. 1. ]
[0.5547002 0.83205029]
[0.62469505 0.78086881]
[0.65079137 0.7592566 ]
[0.66436384 0.74740932]]
where the only difference is that we have 2 arrays in the first case, due to partitioning. Let's just double-check that the combined arrays are the same:
normalized_train_X = normalizer.fit_transform(X_train)
normalized_test_X = normalizer.transform(X_test)
normalized_all_X = normalizer.transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), normalized_all_X)
No exception is raised; they're numerically identical.
But sklearn's transformers are sometimes stateful, so let's make a new object just to make sure this isn't some state-related behavior.
new_normalizer = preprocessing.Normalizer()
new_normalized_all_X = new_normalizer.fit_transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), new_normalized_all_X)
In the second case, we still have no exception raised.
We can conclude that for the Normalizer class, it makes no difference whether the data are partitioned or not.
You can use fit on the training data:
normalizer = preprocessing.Normalizer().fit(xtrain)
then transform both the training and the test data with the fitted object:
xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)
Ask yourself if your data will look different depending on whether you transform before or after your split. If you're doing a log2 transformation, the order doesn't matter, because each value is transformed independently of the others. If you're scaling and centering your data, the order does matter, because an outlier can drastically change the final distribution. You're allowing the test set to "spill over" and affect your training set, potentially causing overly optimistic performance measures.
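To make the spill-over concrete, here is a minimal sketch (synthetic numbers, illustrative names) comparing scaling after the split with scaling before it when an outlier sits in the test set:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[100.0]])  # the outlier lives in the test set

# After the split: statistics come from the training data only.
after = StandardScaler().fit(X_train).transform(X_train)
print(after.ravel())  # [-1.2247  0.  1.2247]

# Before the split: the outlier leaks into the training statistics.
before = StandardScaler().fit(np.vstack([X_train, X_test]))
print(before.transform(X_train).ravel())  # roughly [-0.60 -0.58 -0.55], all shifted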
For R users, the caret package is good at handling test/train splits. You can add the argument preProcess = c("scale", "center") to the train function and it will automatically apply any transformation from the training data onto the test data.
Tl;dr - if the data looks different depending on whether you normalize before or after your split, do it after the split.
You first need to split the data into training and test sets (a validation set could be useful too).
Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance. If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).
Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.
For a more comprehensive read, see my article Feature Scaling and Normalisation in a nutshell.
As an example, assuming we have the following data:
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
where X represents our features:
>>> X
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
and y contains the corresponding labels:
>>> list(y)
[0, 1, 2, 3, 4]
Step 1: Create training/testing sets
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
[[4 5]
[0 1]
[6 7]]
>>>
>>> X_test
[[2 3]
[8 9]]
>>>
>>> y_train
[2, 0, 3]
>>>
>>> y_test
[1, 4]
Step 2: Normalise training data
>>> from sklearn import preprocessing
>>>
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
[0. , 1. ],
[0.65079137, 0.7592566 ]])
Step 3: Normalize testing data
>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
[0.66436384, 0.74740932]])