Standardization of test data in Lab 6 should use training mean and standard deviation #11
Comments
Is it true though? I never managed to convince myself that it is wrong
to use the mean and std of the whole dataset. It is obvious that we
should not use the test set mean and std, but I never managed to prove that
using the whole dataset is harmful (and I have never seen a proof anywhere).
It seems to be an accepted precaution. On the contrary, I have many examples
where normalizing/standardizing on the train set and applying that to the
rest can lead to many problems. Think of a large dataset, where train (and
test) are just a small subset.
Pavlos
On Sat, Jul 21, 2018 at 10:50 AM covuworie wrote:
Observed behavior
Hi, there is a bug in classification-and-pca-lab.ipynb
<https://github.com/cs109/a-2017/blob/master/Labs/Lab6_Classification_PCA/classification-and-pca-lab.ipynb>
for Lab 6 in the do_classify method. When standardizing the testing data,
its mean and standard deviation are used. This is incorrect for several
reasons such as:
- No information from the testing data should be used in the model
prediction as it is a form of *data snooping*. The testing dataset has
been contaminated by this.
- The same variable is not being created during the transformation of
the training and testing sets
Expected behavior
The training data mean and standard deviation should be used for
standardizing the testing data like so:
dftest=(subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
I think this was mentioned in one of the earlier lectures and here are
some more references:
- https://stats.stackexchange.com/questions/202287/why-standardization-of-the-testing-set-has-to-be-performed-with-the-mean-and-sd
- https://sebastianraschka.com/faq/docs/scale-training-test.html
- https://www.researchgate.net/post/If_I_used_data_normalization_x-meanx_stdx_for_training_data_would_I_use_train_Mean_and_Standard_Deviation_to_normalize_test_data
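The expected behavior above can be sketched as a small helper. This is only an illustration under stated assumptions: `standardize_with_train_stats` is a hypothetical name that does not appear in the lab notebook, the data frame is synthetic, and a NumPy permutation stands in for the notebook's `train_test_split` call.

```python
import numpy as np
import pandas as pd


def standardize_with_train_stats(df, itrain, itest):
    """Standardize both splits with statistics from the training rows only.

    Hypothetical helper for illustration; not part of the lab notebook.
    """
    train = df.iloc[itrain]
    test = df.iloc[itest]
    mean = train.mean()  # per-column mean of the training rows
    std = train.std()    # per-column sample std (pandas default, ddof=1)
    return (train - mean) / std, (test - mean) / std


# Synthetic stand-in for subdf; the real lab data is not reproduced here.
rng = np.random.default_rng(0)
subdf = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(354, 3)),
    columns=["x0", "x1", "x2"],
)

# Simple random split by position (assumed stand-in for train_test_split).
order = rng.permutation(subdf.shape[0])
itrain, itest = order[:212], order[212:]

dftrain, dftest = standardize_with_train_stats(subdf, itrain, itest)
print(dftrain.mean().round(3).tolist())  # training columns are centered
print(dftest.mean().round(3).tolist())   # test columns are only approximately centered
```

With this approach the test columns are generally not exactly zero-mean or unit-variance, which is expected: they are expressed on the scale learned from the training rows.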
--
Pavlos Protopapas
-----------------
Scientific Program Director, Institute for Applied Computational Science
Harvard School of Engineering and Applied Sciences
Maxwell Dworkin, 33 Oxford Street
Cambridge, MA 02138
http://iacs.seas.harvard.edu/
Hi Pavlos,
Thanks for the response. Am I missing something here? As you say, "It is obvious that we should not use the test set mean and std" [...]
    itrain, itest = train_test_split(range(subdf.shape[0]), train_size=train_size)
    if standardize:
        dftrain=(subdf.iloc[itrain] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
        dftest=(subdf.iloc[itest] - subdf.iloc[itest].mean())/subdf.iloc[itest].std()
The same is also done in cell 20 in the [...]
Now referring to whether it is correct to use the mean and standard deviation of the whole dataset. As the Sebastian Raschka link above says:
In this case there are only 212 observations in the training set and 142 observations in the test set, which is not a lot (especially compared with 63 predictors). I think the main point the various authors are making is one of data leakage / data snooping when the entire dataset mean and std are used. The example that is used in the article mentioned above makes a lot of sense:
Yes, I agree that in practice it may not make much of a difference compared to using the training set mean and standard deviation if the sample size is large and the observations are drawn independently from the same distribution. Yes, we could check this before deciding. But why even take the chance? I think the answer to this question provides a great explanation and also links to further reputable resources which discuss the issue:
Chuk
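To make the three options discussed in this thread concrete, here is a minimal sketch that standardizes the same test split with training, test, and whole-dataset statistics. It assumes a single synthetic feature drawn i.i.d. from one normal distribution, with the 212/142 split sizes quoted above; `zscore` is a hypothetical helper, and nothing here uses the lab data.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test = 212, 142  # split sizes quoted in the comment above

# One synthetic feature drawn i.i.d. from a single distribution (an assumption).
x = rng.normal(loc=10.0, scale=3.0, size=n_train + n_test)
x_train, x_test = x[:n_train], x[n_train:]


def zscore(values, reference):
    """Standardize `values` with the mean and sample std of `reference`."""
    return (values - reference.mean()) / reference.std(ddof=1)


test_by_train = zscore(x_test, x_train)  # training statistics only
test_by_test = zscore(x_test, x_test)    # test statistics, as reported in this issue
test_by_all = zscore(x_test, x)          # whole-dataset statistics

for name, z in [("train", test_by_train), ("test", test_by_test), ("all", test_by_all)]:
    print(f"{name:>5}: mean={z.mean():+.3f}, std={z.std(ddof=1):.3f}")
```

Standardizing the test split with its own statistics always yields mean 0 and std 1 by construction, however much that split differs from the training data. The other two variants keep whatever offset exists between the splits, and how large that offset is depends on the sample sizes and on whether the i.i.d. assumption holds.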
Observed behavior
Hi, there are bugs in classification-and-pca-lab.ipynb for Lab 6 in the do_classify and classify_from_dataframe methods. When standardizing the testing data, its mean and standard deviation are used. This is incorrect for several reasons such as:
Expected behavior
The training data mean and standard deviation should be used for standardizing the testing data like so:
I think this was mentioned in one of the earlier lectures and here are some more references: