Hello. If you have really huge data, like 100 GB, that does not fit in memory, what would be a good way to normalize the input data? I thought of estimating the mean and variance from each batch during training, but then I wasn’t sure how to normalize the test data, since we want to use the same mean and variance for the test set as well. Thank you!

Hi there,

Usually the scaling parameters are computed on the training set only, as a matter of good practice, and the same parameters are then applied to the test set. You can take a representative (and sufficiently large) sample to calculate \sigma and \mu. The maximum likelihood estimates should be sufficient to get a reasonable estimate of both parameters that determine your normal distribution, and hence the scaling / normalisation: the estimation error shrinks roughly like 1/\sqrt{n} with the sample size n (law of large numbers).
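To make the sample-based idea concrete, here is a minimal sketch in NumPy. It assumes the sample fits in memory and that rows can be drawn at random; the helper names `estimate_scaler` and `standardize` and the synthetic `train` / `test` arrays are made up for illustration and only stand in for data that would really live on disk.

```python
import numpy as np

def estimate_scaler(sample):
    """Estimate per-feature mean and std from a sample of training rows."""
    mu = sample.mean(axis=0)
    sigma = sample.std(axis=0)   # ddof=0, i.e. the maximum likelihood estimate
    sigma[sigma == 0] = 1.0      # guard against constant features
    return mu, sigma

def standardize(x, mu, sigma):
    """Apply fixed scaling parameters to any batch (train or test)."""
    return (x - mu) / sigma

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumption): the real data would be read from disk.
train = rng.normal(loc=5.0, scale=2.0, size=(200_000, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(1_000, 3))

# Draw a random sample of training rows and estimate the parameters once.
sample_idx = rng.choice(len(train), size=20_000, replace=False)
mu, sigma = estimate_scaler(train[sample_idx])

# Every training batch and the test set reuse the same mu and sigma.
train_batch_scaled = standardize(train[:256], mu, sigma)
test_scaled = standardize(test, mu, sigma)
```

The point of the sketch is that \mu and \sigma are estimated once and stored, so the test set never needs its own statistics.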

If you want to use the exact values from your (training) data for scaling, you can go for min/max scaling, since it is easy to keep track of the extreme values while loading the data into memory chunk by chunk. Afterwards min/max scaling can be applied to the whole dataset, again either loading the data into memory in several pieces or doing it in a parallelised, distributed setting. The whole process has \Theta(n) complexity.
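A rough sketch of how the min/max variant could look when the data is read in chunks. The generator `iter_chunks` is a hypothetical stand-in for whatever loads pieces of the dataset from disk, and the synthetic `data` array is only there to make the example runnable.

```python
import numpy as np

def streaming_min_max(chunks):
    """One pass over an iterable of 2-D arrays, tracking per-feature min and max."""
    lo, hi = None, None
    for chunk in chunks:
        c_lo, c_hi = chunk.min(axis=0), chunk.max(axis=0)
        lo = c_lo if lo is None else np.minimum(lo, c_lo)
        hi = c_hi if hi is None else np.maximum(hi, c_hi)
    return lo, hi

def min_max_scale(x, lo, hi):
    """Map each feature to [0, 1] using the stored extremes."""
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span

def iter_chunks(data, chunk_size):
    # Hypothetical loader: in practice this would read pieces from disk.
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

rng = np.random.default_rng(0)
data = rng.uniform(-3.0, 7.0, size=(10_000, 2))

# Pass 1: remember only the extreme values, never the full dataset.
lo, hi = streaming_min_max(iter_chunks(data, 1_000))

# Pass 2: apply the scaling chunk by chunk.
scaled_chunks = [min_max_scale(c, lo, hi) for c in iter_chunks(data, 1_000)]
```

Only `lo` and `hi` have to be kept between the two passes, and the same two vectors are what you would store to scale the test set later.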

I think both approaches give reasonable scaling results. In my experience, arbitrarily exact scaling is not too important. After all, you just want to make sure that your features are on a comparable, reasonable scale so that training is well behaved and gradient descent runs more effectively, without biasing the algorithm towards high-amplitude features.

What do you think?

Best regards

Christian

I very much agree with @Christian_Simonis’s approach of estimating it on a large and representative sample.

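To get the exact normalization parameters, min-max normalization requires us to go through the whole dataset once. Standardization needs two passes if done the textbook way (one for the mean, one for the variance), although running sums or Welford's algorithm can get both in a single pass.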
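For the exact mean and variance, the statistics can also be accumulated chunk by chunk by merging per-chunk means and sums of squared deviations (the pairwise update of Chan et al., a batched relative of Welford's algorithm). Below is a minimal sketch under the assumption that each chunk is a 2-D NumPy array; the function name `streaming_mean_var` is made up for illustration.

```python
import numpy as np

def streaming_mean_var(chunks):
    """Single pass: merge per-chunk statistics into a running mean and variance."""
    n = 0
    mean = None
    m2 = None  # running sum of squared deviations from the running mean
    for chunk in chunks:
        k = chunk.shape[0]
        if k == 0:
            continue  # nothing to merge from an empty chunk
        c_mean = chunk.mean(axis=0)
        c_m2 = ((chunk - c_mean) ** 2).sum(axis=0)
        if n == 0:
            n, mean, m2 = k, c_mean, c_m2
            continue
        delta = c_mean - mean
        total = n + k
        mean = mean + delta * (k / total)
        m2 = m2 + c_m2 + delta ** 2 * (n * k / total)
        n = total
    if n == 0:
        raise ValueError("no data seen")
    return mean, m2 / n  # population variance (ddof=0)

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=(10_000, 4))
chunks = np.array_split(data, 7)  # uneven chunk sizes on purpose

mean, var = streaming_mean_var(chunks)
std = np.sqrt(var)
```

Merging deviations around the mean, rather than keeping a plain sum of squares, tends to be numerically safer when the values are large compared with their spread.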

Raymond

That makes total sense to me! Thank you so much for your detailed explanation.