Questions on normalizing really huge data

Hello. If you have really huge data, say 100 GB, that does not fit in memory, what would be a good way to normalize the input data? I thought of estimating the mean and variance from each batch during training, but then I wasn't sure how to normalize the test data, since we want to use the same mean and variance for the test set as well. Thank you!

Hi there,

Usually, as good practice, scaling is fitted on the training set only. You can take a representative (and sufficiently large) sample to estimate \mu and \sigma. Maximum likelihood estimation on that sample should be sufficient to get a reasonable estimate of both parameters that determine your normal distribution, and hence your scaling / normalisation: by the law of large numbers the estimates converge to the true values, and their standard error shrinks roughly like 1/\sqrt{n}.
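
To make the sampling idea concrete, here is a minimal sketch, assuming the data lives in one large CSV with purely numeric columns (the file name, chunk size, and sample fraction are placeholders, not anything from the thread):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical file; assumes numeric columns only
SAMPLE_FRAC = 0.01             # keep ~1% of rows; tune so the sample fits in memory

rng = np.random.default_rng(seed=0)
sampled_chunks = []
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # Keep a random fraction of every chunk so the sample covers
    # the whole file, not just its beginning.
    mask = rng.random(len(chunk)) < SAMPLE_FRAC
    sampled_chunks.append(chunk[mask])

sample = pd.concat(sampled_chunks)
mu = sample.mean()          # per-feature estimate of \mu
sigma = sample.std(ddof=0)  # per-feature estimate of \sigma (ddof=0 is the MLE)

# The same mu/sigma are then reused for train, validation and test data:
# normalized = (batch - mu) / sigma
```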

If you want to use the exactly correct values of your (training) data for scaling, you can go for min/max scaling, since it is easy to keep track of the running extremes while loading the data into memory chunk by chunk. Afterwards, min/max scaling can be applied to the whole dataset, again streaming the data through memory several times or doing it in a parallelised, distributed setting. This whole process has \Theta(n) complexity.
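
For the exact min/max variant, a first streaming pass can track the running extremes per feature. A sketch under the same assumption of a large numeric CSV (file name and chunk size are again placeholders):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical path; assumes numeric columns

running_min, running_max = None, None

# Pass 1: a single streaming scan to find the per-feature extremes (Theta(n)).
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    cmin, cmax = chunk.min(), chunk.max()
    running_min = cmin if running_min is None else np.minimum(running_min, cmin)
    running_max = cmax if running_max is None else np.maximum(running_max, cmax)

# Pass 2, applied lazily to each batch as it is loaded: scale into [0, 1].
def minmax_scale(batch):
    return (batch - running_min) / (running_max - running_min)
```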

I think both approaches give reasonable scaling results. In my experience, arbitrarily exact / accurate scaling is not that important. After all, what you want is features on a comparable, reasonable scale, so that training behaves nicely and gradient descent runs more effectively without biasing the algorithm towards high-amplitude features.

What do you think?

Best regards

I very much agree with @Christian_Simonis's approach of estimating the parameters on a large and representative sample.

To get the exact normalization parameters, min/max normalization requires going through the whole dataset once. Standard normalization classically requires two passes (one for the mean, one for the variance), though it can also be done in a single pass by keeping running sums of x and x^2.
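
As an illustration of the single-pass variant, here is a sketch that accumulates running sums of x and x^2, again assuming a large numeric CSV with a placeholder file name (the classic two-pass version is slightly more numerically stable):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical path; assumes numeric columns

n, s, ss = 0, None, None  # row count, running sum of x, running sum of x**2

for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    x = chunk.to_numpy(dtype=np.float64)
    n += x.shape[0]
    s = x.sum(axis=0) if s is None else s + x.sum(axis=0)
    ss = (x**2).sum(axis=0) if ss is None else ss + (x**2).sum(axis=0)

mu = s / n
sigma = np.sqrt(ss / n - mu**2)  # population variance: E[x^2] - E[x]^2

# Reuse the same mu and sigma to scale train, validation and test batches.
```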

That makes total sense to me! Thank you so much for your detailed explanation 🙂
