Questions on normalizing really huge data

Hello. If you have really huge data, say 100 GB, that does not fit in memory, what would be a good way to normalize the input data? I thought of estimating the mean and variance from each batch during training, but then I wasn't sure how to normalize the test data, since we want to use the same mean and variance for the test set as well. Thank you!

Hi there,

Usually, as good practice, scaling is fitted on the training set only. You can take a representative (and sufficiently large) sample to estimate \mu and \sigma. Maximum likelihood estimation on that sample should be sufficient to get a reasonable estimate of both parameters that determine your normal distribution, and hence your scaling / normalisation: by the law of large numbers the estimates converge to the true values, and their standard error shrinks roughly like 1/\sqrt{n}.
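
To make the sampling idea concrete, here is a minimal sketch, assuming the data lives in one large CSV with purely numeric columns (the file name, chunk size, and sample fraction are placeholders, not anything from the thread):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical file; assumes numeric columns only
SAMPLE_FRAC = 0.01             # keep ~1% of rows; tune so the sample fits in memory

rng = np.random.default_rng(seed=0)
sampled_chunks = []
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # Keep a random fraction of every chunk so the sample covers
    # the whole file, not just its beginning.
    mask = rng.random(len(chunk)) < SAMPLE_FRAC
    sampled_chunks.append(chunk[mask])

sample = pd.concat(sampled_chunks)
mu = sample.mean()          # per-feature estimate of \mu
sigma = sample.std(ddof=0)  # per-feature estimate of \sigma (ddof=0 is the MLE)

# The same mu/sigma are then reused for train, validation and test data:
# normalized = (batch - mu) / sigma
```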

If you want to use the exactly correct values of your (training) data for scaling, you can go for min/max scaling, since it is easy to keep track of the running extremes while loading the data into memory chunk by chunk. Afterwards, min/max scaling can be applied to the whole dataset, again streaming the data through memory several times or doing it in a parallelised, distributed setting. This whole process has \Theta(n) complexity.
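
For the exact min/max variant, a first streaming pass can track the running extremes per feature. A sketch under the same assumption of a large numeric CSV (file name and chunk size are again placeholders):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical path; assumes numeric columns

running_min, running_max = None, None

# Pass 1: a single streaming scan to find the per-feature extremes (Theta(n)).
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    cmin, cmax = chunk.min(), chunk.max()
    running_min = cmin if running_min is None else np.minimum(running_min, cmin)
    running_max = cmax if running_max is None else np.maximum(running_max, cmax)

# Pass 2, applied lazily to each batch as it is loaded: scale into [0, 1].
def minmax_scale(batch):
    return (batch - running_min) / (running_max - running_min)
```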

I think both approaches give reasonable scaling results. In my experience, arbitrarily exact / accurate scaling is not that important. After all, what you want is features on a comparable, reasonable scale, so that training behaves nicely and gradient descent runs more effectively without biasing the algorithm towards high-amplitude features.

What do you think?

Best regards

I very much agree with @Christian_Simonis's approach of estimating the parameters on a large and representative sample.

To get the exact normalization parameters, min/max normalization requires going through the whole dataset once. Standard normalization classically requires two passes (one for the mean, one for the variance), though it can also be done in a single pass by keeping running sums of x and x^2.
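
As an illustration of the single-pass variant, here is a sketch that accumulates running sums of x and x^2, again assuming a large numeric CSV with a placeholder file name (the classic two-pass version is slightly more numerically stable):

```python
import numpy as np
import pandas as pd

CSV_PATH = "huge_dataset.csv"  # hypothetical path; assumes numeric columns

n, s, ss = 0, None, None  # row count, running sum of x, running sum of x**2

for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    x = chunk.to_numpy(dtype=np.float64)
    n += x.shape[0]
    s = x.sum(axis=0) if s is None else s + x.sum(axis=0)
    ss = (x**2).sum(axis=0) if ss is None else ss + (x**2).sum(axis=0)

mu = s / n
sigma = np.sqrt(ss / n - mu**2)  # population variance: E[x^2] - E[x]^2

# Reuse the same mu and sigma to scale train, validation and test batches.
```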

That makes total sense to me! Thank you so much for your detailed explanation 🙂
