Hi there,

usually, as good practice, scaling parameters are computed on the training set only. You can take a representative (and sufficiently large) sample to estimate \mu and \sigma. The maximum-likelihood estimates (the sample mean and standard deviation) are usually good enough for the scaling / normalisation, since their standard error shrinks like 1/\sqrt{n} (central limit theorem).
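A minimal sketch of this idea in NumPy — the dataset, sample size, and distribution parameters below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a large training set.
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

# Estimate mu and sigma from a representative sample
# instead of touching the full dataset.
sample = rng.choice(data, size=10_000, replace=False)
mu, sigma = sample.mean(), sample.std()

# Standardise the full dataset with the sampled estimates.
scaled = (data - mu) / sigma
```

With a sample of this size the scaled data ends up very close to zero mean and unit variance, even though \mu and \sigma were never computed on the full dataset.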

If you want to use the exact values of your (training) data for scaling, you can go for min/max scaling: the extreme values are easy to track while loading new data into memory. Afterwards, min/max scaling can be applied to the whole dataset, either by streaming over the data again or in a parallelised, distributed setting. The whole process runs in \Theta(n).
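A sketch of this two-pass approach, assuming the data arrives as a list of in-memory chunks (the helper name and chunk layout are hypothetical):

```python
import numpy as np

def minmax_scale_chunked(chunks):
    """Min/max scaling over data that is processed chunk by chunk."""
    # First pass: track the global extremes across all chunks -- Theta(n).
    lo, hi = np.inf, -np.inf
    for chunk in chunks:
        lo = min(lo, chunk.min())
        hi = max(hi, chunk.max())
    # Second pass: apply min/max scaling with the exact extremes.
    return [(chunk - lo) / (hi - lo) for chunk in chunks]
```

Each pass only needs one chunk in memory at a time, which is why the extremes are cheap to obtain even when the full dataset does not fit into RAM.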

I think both approaches give you reasonable scaling results. In my experience, having arbitrarily exact / accurate scaling is not too important. After all, you mainly want your features on a comparable scale so that training behaves well and gradient descent runs effectively, without biasing the algorithm towards high-amplitude features.

What do you think?

Best regards

Christian