Hi there,

usually, as good practice, scaling parameters are computed on the training set only. You can take a representative (and sufficiently large) sample to estimate \mu and \sigma. The maximum-likelihood estimates (the sample mean and standard deviation) are usually good enough for the scaling / normalisation, since their standard error shrinks like 1/\sqrt{n} (central limit theorem).
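A minimal sketch of this idea in NumPy — the dataset, sample size, and distribution parameters below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a large training set.
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

# Estimate mu and sigma from a representative sample
# instead of touching the full dataset.
sample = rng.choice(data, size=10_000, replace=False)
mu, sigma = sample.mean(), sample.std()

# Standardise the full dataset with the sampled estimates.
scaled = (data - mu) / sigma
```

With a sample of this size the scaled data ends up very close to zero mean and unit variance, even though \mu and \sigma were never computed on the full dataset.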

If you want to use the exact values of your (training) data for scaling, you can go for min/max scaling: the extreme values are easy to track while loading new data into memory. Afterwards, min/max scaling can be applied to the whole dataset, either by streaming over the data again or in a parallelised, distributed setting. The whole process runs in \Theta(n).
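A sketch of this two-pass approach, assuming the data arrives as a list of in-memory chunks (the helper name and chunk layout are hypothetical):

```python
import numpy as np

def minmax_scale_chunked(chunks):
    """Min/max scaling over data that is processed chunk by chunk."""
    # First pass: track the global extremes across all chunks -- Theta(n).
    lo, hi = np.inf, -np.inf
    for chunk in chunks:
        lo = min(lo, chunk.min())
        hi = max(hi, chunk.max())
    # Second pass: apply min/max scaling with the exact extremes.
    return [(chunk - lo) / (hi - lo) for chunk in chunks]
```

Each pass only needs one chunk in memory at a time, which is why the extremes are cheap to obtain even when the full dataset does not fit into RAM.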

I think both approaches give you reasonable scaling results. In my experience, having arbitrarily exact / accurate scaling is not too important. After all, you mainly want your features on a comparable scale so that training behaves well and gradient descent runs effectively, without biasing the algorithm towards high-amplitude features.

What do you think?

Best regards

Christian