I will share my view about standardization.
I think the idea behind standardization is to bring all input features onto the same scale in order to simplify the learning algorithm.
With non-standardized features, since each weight in the neural network is coupled to an input feature of a different scale, efficient training can end up requiring a very different gradient descent step size for each weight. In other words, we would have to manually pick a good learning rate for each weight, if we could not figure out how to assign them automatically.
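To make that concrete, here is a minimal z-score standardization sketch in NumPy (the feature values are just made-up toy numbers for illustration):

```python
import numpy as np

# Toy feature matrix: column 0 is on a ~1000x larger scale than column 1 (made-up values).
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])

# Z-score standardization: subtract each column's mean and divide by its standard deviation,
# so every feature ends up with mean 0 and standard deviation 1, i.e. on the same scale.
mu    = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]
```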
This post has an MLS video screenshot which illustrates the problem. In the upper example, because the features are not standardized, the cost contours are elliptical, so gradient descent needs a smaller step in the horizontal direction but a larger step in the vertical direction in order NOT to oscillate (which would be inefficient).
Bringing this idea to a more general neural network, which can easily have millions of weights, we can see how infeasible it would be to assign such a large number of learning rates ourselves. However, if we standardize the features, as in the lower example in the same screenshot, the cost contours become circular and the same step size works in all directions. Even in the case of millions of weights, the problem of “different step sizes” would be largely reduced (I don’t mean completely eliminated), so that having one learning rate for all weights becomes feasible.
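As a rough sketch of that effect (toy data and a hypothetical `gradient_descent` helper, not the exact MLS code): with one shared learning rate, plain gradient descent blows up on the raw features but converges once they are standardized.

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent for linear regression, one shared learning rate."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y           # prediction error for every example
        w -= alpha * (X.T @ err) / m  # the same alpha updates every weight
        b -= alpha * err.mean()
    return ((X @ w + b - y) ** 2).mean() / 2  # final cost

# Toy data: feature 0 is on a ~1000x larger scale than feature 1 (made-up values).
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
y = np.array([400.0, 232.0, 315.0, 178.0])

alpha = 0.1  # one learning rate for all weights

# Raw features: this alpha is far too large for the large-scale feature's weight, so the cost explodes.
print("raw features cost:", gradient_descent(X, y, alpha, 20))

# Standardized features: the same alpha now suits every weight and the cost shrinks.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("standardized cost:", gradient_descent(X_std, y, alpha, 20))
```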
If learning works with only one learning rate instead of millions of them, the learning algorithm has been simplified.
Cheers,
Raymond