I wanted to check if my thoughts are on the right track about feature scaling. Suppose I have n features \mathbf{x} = \langle x_1,\ldots,x_n\rangle and m samples. Let \mu_j and \sigma_j denote the mean and standard deviation of feature x_j. The cost function J(\mathbf{w},b)_{\mathbf{x}} in either linear or logistic regression comes down to \mathbf{w} = \langle w_1,\ldots, w_n\rangle, b, and the linear terms \mathbf{w}\cdot\mathbf{x}^{(i)} + b for i=1,\ldots,m.

Let’s say the scaled features are x_1',\ldots, x'_n, where each x_j' = \frac{x_j-\mu_j}{\sigma_j}. In this case, letting w'_j = \sigma_j w_j and b'= b+\sum_{j=1}^n w_j\mu_j, the cost J(\mathbf{w}',b')_{\mathbf{x}'} equals J(\mathbf{w},b)_{\mathbf{x}}, since (\mathbf{x},\mathbf{w},b) produces the same predictions as (\mathbf{x}',\mathbf{w}',b'). This means the minimum (if it exists) of the two functions J(\mathbf{w},b)_{\mathbf{x}} and J(\mathbf{w}',b')_{\mathbf{x}'} is the same. Namely, the minimum of the cost with and without normalization is the same (where the minimum is attained differs, of course).
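This parameter correspondence can be checked numerically. Here is a minimal NumPy sketch; the data and parameter values are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[100.0, 3.0], scale=[50.0, 1.0], size=(20, 2))  # m=20 samples, n=2 features
w = np.array([0.5, -2.0])
b = 1.5

# Column-wise (z-score) scaling: x'_j = (x_j - mu_j) / sigma_j
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

# Transformed parameters from the post: w'_j = sigma_j * w_j, b' = b + sum_j w_j * mu_j
w_prime = sigma * w
b_prime = b + (w * mu).sum()

# Predictions with (w', b') on scaled features match (w, b) on raw features
pred_raw = X @ w + b
pred_scaled = X_scaled @ w_prime + b_prime
print(np.allclose(pred_raw, pred_scaled))  # True
```

Since the predictions agree sample by sample, any cost computed from the predictions (squared error, log loss, …) agrees as well.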

So methinks theoretically the cost-minimization part is the same with or without normalization. The benefits of normalization are faster convergence and bigger learning rates when doing gradient descent?? Specifically, gradient descent should converge to similar cost values, except it can be much slower without normalization? Or is it possible that without normalization, batch gradient descent is never going to reach the minimum?? (Let’s assume our functions are convex for now.)

The benefits of normalization are faster convergence and bigger learning rates when doing gradient descent. Specifically, gradient descent should converge to similar cost values, except it can be much slower without normalization, as scaling just changes the convergence path, like in this video

In addition to the above, gradient descent with normalization can help avoid getting stuck in local-minimum areas of the cost, like this post

Thanks for the effective visualization! I actually went back and did an experiment: I first ran gradient descent on the scaled features to get \mathbf{w}', b', and then applied the formula from my original post to recover \mathbf{w}, b. It turns out this \mathbf{w}, b gives the exact same predictions and cost on the unscaled features \mathbf{x}. Meanwhile, it seems impossible to reach these \mathbf{w}, b with gradient descent on the original features. I gave up after 4000 iterations with a learning rate around 10^{-13}. Even an adjustment to 2\times 10^{-13} already gave divergent results. It is very difficult to fine-tune the learning rate so that it descends faster.

Indeed! If everyone did the same experiment, we would all probably be more impressed by scaling. Without it, the ranges of the feature dimensions differ by a lot. Consequently, we need a learning rate small enough to cater to the feature with the largest range in order to avoid divergence. However, with that tiny rate, the weights for features with much smaller ranges barely move.
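A small, hypothetical illustration of this effect: gradient descent on two features with very different ranges, with and without column-wise scaling. The data, learning rates, and iteration counts below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
# Two features with very different ranges (made-up data)
x1 = rng.uniform(0, 1e5, m)   # price-like feature, large range
x2 = rng.uniform(0, 5, m)     # rooms-like feature, small range
X = np.column_stack([x1, x2])
y = X @ np.array([2.0, 3.0]) + 7.0 + rng.normal(0, 1, m)

def gradient_descent(X, y, lr, iters):
    """Batch gradient descent on mean-squared-error cost; returns final cost."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return ((X @ w + b - y) ** 2).mean() / 2

# Unscaled: lr must be tiny to avoid divergence on the large-range feature,
# so the small-range feature's weight barely moves
cost_raw = gradient_descent(X, y, lr=1e-10, iters=1000)

# Column-wise scaled: a much larger lr converges quickly
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cost_scaled = gradient_descent(Xs, y, lr=0.1, iters=1000)

print(cost_scaled < cost_raw)  # True: same iteration budget, much lower cost when scaled
```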

From these conversations I am confused again. I had an older post about scaling using a single mean and standard deviation computed across all features. Shouldn’t this reduce the differences between data points the most? Yet in my experiments here, doing this actually results in much slower convergence…

My guess is that normalizing across all features probably mars the distinct contribution of each feature. This is very imprecise, unfortunately.

Sorry for causing confusion. Yes, normalization would go across all features. I am just highlighting how different the original scales of features can be, which would reinforce the importance of scaling.

@Juan_Olano ,
Thanks for answering. What I meant was: take a single mean across all data points, and a single standard deviation across all data points, then normalize all data points using that single mean and standard deviation.

Imagine we have two features x_1, x_2. Normally (pun intended?) we calculate the standard deviations \sigma_1, \sigma_2 and means \mu_1, \mu_2 for x_1, x_2 respectively.

What I meant was computing a single \mu, \sigma from the points across both x_1 and x_2, then normalizing as \frac{x_1-\mu}{\sigma}, \frac{x_2-\mu}{\sigma}.
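A quick NumPy sketch of the difference between the two schemes, using made-up Pricing/Rooms values:

```python
import numpy as np

X = np.array([[100000.0, 3.0],
              [200000.0, 4.0],
              [150000.0, 2.0]])  # columns: Pricing, Rooms (hypothetical values)

# Column-wise normalization: one (mu_j, sigma_j) per feature
col_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# "Flat" normalization: a single (mu, sigma) over all entries of X
mu, sigma = X.mean(), X.std()
flat_norm = (X - mu) / sigma

print(col_norm.std(axis=0))   # both columns end up with std 1
print(flat_norm.std(axis=0))  # the Rooms column is compressed to a tiny std
```

Because the single \sigma is dominated by the large-range feature, the small-range feature gets squeezed into a sliver under flat normalization.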

Great discussion. I just want to add two resources that helped me understand feature scaling better, and hopefully they help you as well. They are not exhaustive explanations of feature scaling, but they do help with understanding its importance, some warnings, and use cases.

I think @Juan_Olano has provided us a very good example to illustrate the idea behind scaling. Here, Pricing spans a range of 900K with an order of magnitude of 10^5, whereas Rooms spans only 4 with an order of magnitude of 10^0.

While the error term error^{(i)} is common to all features, x_j^{(i)} is the real factor that makes the scale of one feature’s gradient different from another’s. For Pricing, its scale is on average 10^5, and for Rooms it is 10^0, so we can expect a 10^{(5-0)} difference in the order of magnitude of the gradients as well. Given that we use the same learning rate for all features, we should expect the weight for Pricing to change much more dramatically than the one for Rooms.
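This gradient-scale argument can be sketched numerically. The data below is made up; starting from \mathbf{w}=\mathbf{0}, the error term is identical for both features, so the ratio of the gradients tracks the ratio of the feature scales:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
pricing = rng.uniform(1e5, 1e6, m)  # order 10^5 and up
rooms = rng.uniform(1.0, 5.0, m)    # order 10^0
X = np.column_stack([pricing, rooms])
y = rng.uniform(0, 10, m)           # arbitrary targets for the sketch

w, b = np.zeros(2), 0.0
error = X @ w + b - y               # same error term enters every feature's gradient

# dJ/dw_j = mean(error * x_j): x_j sets the scale of each feature's gradient
grad = (X.T @ error) / m
print(abs(grad[0]) / abs(grad[1]))  # roughly 10^5: Pricing's gradient dwarfs Rooms'
```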

In your “old note”, you have talked about “Flat normalization” and “Column-wise normalization”. I think all of us are referring to the latter in this discussion.

To @Juan_Olano: I totally agree, and to @rmwkwok: Less yes, not entirely no either.

To me, the flattened normalization across all features should provide the most uniform gradients. In your explanation, the factor x_j^{(i)} seems to be controlled by flattened normalization as well. In my experiment the prediction is still good, just much slower to reach. The learning rate there has been tuned as well. So my confusion is: why much slower?

So, to summarize the item on the scope of normalization: we would normalize each feature independently, and at the end all features would be in the 0-to-1 range; done like this, each feature preserves its distribution.

Yes, it is controlled, but think about this: there will still be order-of-magnitude differences between the features after flat normalization. It is just that the overall order of magnitude of every feature becomes smaller than before flat normalization is applied. Agree? If the overall order of magnitude is smaller, then it’s natural that a learning rate that would not work before flat normalization becomes feasible.
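One way to see this: flat normalization applies the same affine map to every entry, so it shrinks all features but leaves the ratio between their ranges untouched. A small sketch with made-up numbers:

```python
import numpy as np

X = np.array([[500000.0, 3.0],
              [300000.0, 4.0],
              [700000.0, 2.0]])  # Pricing ~ 10^5, Rooms ~ 10^0 (hypothetical)

mu, sigma = X.mean(), X.std()      # single mu, sigma over all entries
X_flat = (X - mu) / sigma

# Both columns shrink, but dividing by the same sigma preserves their range ratio
spread = np.ptp(X_flat, axis=0)    # per-feature range after flat normalization
print(spread[0] / spread[1])       # still ~10^5, exactly as before scaling
```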

Perhaps the table below will summarize well what one can expect to see in terms of order of magnitude?

| Feature | No normalization | Flat normalization | Column-wise normalization |
|---------|------------------|--------------------|---------------------------|
| Pricing | 10^5             | 10^0               | 10^0                      |
| Rooms   | 10^0             | 10^{-5}            | 10^0                      |

I think one general idea is that, if there is a significant order-of-magnitude difference among features, we need a small enough learning rate to avoid divergence, and because of that, the learning process will be slow.

YASS. Actually, I got an inspiration from @Juan_Olano’s previous response. Combining it with @rmwkwok’s answer, I believe flattened normalization can be too aggressive on some features.

The contour animation provided by @AbdElRhaman_Fakhry in the beginning shows that, ideally, normalizing each feature separately results in a more circular contour. Flattened normalization may over-compress some features, so that the contour looks like an ellipse again.