Hello @u5470152,
I think @Juan_Olano has provided a very good example to illustrate the idea behind scaling. Here, Pricing spans a range of 900K with an order of magnitude of $10^5$, whereas Rooms spans a range of only 4 with an order of magnitude of $10^0$.
When we compute the gradient
$$\frac{\partial J}{\partial w} = \sum_i (\text{error}_i \times x_i),$$
we see that while $\text{error}_i$ is common to all features, $x_i$ is the real factor that makes the scale of the gradient differ from one feature to another. For Pricing, the scale of $x_i$ is on average $10^5$, and for Rooms it is $10^0$, so we can expect a $10^{5-0} = 10^5$ difference in the order of magnitude of the gradients as well. Given that we use the same learning rate for all features, we should expect the weight for Pricing to change much more dramatically than the weight for Rooms.
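To make this concrete, here is a minimal NumPy sketch (the data and variable names are made up for illustration) that evaluates the sum above for both features, with the same per-example error shared across features:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100  # number of examples

# Hypothetical data on the scales discussed above:
pricing = rng.uniform(100_000, 1_000_000, size=m)  # order 10^5
rooms = rng.integers(1, 5, size=m).astype(float)   # order 10^0
X = np.column_stack([pricing, rooms])

# Pretend every example currently has the same prediction error
error = np.ones(m)

# dJ/dw = sum_i error_i * x_i, computed per feature (one entry each)
grad = X.T @ error

print(grad)  # e.g. [~5e7, ~2.5e2]: Pricing's gradient is ~10^5 larger
```

The Pricing component of the gradient comes out roughly $10^5$ times larger, purely because of the feature's scale, which is exactly why one learning rate cannot suit both features without scaling.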
In your “old note”, you talked about “Flat normalization” and “Column-wise normalization”. I think all of us are referring to the latter in this discussion.
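For what it's worth, here is how I read those two terms, assuming “flat” means one mean/std computed over the whole matrix and “column-wise” means separate statistics per feature (my interpretation, not necessarily the exact wording in your note):

```python
import numpy as np

X = np.array([[500_000.0, 3.0],
              [900_000.0, 1.0],
              [100_000.0, 4.0]])

# "Column-wise normalization": each feature (column) gets its own
# mean/std, so Pricing and Rooms end up on comparable scales.
X_col = (X - X.mean(axis=0)) / X.std(axis=0)

# "Flat normalization": one mean/std over the whole matrix; the huge
# Pricing values dominate the statistics, so Rooms stays tiny.
X_flat = (X - X.mean()) / X.std()

print(X_col)
print(X_flat)
```

Only the column-wise version puts the two features' gradients on a similar scale, which is what matters for the discussion above.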
What do you think?
Raymond