Yes, it is controlled, but consider this: there will still be order-of-magnitude differences between the features after flat normalization. The only change is that the overall scale of every feature becomes smaller than it was before flat normalization was applied. Agree? And if the overall scale is smaller, it is natural that a learning rate that would have diverged before flat normalization can become feasible.
Perhaps the table below summarizes what one can expect to see in terms of orders of magnitude:
| Feature | No normalization | Flat normalization | Column-wise normalization |
|---|---|---|---|
| Pricing | 10^5 | 10^0 | 10^0 |
| Rooms | 10^0 | 10^-5 | 10^0 |
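To make that concrete, here is a minimal NumPy sketch. It assumes "flat normalization" means dividing the whole matrix by one global maximum, while column-wise normalization divides each column by its own maximum (the data is made up for illustration):

```python
import numpy as np

# Toy design matrix: column 0 is pricing (~10^5), column 1 is rooms (~10^0).
X = np.array([[250_000.0, 3.0],
              [400_000.0, 4.0],
              [120_000.0, 2.0]])

# Flat normalization: one global scale for the whole matrix.
# Both columns shrink by the same factor, so the gap between them survives.
X_flat = X / np.abs(X).max()

# Column-wise normalization: each feature gets its own scale,
# so every column ends up around 10^0.
X_col = X / np.abs(X).max(axis=0)

print(X_flat)  # pricing ~10^0, rooms ~10^-5
print(X_col)   # pricing ~10^0, rooms ~10^0
```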
I think the general idea is this: if there is a significant order-of-magnitude difference among the features, the learning rate must be small enough to avoid divergence along the large-scale feature directions, and at that step size the small-scale directions barely move, so the learning process as a whole is slow.
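A toy sketch of that effect, using synthetic data and a hypothetical helper `gd` that is just plain full-batch gradient descent on a mean-squared-error loss: a step size that diverges on the raw 10^5-vs-10^0 features converges once the columns are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
pricing = rng.uniform(1e5, 5e5, n)            # feature on the 10^5 scale
rooms = rng.integers(1, 6, n).astype(float)   # feature on the 10^0 scale
X = np.column_stack([pricing, rooms])
y = 1e-5 * pricing + 0.5 * rooms + rng.normal(0, 0.1, n)

def gd(X, y, lr, steps=1000):
    """Full-batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        if not np.all(np.isfinite(w)):
            return "diverged"
    return w

X_col = X / np.abs(X).max(axis=0)  # column-wise normalization

print(gd(X, y, lr=0.1))      # diverges: curvature along pricing is ~10^10 larger
print(gd(X_col, y, lr=0.1))  # converges: both directions have comparable scale
```

The raw matrix diverges because the curvature along the pricing direction scales with pricing squared (~10^10), so any step size stable there is far too small to make progress along rooms; normalizing the columns removes that mismatch.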