I just watched the first video in Feature Scaling, and this is what I understand: during learning, we compute the cost. Using the cost, we update the values of the weights so that they are increased or decreased according to the learning rate. These optimal weights give us the values/predictions that result in the least error. Since the features have different ranges, we can't use the same constant (learning rate) to minimize them, hence we rescale those features.
Is my understanding correct?
If it is, then why can’t we use different learning rates for different features? If feature x1 has a larger range than feature x2 then surely we can use different learning rates for x1 and x2 instead of rescaling them, right?
P.S. If, for experimental reasons, we used different learning rates, could you please explain whether a1 or a2 would be larger?
For your last question, I won't give you everything, so be ready for that.
First things first:
The learning rate does not control the direction. The learning rate is always positive. The gradient can be positive or negative at each training step, so the gradient controls the direction. Therefore, I wouldn't say "according to the learning rate".
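To make that concrete, here is a tiny Python sketch (my own illustration, not course code) of the update rule w := w - \alpha \cdot \text{gradient}. The numbers are made up; the point is that \alpha stays positive and only the gradient's sign decides whether the weight goes up or down:

```python
# The update w := w - alpha * grad.
# alpha is always positive; the sign of the gradient decides the direction.
alpha = 0.1          # learning rate, always positive
w = 2.0

for grad in (+3.0, -3.0):          # one positive and one negative gradient
    new_w = w - alpha * grad
    direction = "decreases" if new_w < w else "increases"
    print(f"grad = {grad:+.1f}: w goes from {w} to {new_w} ({direction})")
# grad = +3.0 -> w decreases; grad = -3.0 -> w increases
```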
Now your second question:
We can. We just don't. If rescaling can save us from setting one learning rate per weight (strictly speaking, it would be one per weight, not one per feature), then why bother setting many learning rates? Don't forget that we can have many weights.
Also, how would you decide those learning rates? By the ranges of the features? If we have to consider the ranges to set a learning rate for each weight, then why not just scale the features, since scaling also takes the ranges into account? After all, managing and tuning one learning rate is easier than managing and tuning many.
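For reference, this is roughly what scaling looks like in practice. A minimal sketch using z-score standardization, with made-up numbers for two features of very different ranges:

```python
import numpy as np

# Hypothetical raw features with very different ranges,
# e.g. house size in square feet and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [ 852.0, 2.0],
              [3000.0, 4.0]])

# Z-score standardization: each column ends up with mean 0 and std 1,
# so a single learning rate works comparably well for every weight.
mu    = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))   # ~[0, 0]
print(X_scaled.std(axis=0))    # ~[1, 1]
```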
This is the part I will leave to you to experiment with, and/or to think about, because you can get the answer yourself by experimentation.
For the "think" path, I can give you something to start with. Recall that the following is the gradient formula for a weight:

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
The change of a weight does not depend only on the learning rate \alpha, but also on the gradient:

$$w := w - \alpha \frac{\partial J}{\partial w}$$
If you look at the formula, the gradient is proportional to x^{(i)} (which in turn is affected by the feature's range).
Alright, that's all I will say. If you come up with any theory from the "think" path, I suggest you test that theory by experimenting with some data whose features have quite different ranges.
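If you want a quick scaffold for that experiment, something like this could work. It's a sketch with synthetic data (the ranges and coefficients are arbitrary); it computes the per-weight gradients once so you can compare their magnitudes, and leaves the conclusion about the learning rates to you:

```python
import numpy as np

# x1 has a much larger range than x2; the target is a known linear combination.
rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(0, 1000, m)     # large-range feature
x2 = rng.uniform(0, 1, m)        # small-range feature
X = np.column_stack([x1, x2])
y = 3 * x1 + 5 * x2 + 7

w = np.zeros(2)
b = 0.0

# One gradient computation: dJ/dw_j = (1/m) * sum((f(x^(i)) - y^(i)) * x_j^(i))
err = X @ w + b - y
grad_w = (X.T @ err) / m
grad_b = err.mean()

# Compare the magnitudes of the two per-weight gradients, then try
# different (per-weight) learning rates and see which one must be smaller.
print("gradient for w1:", grad_w[0])
print("gradient for w2:", grad_w[1])
```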