Ok so I just watched the two videos in the Machine Learning Specialization about feature scaling, and I am wondering whether the same thing can be achieved by using different learning rates for the different independent variables, in accordance with their different ranges?
Also, is it necessary that our learning rate be the same for the different variables?
Welcome to the community, Numair,
The concept of feature scaling is to bring all the independent variables (features) into the same range, which helps gradient descent converge to the minimum faster. See the picture attached. If features are in very different ranges, such as salary (in lakhs) and age (1–100), the larger salary values could give that feature more importance, which we don’t want. When the features are given in the same range, the model decides which features have the highest impact on the target as it learns.
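To make this concrete, here is a small sketch of z-score standardization, one common way to bring salary and age onto the same scale. The numbers are made up for illustration:

```python
import numpy as np

# Hypothetical values: salary in lakhs, age in years.
salary = np.array([8.0, 12.0, 25.0, 40.0])
age = np.array([25.0, 32.0, 45.0, 58.0])

def standardize(col):
    # z-score: subtract the mean, divide by the standard deviation
    return (col - col.mean()) / col.std()

salary_z = standardize(salary)
age_z = standardize(age)

# After scaling, both features have mean ~0 and standard deviation 1,
# so neither dominates the gradient just because of its units.
print(salary_z.mean().round(6), salary_z.std().round(6))
print(age_z.mean().round(6), age_z.std().round(6))
```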
Coming to the concept of learning rates: the learning rate comes up when you are updating the parameters in backpropagation. It controls how big or how small you want each update to be, which is why it doesn’t make sense to specify a learning rate for each independent feature.
If this helps, backpropagation works as follows:
- Forward pass through all the examples in the training set
- Calculate the error, which is the cost function (the sum of losses across the examples)
- Calculate the gradients
- Update the parameters using gradients with a learning rate
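The four steps above can be sketched as a minimal gradient-descent loop for linear regression. The data and variable names here are illustrative, not from the course code:

```python
import numpy as np

# Synthetic, noiseless linear data: y = X @ true_w + 4.0
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))           # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0

w = np.zeros(3)
b = 0.0
alpha = 0.1                             # one shared learning rate
m = len(y)

for epoch in range(500):
    y_hat = X @ w + b                        # 1. forward pass over all examples
    cost = np.mean((y_hat - y) ** 2) / 2     # 2. cost (half mean squared error)
    grad_w = (X.T @ (y_hat - y)) / m         # 3. gradients of the cost
    grad_b = np.mean(y_hat - y)
    w -= alpha * grad_w                      # 4. update with the learning rate
    b -= alpha * grad_b

print(np.round(w, 2), round(b, 2), round(cost, 6))
```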
Instead of using a different learning rate for each feature, you can reduce the learning rate after each epoch (one full pass through the training data) or after every few epochs. As you get closer to the minimum, the updates become smaller; if you also reduce the learning rate, this helps you converge to the minimum faster and more accurately.
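For example, one simple decay schedule of the kind described above; the decay factor and interval here are arbitrary choices for illustration:

```python
# Shrink alpha every few epochs so steps get smaller near the minimum.
def decayed_lr(initial_lr, epoch, decay_rate=0.5, every=10):
    """Multiply the learning rate by `decay_rate` every `every` epochs."""
    return initial_lr * (decay_rate ** (epoch // every))

print(decayed_lr(0.1, 0))    # 0.1
print(decayed_lr(0.1, 10))   # 0.05
print(decayed_lr(0.1, 25))   # 0.025
```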
Cheers,
Ajay
In practice that would be rather difficult, as there is no way to know in advance what rate to apply to each feature. If there are hundreds of features, that’s an impossible task.
Hello @Numair,
If we look at the graphs and explanation shared by Ajay, we can see the underlying reason for applying feature scaling. If we were to tackle the problem with different learning rates, the way we set those learning rates would probably also have to address that underlying reason, and in the end we would still want those learning rates to be related to each other through the features’ variances. For example, since the gradient for a weight is proportional to its feature’s values, we might want a larger learning rate for features that span a smaller range (which means a smaller variance), and a smaller learning rate for those which have a larger variance.
If we set the learning rates by the variance as I describe, we are just moving the variance factors from feature scaling to “learning rate” scaling, and we are not introducing any new ideas.
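This equivalence can be checked numerically. In the sketch below (all names and numbers are made up), gradient descent on a raw feature of standard deviation s, with the learning rate divided by s², traces exactly the same path as gradient descent on the scaled feature x/s with the original learning rate:

```python
import numpy as np

# Toy 1-feature linear regression (no intercept, for simplicity).
rng = np.random.default_rng(0)
x = rng.uniform(100.0, 500.0, size=50)        # raw feature with a large scale
y = 3.0 * x + rng.normal(0.0, 5.0, size=50)

s = x.std()          # scale factor used by feature scaling
x_scaled = x / s
alpha = 0.1          # learning rate for the scaled-feature model

def gd(features, targets, lr, steps=200):
    w = 0.0
    m = len(features)
    for _ in range(steps):
        grad = (1.0 / m) * np.sum((w * features - targets) * features)
        w -= lr * grad
    return w

v = gd(x_scaled, y, alpha)          # GD with feature scaling
w = gd(x, y, alpha / s**2)          # GD on raw feature, lr shrunk by the variance

# The two runs are the same trajectory in disguise: w == v / s.
print(np.isclose(w, v / s))
```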
As far as linear regression and logistic regression are concerned (since we are in Course 1), I think the answer is yes, and because the answer is yes, and because of what I have explained above, we can just stay with feature scaling.
However, if we think about neural networks, which we will briefly touch on in Course 2 and which are more complex models than linear regression and logistic regression, then I believe the case is quite different. You will see that some of a neural network’s learning parameters no longer attach to a single feature, which means it is difficult for us to get an informed set of learning rates from the “variance adjustment” I mentioned above. In other words, we might have to make a lot of random guesses, which is not efficient. Of course, this is just my worst-case theory and I have never experimented with it, but since feature scaling already solves the problem, I don’t see why we would look for another way that at least appears more difficult.
Theoretically speaking, it is not necessary, but practically, I have not yet seen any library developed for us to assign a different learning rate to each learning parameter, and because of this, it is necessary unless you want to develop your own library.
There is one thing that won’t be covered in this MLS that I want to share with you. Built upon the vanilla gradient descent that we are learning in MLS Course 1, there is a very popular and much more commonly used optimizer called “Adam”. Adam is also a gradient-descent-based algorithm, and one special thing about it is that it remembers something about the history of the gradients for each learning parameter, and it uses that information for future updates of the parameters. By incorporating history information, you can think of Adam as a tool that effectively adjusts the learning rate for every learning parameter individually. In simpler words, with Adam, and with feature scaling done, we set one and the same initial learning rate for every learning parameter, and then as training goes on, different parameters learn at different paces according to their gradient histories.
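For the curious, here is a bare-bones sketch of the Adam update rule, showing how gradient history gives each parameter its own effective step size even under one shared learning rate. This is an illustration, not a production optimizer:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v

# Two parameters with very different gradient magnitudes still take
# comparably sized steps, because each is normalized by its own history.
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.01])
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)   # both entries move by roughly alpha = 0.001
```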
Cheers,
Raymond
I had the same doubt as @Numair. I had also just finished the feature scaling video when I found this post, and the replies from @ajaykumar3456 and @rmwkwok really helped me get a better understanding of what’s happening, so thank you both.
Although I am convinced about feature scaling, I don’t agree with Ajay’s take on us giving more importance to some features, because that importance/bias will change as the algorithm progresses, right? I mean, the way I see it, it does not matter whether we scale the features and have equal bias for all of them, or initialize the bias matrix with random numbers. The importance will be derived from the training data regardless. Is it important in machine learning that we start with equal bias?
What convinces me is this: if the features vary in value by many orders of magnitude, the calculations will be more intensive, since the bias (w) values could get very large, which could have been avoided by feature scaling. Also, instead of doing the scaling multiplication just twice with the features/inputs, at the start and at the end, we would be doing a scaling multiplication with alpha every iteration.
I don’t fully understand the backpropagation reference because I have yet to learn about it. I do like the Adam optimizer example; the way I understood it, it’s basically a best-of-both-worlds kind of thing.
Hello @tinted,
Thank you for sharing your understanding!! One thing that I can immediately think of when you said “intensive” is the following equation for updating w_j:

$$w_j := w_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
See that the update is proportional to x^{(i)}, and therefore, without feature scaling:
- then x^{(i)} is not scaled,
- then x^{(i)} can have a very large range (100000 - 500000) or a very small range (0.00000001 - 0.00000005),
- then the gradient can be too large or too small because of the proportionality,
- and if different features have different ranges, their gradients have different scales too! As you said, some can be pretty “intensive”.
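A quick numeric illustration of that last point, using ranges like the ones above; the "errors" here are made-up stand-ins for the prediction errors:

```python
import numpy as np

# The gradient for each weight is proportional to its feature's values,
# so unscaled features produce wildly different gradient sizes.
rng = np.random.default_rng(2)
m = 100
big = rng.uniform(100_000, 500_000, size=m)   # feature with a very large range
tiny = rng.uniform(1e-8, 5e-8, size=m)        # feature with a very small range
err = rng.normal(size=m)                      # stand-in prediction errors

grad_big = np.mean(err * big)                 # gradient component for `big`
grad_tiny = np.mean(err * tiny)               # gradient component for `tiny`

print(abs(grad_big) / abs(grad_tiny))         # many orders of magnitude apart
```

With one shared learning rate, no single value of alpha suits both of these gradients at once, which is exactly why scaling the features first is the simpler fix.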
Cheers,
Raymond
PS: Just a very minor note, you said “bias(w)” and mentioned “bias” multiple times, so it seemed to me you were referring to the w in y=wx+b as bias. Just in case I understood you correctly, I wanted to point out that, in these courses, we have a different naming convention: we call w as weight(s) and b as bias, and perhaps this is also how they get their variable symbols.
When I said “intensive” I was only thinking about the weight values, but yep, as you mentioned, we also use the features/inputs in the calculations, and even that will increase computation.
Also, sorry for the confusion with the bias and weights terminology. I am very bad at remembering names, so I am naturally bad with terminology as well, but yes, by “bias” I meant the weights, w. I always thought of b as just some constant, but now that I’ve paid attention, that constant should have a name as well, and “bias” makes sense.
As I am new to this field, weights and bias seemed the same to me, but I will try to follow the naming convention from now on.