Why is it a problem if one feature achieves its lowest value faster than the others? I suppose we’ll need more iterations for the other parameters to converge, which requires more resources, but are there other disadvantages that I am missing?
If all features of a model are within a certain range, say, 95-105, is feature scaling still recommended?
For the first question about one feature achieving its lowest value faster than others, could you elaborate on why we wouldn’t be able to use a large learning rate?
For the second question, are there any guidelines on what ranges are considered large?
Again for the second question, if values for all features lie within a small range, is feature scaling still necessary?
This post discusses the relationship between learning rate size, feature scales, and the regularity of the cost contours. You will also notice that the discussion is based on a lecture slide, so you might want to review those lectures again for more explanation.
The key to feature scaling is for all features to span a similar range, not the “best” range and not necessarily a small range. Usually people apply one of the first three methods in this Wikipedia section to all features for the job. Those three methods all result in a “small” range around zero, but being small is not the point; having a similar range across all scaled features is.
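For reference, here is a minimal NumPy sketch of those methods, assuming the first three in that Wikipedia section are min-max rescaling, mean normalization, and standardization (the toy feature matrix below is just made up for illustration):

```python
import numpy as np

# Hypothetical feature matrix: each column is one feature, on a different scale.
X = np.array([[100.0, 3.0, 2000.0],
              [ 98.0, 1.5, 5000.0],
              [104.0, 4.2, 1200.0]])

# Min-max rescaling: maps each feature into [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Mean normalization: centers each feature around 0, divided by its range.
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean, unit standard deviation per feature.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_standard.round(2))
```

Whichever of the three you pick, the important outcome is the same: after scaling, all columns end up with comparable spreads, so no single feature dominates the gradient.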
If you scale all features to a similar range, you get a more regular cost contour (as exemplified in the lecture slide quoted in the linked post); if you exempt some features from scaling, you risk ending up with a less regular contour.
I recommend that you try to answer your own questions by doing some real experiments on different datasets. That will give you a more concrete idea, and you will see for yourself what you are trading off when you don’t scale each and every feature using the methods mentioned in that Wikipedia section.
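If you want something concrete to start from, below is a minimal sketch of such an experiment (the synthetic dataset, learning rate, and iteration count are my own arbitrary choices, not from the course). It shows the point about large learning rates: the same learning rate that converges on standardized features overflows on the raw ones, because the large-scale feature dominates the gradient.

```python
import numpy as np

def gradient_descent(X, y, lr, iters=2000):
    """Plain batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
n = 200
# Two features on very different scales (roughly 95-105 vs 0-1).
x1 = rng.uniform(95, 105, n)
x2 = rng.uniform(0, 1, n)
y = 3 * x1 + 5 * x2 + rng.normal(0, 0.1, n)

raw = np.column_stack([np.ones(n), x1, x2])              # bias + raw features
scaled = np.column_stack([np.ones(n),
                          (x1 - x1.mean()) / x1.std(),
                          (x2 - x2.mean()) / x2.std()])   # bias + standardized features

# The same moderate learning rate diverges on the raw features (expect
# overflow warnings and inf/nan) but converges after standardization.
for name, X in [("raw", raw), ("scaled", scaled)]:
    w = gradient_descent(X, y, lr=0.1)
    print(name, "final MSE:", np.mean((X @ w - y) ** 2))
```

You can then shrink the learning rate until the raw run stops diverging and count how many more iterations it needs, which makes the trade-off in your first question quite tangible.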