While reading about feature scaling and the need for it, I was wondering: is having different step sizes for different parameters an alternative to feature scaling? Also, what is the reason for keeping the step size in gradient descent between 0 and 1?
Using a different learning rate for each feature isn't practical in plain gradient descent: you have no principled way to choose or adjust each one individually, which is exactly the problem feature scaling solves with a single rate.
You don't necessarily have to keep the learning rate between 0 and 1, but values in that range are the most commonly used. The reason is stability: too large a step overshoots the minimum and can make the loss diverge, and once the features are scaled to comparable magnitudes, a step below 1 is usually small enough to converge.
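Here is a minimal sketch of the effect, not taken from any particular source: batch gradient descent on a least-squares problem where one feature is roughly 1000x larger than the other. The names (`fit_gd`, `lr`, the synthetic data) are all made up for illustration. With raw features, a single rate that suits one feature blows up on the other; after standardizing, the same modest rate in (0, 1) converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales: one ~N(0, 1), one ~N(0, 1000).
n = 200
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
true_w = np.array([3.0, 0.002])
y = X @ true_w + rng.normal(0, 0.1, n)

def fit_gd(X, y, lr, steps=500):
    """Plain batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = (2.0 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
        if not np.all(np.isfinite(w)):  # stop once the iterates diverge
            break
    return w

# Raw features: the large-scale feature makes the gradient huge,
# so lr=0.1 overshoots and the weights overflow.
print("raw features:   ", fit_gd(X, y, lr=0.1))

# Standardize each column; now one learning rate works for both features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled features:", fit_gd(Xs, y, lr=0.1))
```

The intuition behind the 0-to-1 rule falls out of this example: the safe step size is roughly inversely proportional to the largest curvature of the loss, and standardizing the features brings that curvature down to around 1, so rates somewhat below 1 become stable. With unscaled features you would instead need a rate on the order of 1e-6 here, dictated by the largest feature, which then crawls along the smallest one.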