The learning rate (alpha) controls the size of the step gradient descent takes in the direction of the cost function’s derivative when searching for the minimum cost. My understanding is that when alpha is too large, the cost function will diverge; when it’s too small, many more steps than necessary are needed to reach convergence. In practice, what values of alpha do you start with, and how do you vary that value to find an ‘optimum’ alpha?
P.S. I read the posts on hyperparameter tuning for lambda (the regularization parameter), but I couldn’t find a post for the alpha (learning rate) parameter.
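For reference, the update I’m describing is the standard gradient descent step, where alpha scales the derivative:

$$ w := w - \alpha \frac{\partial J}{\partial w} $$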
Hey @alice.m
It’s one of those parameters we need to play around with to figure out the sweet spot. This is where outputting the cost every so often is helpful: it lets you gauge whether you’re converging fast enough.
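A minimal sketch of that kind of monitoring (here compute_cost and compute_gradient are placeholders for your own model’s functions):

```python
def gradient_descent(X, y, w, b, alpha, num_iters, compute_cost, compute_gradient):
    """Fixed-rate gradient descent that prints the cost every so often."""
    for i in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw          # step size scaled by alpha
        b = b - alpha * dj_db
        if i % max(1, num_iters // 10) == 0:   # ~10 progress printouts
            print(f"iter {i:5d}: cost = {compute_cost(X, y, w, b):.6f}")
    return w, b
```

If the printed cost is climbing instead of falling, alpha is too large.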
There isn’t a rule that dictates what the initial value of alpha should be, but personally I’ve found that in many, many projects alpha starts at 0.1 or 0.01. Rarely it’s as high as 1.2, and in those cases the weights (W and b) and the training-example values were very large.

If you want to start with a large alpha and have it decay at a specific rate after some iterations, read about learning rate decay (also called learning rate scheduling); it’s one of several methods that change the value of alpha as training progresses. My advice is to try different values of alpha starting from 0.1: if you find that your model is slow or only converges after many iterations, increase alpha by a small amount until the model runs quickly and still converges, and vice versa.
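As a sketch of how a schedule plugs into training (grad_fn and schedule are placeholders; any function of the iteration count can serve as the schedule):

```python
def train_with_schedule(w, grad_fn, schedule, alpha0, num_iters):
    """Gradient descent where alpha is recomputed from a schedule each step."""
    for i in range(num_iters):
        alpha = schedule(alpha0, i)    # decayed alpha for this iteration
        w = w - alpha * grad_fn(w)     # same update rule, shrinking step size
    return w
```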
I haven’t hit the lessons on Adaptive Learning Rates yet, but your feedback makes total sense. After reading through a few Quora/Reddit posts, I see that most of them start the learning rate at 0.1 or 0.01 and then apply different types of adaptive learning rate schedules (step, time-based, exponential decay). I’ll need to read more about this, but I really appreciate your feedback in guiding me to learn more about how this parameter is tuned. It sounds like the number of layers and neurons in the neural model, the size of the data values, etc. all influence the initial value of alpha as well as the type of adaptive schedule you select to tune it.
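For my own notes, the common textbook forms of those three schedules look roughly like this (the constants are placeholders, not recommendations):

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Step decay: cut alpha by a fixed factor every few epochs."""
    return alpha0 * (drop ** (epoch // epochs_per_drop))

def time_based_decay(alpha0, epoch, decay_rate=0.01):
    """Time-based decay: alpha shrinks in proportion to elapsed epochs."""
    return alpha0 / (1.0 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, k=0.1):
    """Exponential decay: alpha falls off as exp(-k * epoch)."""
    return alpha0 * math.exp(-k * epoch)
```

Any of these could be passed as the schedule function in the sketch above.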
If you do not have normalized features, then there’s no telling what an optimum learning rate might be.
If the features are normalized, then it’s a good bet that the best learning rate will be < 1.0.
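For anyone landing here, a quick z-score normalization sketch (assuming X is a NumPy array with one row per training example):

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```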
For a broad evaluation, you can use a ratio of 1:3:10 for the rate increments - that is an easy approximation of a log progression.
So you might use a sequence of [0.01, 0.03, 0.1, 0.3, 1.0] and see how it works. For a simple linear or logistic regression, the solution isn’t very sensitive to optimizing the learning rate. Just find one that doesn’t cause divergence, then keep increasing the number of iterations as necessary.
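A sketch of that sweep, assuming a hypothetical run_gradient_descent(alpha) that returns the final cost (a non-finite cost signals divergence):

```python
import numpy as np

def sweep_alphas(run_gradient_descent, alphas=(0.01, 0.03, 0.1, 0.3, 1.0)):
    """Try each candidate rate and report which ones converge."""
    for alpha in alphas:
        final_cost = run_gradient_descent(alpha)
        status = "ok" if np.isfinite(final_cost) else "diverged"
        print(f"alpha = {alpha:4.2f} -> final cost = {final_cost:.6f} ({status})")
```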
Note that if you have an NN, you’re probably not going to be using fixed-rate gradient descent to find the solution. There are much more computationally efficient tools available; you’ll learn about those during the course.
Hi @alice.m I just want to add that there are techniques that help you find the best hyperparameters for your dataset and model. So it would be worth exploring those tools once you have a good understanding of how to do it manually.