Dynamic adjustment of the learning rate

Hey all,

I’ve been coding up the gradient descent stuff in Python, and I was getting annoyed with guessing a learning rate by trial and error, so I decided to quickly code up a dynamic learning-rate adjustment algorithm.

I’m sure I’m not the first one to do this. There’s likely a lot of material out on the internet about this, and maybe we’ll even cover it later in the course? But I thought I’d share what I did in case it helps others.

Basically, in my gradient descent algorithm, I monitor the cost J to see whether it goes up or down on each iteration of the loop. If the cost J increases, I cut the learning rate in half for the next iteration. If the cost J decreases, I increase the learning rate by 1% for the next iteration. If the previous iteration’s learning rate produced a decrease in cost but the current (1% larger) rate produced an increase, then the previous rate is likely close to the largest learning rate that still decreases the cost, so I fall back to that rate and stop adjusting.

I also test for convergence automatically rather than manually guessing how many iterations I need. The test is simple: I check whether the cost went down by less than 1% during the most recent iteration of the loop.

All of these parameters are adjustable, of course, but I felt that a change of less than 1% is probably a good enough threshold for most purposes.

Here is the section of my loop that deals with all of this logic:

# Determine whether or not to change the learning rate
if J > J_history[-1]:
    # The cost went up, so check whether the previous learning rate worked...
    if did_previous_learning_rate_work:
        # If so, fall back to the previous learning rate and keep it
        learning_rate = previous_learning_rate
        optimal_learning_rate_achieved = True
    else:
        # Otherwise, cut the learning rate in half
        previous_learning_rate = learning_rate
        did_previous_learning_rate_work = False
        optimal_learning_rate_achieved = False
        learning_rate /= 2
else:
    # The cost went down, so check for convergence: a drop of less than
    # 1% of the current cost counts as done
    minuscule_change = J * 0.01
    if abs(J_history[-1] - J) <= minuscule_change or J == 0:
        done = True

    # The current learning rate worked; nudge it up 1% for the next iteration
    if not optimal_learning_rate_achieved:
        previous_learning_rate = learning_rate
        learning_rate *= 1.01
        did_previous_learning_rate_work = True

When the “done” variable is set to True, my loop knows to break.
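In case the context helps, here’s a rough outline of how a fragment like this sits inside the full loop. This is a simplified sketch, not my exact code; the compute_cost and compute_gradient helpers below are minimal stand-ins for the usual linear-regression cost and gradient functions:

import numpy

def compute_cost(X, y, w, b):
    # Mean squared error cost for linear regression
    errors = X @ w + b - y
    return (errors @ errors) / (2 * len(y))

def compute_gradient(X, y, w, b):
    # Gradients of the cost with respect to w and b
    errors = X @ w + b - y
    return (X.T @ errors) / len(y), errors.mean()

def gradient_descent_with_dynamic_rate(X, y, learning_rate=0.1, max_iterations=10000):
    w = numpy.zeros(X.shape[1])
    b = 0.0
    J_history = [compute_cost(X, y, w, b)]
    previous_learning_rate = learning_rate
    did_previous_learning_rate_work = False
    optimal_learning_rate_achieved = False
    done = False

    for _ in range(max_iterations):
        # One ordinary gradient descent step
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - learning_rate * dj_dw
        b = b - learning_rate * dj_db
        J = compute_cost(X, y, w, b)

        # ...the learning-rate/convergence logic shown above goes here...

        J_history.append(J)
        if done:
            break

    return w, b, J_history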

To illustrate the results, I used the extremely small/simple dataset from the course:

X_train = numpy.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]])
y_train = numpy.array([460, 232, 178])

After normalizing the features with z-score normalization, I passed the data into the gradient descent algorithm.
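For anyone following along, this is the normalization step I mean (a minimal sketch; the function name is just my own):

import numpy

def zscore_normalize_features(X):
    # Column-wise z-score: subtract each feature's mean, divide by its std
    mu = numpy.mean(X, axis=0)
    sigma = numpy.std(X, axis=0)
    return (X - mu) / sigma, mu, sigma

X_norm, mu, sigma = zscore_normalize_features(X_train)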

I then ran the algorithm 3 different ways:

  1. The normal gradient descent algorithm with no dynamic adjustment of the learning rate. I set the default learning rate to 0.1. This algorithm finished/converged after 329 iterations. The final cost J = 2.8x10^-26.
  2. Gradient descent with dynamic learning rate adjustment. I set the initial learning rate to 0.1. This algorithm finished/converged after 143 iterations. The final cost J = 1.6x10^-27.
  3. Gradient descent with dynamic learning rate adjustment. I set the initial learning rate to 2 (purposefully very large). The algorithm finished/converged after (surprisingly) 32 iterations, with a final cost J = 1.7x10^-3.

Finally, I tried predicting the price of a house with each of these models. The data I used to predict was from the course:

array([1200,    3,    1,   40])

  1. Model 1 (no dynamic adjustment, a = 0.1): predicted house price = $281,683
  2. Model 2 (dynamic adjustment, initial a = 0.1): predicted house price = $281,683
  3. Model 3 (dynamic adjustment, initial a = 2.0): predicted house price = $281,696

All seem pretty comparable.
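For reference, the prediction step is just: normalize the query with the training-set mean and standard deviation, then apply the learned parameters. A sketch (w and b are whatever gradient descent returned; mu and sigma are the training-set statistics from the normalization step):

x_query = numpy.array([1200, 3, 1, 40])
x_query_norm = (x_query - mu) / sigma             # reuse the training-set statistics
predicted_price = numpy.dot(x_query_norm, w) + b  # in $1000s for this dataset
print(f"Predicted price: ${predicted_price * 1000:,.0f}")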

Anyway, I was curious, especially to hear from those who have done a bit more machine learning: is this a fairly reasonable way to do dynamic learning rate adjustment? What other methods have been tried?


Congratulations on your experiment. That’s definitely worth doing.

You are correct that this isn’t a new concept.
Later in the course you’ll learn about the Adam optimization method, which varies the learning rate as training progresses.
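Just to give a taste of the idea, here is a rough sketch of the Adam update for a single parameter vector (not the course’s implementation):

import numpy

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Keep exponentially decaying averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the averages, then take a step whose effective size
    # adapts separately for each parameter
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (numpy.sqrt(v_hat) + eps)
    return w, m, v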

Well done @David_Pruitt!

It’s definitely worth doing this to develop your sense of hyperparameters like the learning rate. The only suggestion I have for you is: let an adjustment algorithm like yours help you learn faster what an appropriate range for the learning rate could be, but don’t rely on it, because you will come across many more hyperparameters as you move on.

When there are too many hyperparameters, you will find it very time-consuming to run an adjustment algorithm on a problem with a large dataset. There are other adjustment methods that you can modify to make dynamic, such as Grid Search or Bayesian optimization using a Gaussian process, but they are not necessarily smarter than your way. All in all, I have not seen any game-changing dynamic or automatic hyperparameter-search technique, so I would suggest not going too far into it; developing your sense of each hyperparameter is perhaps the most important thing.
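To make that concrete, a plain grid search over just the learning rate could look like this (a toy sketch; run_gradient_descent here is a placeholder for whatever training function you use):

# Toy grid search over the learning rate; run_gradient_descent is a
# placeholder for your own training function returning a cost history
best_lr, best_cost = None, float("inf")
for lr in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    _, _, J_history = run_gradient_descent(X_norm, y_train, learning_rate=lr)
    if J_history[-1] < best_cost:
        best_lr, best_cost = lr, J_history[-1]
print(f"Best learning rate on this grid: {best_lr}")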

For example, maybe your adjustment algorithm can help show that the usable range of learning rates becomes more stable when we normalize the features.

Keep trying!
Raymond