Help understanding Learning Rate Scheduler

Can someone please help me understand why the learning rate is set to increase in the following code?
lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-8 * 10**(epoch / 20))

Should we not decrease the learning rate (learning rate decay) as we proceed with training?

Hi Rohit! I’m assuming you’re talking about the “Tune the Learning Rate” section of one of the notebooks. In that case, it is important to note that the objective there is to tune that hyperparameter. The actual training of the model and the predictions come after that section.

As you know, when you set up a model for training, you want to choose a good learning rate to make training faster. You might have encountered different ways to optimize this in your studies, one of which is learning rate decay, where you choose a relatively large learning rate and then gradually decrease it as training progresses.
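Just to illustrate decay (this is not the course code, and the numbers are arbitrary placeholders), Keras has built-in schedules such as ExponentialDecay:

```python
import tensorflow as tf

# Learning rate decay sketch: start relatively large (0.1 here) and
# multiply the rate by 0.96 every 1000 optimizer steps.
lr_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_decay)
```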

For this course, however, Laurence presents a simpler approach. He only wants to determine a fixed learning rate, and it has to be better than a random guess. To do that, he “trains” the model while using the learning rate scheduler to try out a range of values. He starts with small values, which will most likely be too slow. The later ones gradually improve until the other extreme is reached, where training usually diverges.

The loss is recorded at each of these steps and visualized. From the resulting graph, he can see which range performs well by looking at where the loss is decreasing and stable. In the lectures, he recommends choosing a value at the valley of the graph. The model is then re-initialized in the next section and trained with this chosen learning rate.
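To make that concrete, here is a rough sketch of the tuning step under my own assumptions (toy data and a one-layer model as stand-ins; the notebook uses a windowed time series and its own model):

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Toy placeholders for the notebook's dataset and model.
x = np.random.rand(1000, 20).astype("float32")
y = x.sum(axis=1, keepdims=True)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=[20])])
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9))

# At epoch e the callback sets lr = 1e-8 * 10**(e / 20), sweeping
# from 1e-8 up to about 1e-3 over 100 epochs.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10 ** (epoch / 20))
history = model.fit(x, y, epochs=100, callbacks=[lr_schedule], verbose=0)

# Recover the rate used at each epoch with the same formula, then plot
# loss vs. learning rate and pick a value near the valley, before the
# curve blows up.
lrs = 1e-8 * (10 ** (np.arange(100) / 20))
plt.semilogx(lrs, history.history["loss"])
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.show()
```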

I hope this clarifies what Laurence is doing and how it differs from learning rate decay. You can even combine the two, as shown in one of the next notebooks. Hope this helps!


Optional:

As an aside, you might wonder why his hyperparameter tuning code does not re-initialize the model every time the learning rate is increased. Re-initializing would give a better picture of the losses, because each measurement would not depend on the previous iterations of training. For example, with the code as it is now, the learning rate at epoch #5 is applied to the weights already adjusted after epoch #4. That will indeed give a different result compared to starting every epoch from the same weights.

I wasn’t able to personally ask Laurence, but I think it might have to do with simplifying and speeding up the code. Since he’s mainly after the valley of the graph and is only tuning one hyperparameter, I don’t think adding the complexity of re-initializing the model every time would make a huge difference. The small learning rates will still be slow, and the large ones will still diverge. If you re-initialize per epoch, the hyperparameter tuning step will run more slowly, while the training in the next section will give more or less the same result. Again, he just wants to show a method that works better than arbitrarily guessing a learning rate. Judging from the results of the notebooks, the method works in that regard, because it makes training converge more quickly, especially in the initial epochs. I think re-initializing the model for each adjustment might matter more for other hyperparameters, or when you are tuning more than one of them. A rough sketch of that variant is below.
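For reference, a hypothetical re-initializing variant might look like this (again with toy stand-ins; this is not from the notebook). Each candidate rate trains a fresh model for one epoch, so every loss measurement starts from the same point:

```python
import numpy as np
import tensorflow as tf

# Hypothetical variant: train a fresh model at each candidate learning
# rate so no measurement inherits weights from a previous rate.
# Much slower, since every candidate starts from scratch.
x = np.random.rand(1000, 20).astype("float32")
y = x.sum(axis=1, keepdims=True)

candidate_lrs = 1e-8 * (10 ** (np.arange(100) / 20))
losses = []
for lr in candidate_lrs:
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=[20])])
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.SGD(learning_rate=float(lr),
                                                    momentum=0.9))
    history = model.fit(x, y, epochs=1, verbose=0)
    losses.append(history.history["loss"][-1])
```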


Thanks @chris.favila for the explanation, especially regarding not re-initializing the model after the learning rate is changed. It would be a good exercise to see the impact the updated weights have on the final learning rate.

Thanks Chris. It is clear to me now.