Hi Rohit! I’m assuming you’re talking about the Tune the Learning Rate section of one of the notebooks. If so, it’s important to note that the objective there is simply to tune that hyperparameter. The actual training of the model and the predictions come after that section.
As you know, when you set up the model for training, you will want to choose a good learning rate so that training converges faster. You might have encountered different ways to optimize this in your studies, one of which is learning rate decay, where you start with a relatively large learning rate and then gradually decrease it as training progresses.
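For contrast, here is a minimal sketch of learning rate decay in plain Python. The starting value (0.1) and the per-epoch multiplier (0.9) are arbitrary picks for illustration, not values from the course:

```python
# Exponential learning rate decay: start relatively large, then shrink
# the rate geometrically every epoch as training progresses.
# initial_lr and decay_rate are illustrative values, not from the course.
initial_lr = 0.1
decay_rate = 0.9

def decayed_lr(epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_lr * decay_rate ** epoch

# First few epochs: 0.1, 0.09, 0.081, ... steadily decreasing.
schedule = [decayed_lr(e) for e in range(5)]
```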
For this course, however, Laurence presents a simpler approach. He just wants to find a fixed learning rate that works better than a random guess. To do that, he “trains” the model while using a learning rate scheduler to try out a range of values. He starts with small values, which will most likely make training too slow, then gradually increases them until he reaches the other extreme, where training usually diverges.
The loss is recorded at each of these steps and visualized. From the resulting graph, he can see which range performs well by looking at where the loss is both decreasing and stable. In the lectures, he recommends choosing a value at the valley of the graph. The model is then re-initialized in the next section and trained with this chosen learning rate.
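Here is a rough sketch of that sweep. The exponential ramp `1e-8 * 10**(epoch / 20)` mirrors the kind of scheduler used in the notebooks (check yours for the exact constants), but the one-parameter quadratic “model” is my own stand-in, so the exact location of the valley below is an artifact of the toy, not of his network:

```python
# Learning-rate sweep: raise the lr a little every epoch and record
# (lr, loss) pairs, exactly the data the notebook plots before picking
# the valley. The quadratic loss f(w) = (w - 3)**2 is a toy stand-in.
def lr_for_epoch(epoch):
    # Exponential ramp, similar in shape to the course's scheduler.
    return 1e-8 * 10 ** (epoch / 20)

def sweep(epochs=200):
    w = 0.0
    history = []
    for epoch in range(epochs):
        lr = lr_for_epoch(epoch)
        grad = 2 * (w - 3)      # d/dw of (w - 3)**2
        w -= lr * grad          # one gradient-descent "training" step
        history.append((lr, (w - 3) ** 2))
    return history

history = sweep()
# The "valley": the lr whose recorded loss was lowest. Tiny lrs barely
# move the loss, the largest ones blow it up, and the good range sits
# in between -- the same qualitative picture as the notebook's graph.
best_lr, best_loss = min(history, key=lambda pair: pair[1])
```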
I hope this clarifies what Laurence is doing and how it differs from learning rate decay. You can even combine the two, as shown in one of the later notebooks. Hope this helps!
Optional:
As an aside, you might wonder why his hyperparameter tuning code does not re-initialize the model every time the learning rate is increased. Re-initializing would give a cleaner picture of the losses because each measurement would not depend on the previous iteration of training. For example, with the code as it stands, the learning rate at epoch #5 is applied to the weights already adjusted after epoch #4, which will give a different result than if the weights were reset at every epoch.
I wasn’t able to ask Laurence personally, but I think it has to do with keeping the code simple and fast. Since he is mainly after the valley of the graph and is only tuning one hyperparameter, I don’t think adding the complexity of re-initializing the model each time would make a huge difference: the small learning rates will still be slow, and the large ones will still diverge. If you re-initialize per epoch, the tuning step will run more slowly, while the training in the next section will give more or less the same result. Again, he just wants to show a method that works better than arbitrarily guessing a learning rate, and judging from the notebook results, it does: training converges faster, especially in the initial epochs. I think re-initializing the model for each adjustment matters more for other hyperparameters, or when you are tuning more than one at a time.
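To see why the picture stays qualitatively similar either way, here is a toy comparison of the two sweeps on a one-parameter quadratic loss (my own stand-in, not the course’s model), with and without resetting the weight before each learning rate is tried:

```python
# Compare the lr sweep with carried-over weights (what the notebook does)
# against a sweep that re-initializes before every lr. The toy loss is
# f(w) = (w - 3)**2; the lr ramp follows the notebook-style exponential.
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

def sweep(epochs, reinitialize):
    w = 0.0
    history = []
    for epoch in range(epochs):
        if reinitialize:
            w = 0.0                      # fresh "weights" for every lr tried
        lr = 1e-8 * 10 ** (epoch / 20)
        w -= lr * grad(w)                # one training step at this lr
        history.append((lr, loss(w)))
    return history

carried = sweep(200, reinitialize=False)
fresh = sweep(200, reinitialize=True)
# Both sweeps agree qualitatively: the tiny learning rates barely move
# the loss, and the largest ones blow it up, so the valley shows up in
# roughly the same region either way.
```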