In Week 2, mini-batch learning rate decay was proposed with the formula:
alpha = (const / sqrt(t)) * initial_alpha, where t is the mini-batch number. Doesn't this mean that at the beginning of each epoch the learning rate alpha is reset to (const * initial_alpha)? I don't fully get how this can help as we advance through the training epochs.
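Here is a tiny sketch of how I'm reading the formula (the variable names and values are my own, just for illustration):

```python
import numpy as np

initial_alpha = 0.1
const = 1.0
batches_per_epoch = 100

for epoch in range(3):
    for t in range(1, batches_per_epoch + 1):   # does t restart at 1 every epoch?
        alpha = (const / np.sqrt(t)) * initial_alpha
    print(f"epoch {epoch}: alpha went from {const * initial_alpha:.3f} "
          f"down to {alpha:.4f}")
```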
Please provide a reference to the lecture / notebook where you see the mini-batch number in the denominator. Here's the formula from Course 2, Week 2, Assignment 2:
$$\alpha = \frac{1}{1 + decayRate \times epochNumber} \, \alpha_{0}$$
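Note that the denominator uses the epoch number, not the mini-batch number, so the rate only shrinks as training progresses and never resets within an epoch. A minimal sketch of that schedule (the function name here is my own, not the assignment's):

```python
def update_lr(alpha0, decay_rate, epoch_num):
    """Epoch-based decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Example: with alpha0 = 0.2 and decay_rate = 1.0, the learning rate
# decreases monotonically with the epoch number.
for epoch_num in range(5):
    print(epoch_num, update_lr(alpha0=0.2, decay_rate=1.0, epoch_num=epoch_num))
```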
Have you heard of *Cyclical Learning Rates for Training Neural Networks*?
So it’s basically meant to help the model escape plateau regions like saddle points, right?
Also, it's application dependent; it might work for some models and not for others, right?
You’re correct about escaping the plateau regions using cyclical learning rates.
Here’s the text from the conclusion section of the paper which states that the author plans to evaluate this approach on other architectures:
> This work has not explored the full range of applications for cyclic learning rate methods. We plan to determine if equivalent policies work for training different architectures, such as recurrent neural networks. Furthermore, we believe that a theoretical analysis would provide an improved understanding of these methods, which might lead to improvements in the algorithms.
Where does the paper state the kinds of architectures for which this approach is unsuitable?
As far as using this approach is concerned, here are the hyperparameters you would need to choose for a particular architecture (a small sketch follows the list):
- Minimum and maximum learning rates for your optimizer.
- Linear / exponential / any other approach for varying the learning rate.
- Number of batches across which the cycle should occur (this is usually set to the number of batches per epoch).
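For concreteness, here is a minimal sketch of the triangular policy from the paper built from those three choices; the parameter names and default values are my own, not prescribed by the course:

```python
import math

def cyclical_lr(batch_idx, min_lr=1e-4, max_lr=1e-2, step_size=100):
    """Triangular cyclical learning rate.

    batch_idx : global mini-batch counter (not reset each epoch)
    step_size : number of batches in half a cycle (often batches per epoch)
    """
    cycle = math.floor(1 + batch_idx / (2 * step_size))
    x = abs(batch_idx / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1 - x)

# The rate rises linearly from min_lr to max_lr over step_size batches,
# then falls back to min_lr, and the cycle repeats.
for b in range(0, 401, 100):
    print(b, cyclical_lr(b))
```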
Thanks a lot
