In the week 2 lecture notes for “Learning rate decay” it looks to me like we can understand LRD as a “naive” or “training data agnostic” rule for changing alpha based on the number of iterations, as opposed to RMSprop which modifies alpha based on dW and db (separately from modifying dW and db themselves which is done by Momentum for example). RMSprop intuitively seems much more powerful than the naive approach since it is data driven, yet in the lab examples it looks like the naive approach is powerful enough to obviate other optimizations. In practice, how powerful is RMSprop compared with LRD? More generally have researchers sought to combine RMSprop and LRD into a combined, optimized alpha update rule? Somehow that feels more elegant… though this is still something of a black box to me. Thanks for any thoughts.
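To make the distinction I am drawing concrete, here is a minimal NumPy sketch of the two ideas. It assumes the inverse-time form of LRD, alpha = alpha0 / (1 + decay_rate * epoch), and the usual RMSprop update. The function names and default hyperparameters are my own illustrative choices, not taken from the lab.

```python
import numpy as np

def decayed_lr(alpha0, epoch, decay_rate=1.0):
    """Schedule-based decay: depends only on the epoch count, never on gradients."""
    return alpha0 / (1.0 + decay_rate * epoch)

def rmsprop_step(w, dw, s, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop update (hypothetical helper, no bias correction).

    s is the running average of squared gradients; dividing by sqrt(s)
    rescales the step separately for every parameter.
    """
    s = beta2 * s + (1.0 - beta2) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s

# LRD: the same alpha for every parameter, shrinking with the epoch number.
schedule = [decayed_lr(0.1, epoch) for epoch in range(3)]

# RMSprop: gradients of very different size give steps of similar size.
w0 = np.array([1.0, 1.0])
dw0 = np.array([10.0, 0.1])
w1, s1 = rmsprop_step(w0, dw0, np.zeros_like(w0))
```

In this toy call the two gradient components differ by a factor of 100, yet both parameters move by roughly the same amount, which is the "data driven" behaviour I mean. The schedule, by contrast, only knows the epoch index.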
Hi @am003e
Upfront: Adam combines Momentum and RMSprop, both of which rely on exponentially weighted moving averages (of the gradients and of the squared gradients, respectively). So it does not use a classic LRD schedule, but rather a dynamic, adaptive learning rate approach.
Therefore, this thread should be interesting to you:
I find your thoughts interesting… Do you have particular cost functions in mind where you see this method being especially useful compared to other gradient-based optimizers?
Please let me know what you think.
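As a rough illustration, here is a minimal NumPy sketch of one Adam step, plus a toy loop that additionally shrinks alpha with an inverse-time schedule, i.e. the kind of combination asked about. The function names, the toy cost f(w) = sum(w^2), the hyperparameter values, and the choice to decay per step rather than per epoch are all assumptions made for the example.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    v = beta1 * v + (1.0 - beta1) * dw        # Momentum: average of gradients
    s = beta2 * s + (1.0 - beta2) * dw ** 2   # RMSprop: average of squared gradients
    v_hat = v / (1.0 - beta1 ** t)            # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

def run_adam_with_decay(w, steps=200, alpha0=0.1, decay_rate=0.01):
    """Minimise the toy cost f(w) = sum(w**2) with Adam and a decaying alpha.

    Hypothetical helper: the schedule is applied per step here, which is
    an assumption for this sketch, not a recommendation.
    """
    v = np.zeros_like(w)
    s = np.zeros_like(w)
    for t in range(1, steps + 1):
        dw = 2.0 * w                               # gradient of sum(w**2)
        alpha_t = alpha0 / (1.0 + decay_rate * t)  # schedule-based LRD
        w, v, s = adam_step(w, dw, v, s, t, alpha=alpha_t)
    return w

w_start = np.array([1.0, -2.0])
w_final = run_adam_with_decay(w_start)
```

The adaptive part (dividing by the square root of `s_hat`) and the schedule (`alpha_t`) act independently, so nothing prevents using both; whether the extra schedule helps depends on the problem.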
Best regards
Christian
Hi Christian, thanks for the link, in which I see your comment aligns with my thinking that the RMSprop part of Adam can be viewed as dynamic LRD. I don’t have particular cost functions in mind; in fact I just have the noted programming exercise / assignment / lab in mind where LRD kind of obliterated the relevance of the effects of the more sophisticated techniques. This seems “somewhat” counterintuitive to me but I also think the lab was just choosing some values to see what happens with “good” LRD selection. I’m trying to tell if Adam (or even just RMSprop) is generally a better choice than LRD, or if it’s better to try a few LRD values / functions and hope you get lucky on convergence and loss.
My take on this is that, unfortunately, there is no single "silver bullet" that solves all optimization problems. In my experience, it depends on the data and on the problem you are solving, which together determine the cost function, and different optimizers have different strengths and weaknesses on different cost functions.
Here you can find a nice overview of some popular optimizers: An updated overview of recent gradient descent algorithms – John Chen – ML at Rice University
I like to think of optimizers as tools with different levels of complexity and different features that help you solve your business problem well enough; see also this thread.
Best regards
Christian