More advanced Optimization Algorithms are not discussed - why?

https://www.coursera.org/learn/deep-neural-network/home/week/2

Several variants of Gradient Descent (SGD, RMSProp, Adam) are discussed, but
there is no mention of “more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG)”,
as described in “On optimization methods for deep learning” by Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng.

What is the reason for that?

1 Like

That paper is 12 years old. That’s a very long time in the machine learning field.
Apparently those methods are not needed with the current set of tools.

1 Like

As Tom says, a lot can change in 12 years. Not just in software: there’s Moore’s Law to keep in mind as well. What constituted “Limited Memory” in 2012 was quite a bit smaller than it is today. :nerd_face:

1 Like

I am not aware of any research/papers showing the described Gradient Descent variants (Adam, etc.) outperforming L-BFGS/CG (I would appreciate some pointers!). Anyhow, some discussion of the pros and cons of these methods would be a nice addition.

1 Like

Here’s the doc page that shows the list of optimizers provided by TensorFlow. TF is implemented by Google and is one of the state-of-the-art platforms that many people use for implementing ML/DL systems. I have not yet tried doing any literature searches on the question you posed, but I assume that the researchers at Google, from Jeff Dean on down, are aware of the literature. If L-BFGS were better than the other algorithms, why did they not include it in that list? Or maybe it really is there, but it has been further elaborated in the meantime and is now called something different?
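
Just to make that concrete, here is a minimal sketch of how you would pick one of those optimizers in tf.keras (the toy model, shapes, and hyperparameters are made up for illustration); note that nothing L-BFGS-like appears in that list:

```python
import tensorflow as tf

# Toy model purely for illustration; the architecture and input shape are made up.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Any optimizer from the tf.keras.optimizers list can be dropped in here
# (SGD, RMSprop, Adam, ...). There is no L-BFGS entry among them.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=optimizer, loss="mse")
```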

1 Like

I am not advocating for L-BFGS as a “silver bullet” solution. However, methods that try to incorporate second-order information (when possible) typically perform better, and “adaptive” methods that do not require extensive hyperparameter tuning should also be more desirable. Oh well, maybe nobody has bothered recently to do a head-to-head comparison of different optimization methods in the deep learning context.
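
To illustrate the “off-the-shelf, no tuning” point, here is a minimal sketch using SciPy’s L-BFGS on a toy quadratic next to a hand-rolled gradient descent loop (the objective is deliberately trivial and not representative of deep learning):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: a simple quadratic bowl with minimum at w = 3.
def loss(w):
    return 0.5 * np.sum((w - 3.0) ** 2)

def grad(w):
    return w - 3.0

w0 = np.zeros(5)

# Off-the-shelf L-BFGS: builds an approximation of curvature from gradient
# history and chooses its own step sizes, with no learning rate to tune.
result = minimize(loss, w0, jac=grad, method="L-BFGS-B")
print("L-BFGS solution:", result.x)

# Plain gradient descent: requires a hand-chosen learning rate.
w = w0.copy()
for _ in range(100):
    w -= 0.1 * grad(w)
print("GD solution:", w)
```

Even on this toy problem you can see the trade-off: L-BFGS picks its own step sizes from curvature estimates, while plain gradient descent needs a manually tuned learning rate.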

I recommend you read up on the Adam method.
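
For reference, the core Adam update (from the Kingma & Ba paper) is only a few lines; this NumPy sketch uses the commonly quoted default hyperparameters:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adapts the step size per parameter using running
    moment estimates of the gradient, so little manual tuning is needed."""
    m = beta1 * m + (1 - beta1) * g           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: one step on a toy gradient.
w, m, v = np.zeros(1), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, g=np.array([2.0]), m=m, v=v, t=1)
```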

I think that is not a good assumption. Optimizing training performance is a very big deal and will be for the foreseeable future. Even as Moore’s Law gives us faster, more capable hardware every year, the types of models people want to train are always at the hairy edge of what is possible. Look at the history of the parameter counts for GPT-3 and GPT-4 and what is proposed for GPT-5.

You should not assume that people have simply been lazy and ignored L-BFGS, or couldn’t be troubled to look at it. My assumption is that they considered it and decided that other methods work better in general, but more research would be needed to confirm that.

“Bothered” was a bad choice of word on my part. I fully understand that comparing multiple optimization methods on a fairly representative set of problems is not a trivial undertaking, and I in no way intend to suggest that the lack of a recent comprehensive assessment is due to laziness, etc. Rather, the lack of such a comparison illustrates to me the difficulty of the undertaking.