If all the optimizer algorithms use gradient descent at their core, why can't we have only one?
Is the answer somewhat related to this?
Hi @tbhaxor,
Optimizers have a long history and plenty of applications, including classical numerical use cases such as finite element simulations.
Note that there are also gradient-free optimizers that are often used for hyperparameter optimization, see also this thread: Question about optimizers - #2 by Christian_Simonis
To answer your question: even among gradient-based optimizers there are differences, sometimes fine nuances, that make them better suited to different optimization problems. There are certainly popular optimizers in ML, but they have different strengths and weaknesses, and they might even be designed for dedicated hardware or for different complexity levels, see also: Why not always use Adam optimizer - #4 by Christian_Simonis
Optimizers are also an active field of research, and I expect further improvements in the coming years.
Hope this answers your question, @tbhaxor.
Best regards
Christian
This precisely answered my query. I will get back to you if I need more info in the future. Thanks @Christian_Simonis
You are welcome!
Sure, don't hesitate to ask if you have any further questions, @tbhaxor!
Best regards
Christian
Hi @tbhaxor,
While all optimizer algorithms use gradient descent at their core, they also incorporate various techniques to improve the performance and stability of the optimization process. These techniques include adjusting the learning rate, using momentum, and adopting adaptive methods such as Nesterov momentum, Adagrad, and Adam. Each algorithm has its own strengths and weaknesses and is suited to different types of problems and datasets. Additionally, some optimizer algorithms are more computationally efficient than others, which can matter for large-scale machine learning models.
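To make that concrete, here is a minimal NumPy sketch (my own toy example, not from the course) of the plain gradient descent, momentum, and Adam update rules on a simple ill-conditioned quadratic; the objective, step counts, and hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w^T A w with an ill-conditioned A, so plain
# gradient descent is forced to use a small step and converges slowly
# along the shallow direction.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w  # analytic gradient of the toy objective

def sgd(w, lr=0.03, steps=100):
    # Plain gradient descent: apply the raw gradient directly.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.03, beta=0.9, steps=100):
    # Momentum: apply an exponentially weighted average of past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

def adam(w, lr=0.03, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # Adam: momentum plus a per-parameter scaling by the RMS of past gradients.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
print("SGD:     ", sgd(w0))
print("Momentum:", momentum(w0))
print("Adam:    ", adam(w0))
```

All three use the same analytic gradient of the toy objective; what differs is the transformation (averaging, per-parameter scaling) applied before the step, which is exactly why different optimizers behave differently on the same problem.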
Hope this answers your question, @tbhaxor.
Best regards
Muhammad John Abbas
Thanks for your message and welcome to the community!
As an addition:
Gradient-free methods can help not only when gradients are difficult to compute, but also when you face a highly non-convex optimization problem with many local minima, most of which are not good enough to solve your business problem, even though such methods are generally worse performance-wise than gradient-based ones.
Some popular examples of gradient-free optimizers are:
By the way: the introduction of this paper also gives a nice overview of gradient-free and gradient-based optimizers: https://www.sciencedirect.com/science/article/pii/S0021999121006835
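As a small illustration (my own sketch, not taken from the paper), here is one well-known gradient-free optimizer, the Nelder-Mead simplex method as implemented in SciPy, applied to the highly non-convex Rastrigin function; the test function and starting point are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize

# Rastrigin function: highly non-convex with many local minima, the kind of
# landscape where gradient-free methods are often tried.
def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

x0 = np.array([2.5, -1.5])  # arbitrary starting point

# Nelder-Mead only evaluates the objective; it never needs a gradient.
# From this start it will typically land in a nearby local minimum.
result = minimize(rastrigin, x0, method="Nelder-Mead")
print(result.x, result.fun)
```

Because it only ever evaluates the objective, this kind of method works even when no gradient is available, though, as noted above, it usually scales worse than gradient-based approaches.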
Best regards
Christian
Thanks for your addition @Christian_Simonis
Sure, you are welcome @Muhammad_John_Abbas.
This thread could be interesting for you, too:
Feel free to check it out and let us know if anything is unclear.
Best regards
Christian
There have already been some great answers on this thread, but I think you can state it even a bit more simply like this:
The point is that gradient descent is just the method for applying the gradients. The question is what values you actually use as the gradients in that update step. The different optimizers use different techniques for computing those effective gradients rather than just directly applying the raw derivatives. E.g. some of them, like RMSprop and Adam, use various smoothing and exponential-average techniques to deal with cases in which convergence is difficult because of the extreme complexity of the solution surfaces.
Prof Ng discusses all this in the lectures. As with almost everything in ML/DL, there is no one single "magic bullet" answer that works the best in every case. Otherwise we wouldn't need to learn about all the different techniques that Prof Ng is showing us in these courses. It takes knowledge and experience to figure out the best solution for each particular case.
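For concreteness, here is a minimal sketch (my own, with made-up gradient values and hyperparameters) of the RMSprop-style rescaling: the quantity actually applied in the descent step is the raw derivative divided by a running root-mean-square of past gradients:

```python
import numpy as np

# One RMSprop-style step on an arbitrary gradient, to show that the value
# actually applied is a transformed version of the raw derivative.
def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    # Exponentially weighted average of the squared gradients.
    s = beta * s + (1 - beta) * g**2
    # The "gradient" that gets applied is the raw derivative rescaled
    # per parameter by the root of that running average.
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s

w = np.array([0.5, -0.3])
s = np.zeros_like(w)
raw_g = np.array([0.02, 4.0])   # very different scales across parameters

w, s = rmsprop_step(w, raw_g, s)
print(w)  # both parameters move by a comparable amount despite the raw scales
```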
Thank you Paul. I am always impressed by your answers