If all the optimizer algorithms use gradient descent at their core, why can't we have only one?
Is the answer somewhat related to this?
Hi @tbhaxor,
Optimizers have a long history and plenty of applications, including classical numerical use cases such as finite element simulations.
Note that there are also gradient-free optimizers that are often used for hyperparameter optimization, see also this thread: Question about optimizers - #2 by Christian_Simonis
To answer your question: even among gradient-based optimizers there are differences, sometimes fine nuances, that make them better suited to different optimization problems. There are certainly popular optimizers in ML, but they have different strengths and weaknesses, and they might even be designed for dedicated hardware or for different complexity levels, see also: Why not always use Adam optimizer - #4 by Christian_Simonis
Optimizers are also an active field of research, and I expect further improvements in the coming years.
Hope this answers your question, @tbhaxor.
Best regards
Christian
This precisely answered my query. I will get back to you if I need more info in the future. Thanks @Christian_Simonis
You are welcome!
Sure, don't hesitate to ask if you have any further questions, @tbhaxor!
Best regards
Christian
Hi @tbhaxor,
While all optimizer algorithms use gradient descent at their core, they also incorporate various techniques to improve the performance and stability of the optimization process. These techniques include adjusting the learning rate, using momentum, and adopting adaptive methods such as Nesterov momentum, Adagrad, and Adam. Each algorithm has its own strengths and weaknesses and is suited to different types of problems and datasets. Additionally, some optimizer algorithms are more computationally efficient than others, which can matter for large-scale machine learning models.
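To make that concrete, here is a minimal NumPy sketch (my own toy example, not from the course) of the plain gradient descent, momentum, and Adam update rules on a simple ill-conditioned quadratic; the objective, step counts, and hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w^T A w with an ill-conditioned A, so plain
# gradient descent is forced to use a small step and converges slowly
# along the shallow direction.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w  # analytic gradient of the toy objective

def sgd(w, lr=0.03, steps=100):
    # Plain gradient descent: apply the raw gradient directly.
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.03, beta=0.9, steps=100):
    # Momentum: apply an exponentially weighted average of past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

def adam(w, lr=0.03, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    # Adam: momentum plus a per-parameter scaling by the RMS of past gradients.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
print("SGD:     ", sgd(w0))
print("Momentum:", momentum(w0))
print("Adam:    ", adam(w0))
```

All three use the same analytic gradient of the toy objective; what differs is the transformation (averaging, per-parameter scaling) applied before the step, which is exactly why different optimizers behave differently on the same problem.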
Hope this answers your question, @tbhaxor.
Best regards
Muhammad John Abbas
Thanks for your message and welcome to the community!
As an addition:
Gradient-free methods can help not only when gradients are difficult to compute, but also when you face a highly non-convex optimization problem with many local minima, most of which are not good enough to solve your business problem, even though such methods are generally worse performance-wise than gradient-based ones.
Some popular examples of gradient-free optimizers are:
By the way: the introduction of this paper also gives a nice overview of gradient-free and gradient-based optimizers: https://www.sciencedirect.com/science/article/pii/S0021999121006835
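As a small illustration (my own sketch, not taken from the paper), here is one well-known gradient-free optimizer, the Nelder-Mead simplex method as implemented in SciPy, applied to the highly non-convex Rastrigin function; the test function and starting point are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize

# Rastrigin function: highly non-convex with many local minima, the kind of
# landscape where gradient-free methods are often tried.
def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

x0 = np.array([2.5, -1.5])  # arbitrary starting point

# Nelder-Mead only evaluates the objective; it never needs a gradient.
# From this start it will typically land in a nearby local minimum.
result = minimize(rastrigin, x0, method="Nelder-Mead")
print(result.x, result.fun)
```

Because it only ever evaluates the objective, this kind of method works even when no gradient is available, though, as noted above, it usually scales worse than gradient-based approaches.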
Best regards
Christian
Thanks for your addition @Christian_Simonis
Sure, you are welcome @Muhammad_John_Abbas.
This thread could be interesting for you, too:
Feel free to check it out and let us know if anything is unclear.
Best regards
Christian
There have already been some great answers on this thread, but I think you can state it even a bit more simply like this:
The point is that gradient descent is just the method for applying the gradients. The question is what values you actually use as the gradients in that update step. The different optimizers use different techniques for computing those effective gradients rather than just directly applying the raw derivatives. E.g. some of them, like RMSprop and Adam, use various smoothing and exponential-average techniques to deal with cases in which convergence is difficult because of the extreme complexity of the solution surfaces.
Prof Ng discusses all this in the lectures. As with almost everything in ML/DL, there is no one single "magic bullet" answer that works the best in every case. Otherwise we wouldn't need to learn about all the different techniques that Prof Ng is showing us in these courses. It takes knowledge and experience to figure out the best solution for each particular case.
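For concreteness, here is a minimal sketch (my own, with made-up gradient values and hyperparameters) of the RMSprop-style rescaling: the quantity actually applied in the descent step is the raw derivative divided by a running root-mean-square of past gradients:

```python
import numpy as np

# One RMSprop-style step on an arbitrary gradient, to show that the value
# actually applied is a transformed version of the raw derivative.
def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    # Exponentially weighted average of the squared gradients.
    s = beta * s + (1 - beta) * g**2
    # The "gradient" that gets applied is the raw derivative rescaled
    # per parameter by the root of that running average.
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s

w = np.array([0.5, -0.3])
s = np.zeros_like(w)
raw_g = np.array([0.02, 4.0])   # very different scales across parameters

w, s = rmsprop_step(w, raw_g, s)
print(w)  # both parameters move by a comparable amount despite the raw scales
```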
Thank you Paul. I am always impressed by your answers