DLS Course 2 W2 - Why is Steepest Descent never used as an optimization algorithm?

Hi, I am a beginner in deep learning. I am currently at DLS Course 2. I took a non-linear operations research course where we used the steepest descent algorithm for faster convergence. Here is the pseudo code for steepest descent:
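
(Roughly, as a Python-style sketch; `J`, `grad_J`, and the candidate step sizes are placeholders, and the line search here is a simple grid search over alpha in [0, 1], as I describe below.)

```python
import numpy as np

def steepest_descent(J, grad_J, w0, num_iterations=350,
                     alphas=np.linspace(0.0, 1.0, 101)):
    """Gradient descent with a per-iteration line search over the step size.

    J      : cost function, J(w) -> float
    grad_J : gradient of the cost, grad_J(w) -> array shaped like w
    w0     : initial parameters (numpy array)
    alphas : candidate learning rates tried at every iteration
    """
    w = w0.copy()
    for _ in range(num_iterations):
        g = grad_J(w)
        # Pick the learning rate that minimizes the cost along the
        # negative gradient direction at the current point.
        best_alpha = min(alphas, key=lambda a: J(w - a * g))
        w = w - best_alpha * g
    return w
```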

It is the same as basic gradient descent, but instead of a constant learning rate (alpha), steepest descent finds the learning rate that minimizes the cost function along the negative gradient direction in every iteration. With the optimum step size, the algorithm converges to the optimum much faster. I also implemented this algorithm in Course 1 Week 4’s assignment. The algorithm tries to find the best learning rate between 0 and 1 in every iteration. Since it optimizes the learning rate, each iteration takes longer than basic gradient descent, but it converges in fewer iterations. For example, in Week 4’s assignment, gradient descent converges to a cost of 0.08… in 2500 iterations and 13 minutes, while steepest descent reaches a cost of 0.08… in only 350 iterations and 8 minutes.

I searched the web about steepest descent in neural networks, but I couldn’t find an answer to my question. Why is steepest descent never used as an optimization algorithm in neural networks?

Welcome to the community.

As you know, “steepest descent” is part of the “gradient descent” family, which has well-known problems with non-convex optimization.

Source: Pradyumna Yadav, “The Journey of Gradient Descent — From Local to Global”, Analytics Vidhya, Medium.

The problems are:

  • We may fall into a local minimum instead of the global minimum.
  • At a saddle point, the gradient becomes 0, so we cannot escape from there (see the small example below).
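
For instance, here is a hypothetical toy cost f(x, y) = x^2 - y^2 (not from the course), just to show why a pure gradient step gets stuck at a saddle point:

```python
import numpy as np

# Hypothetical toy cost with a saddle point at the origin: f(x, y) = x^2 - y^2
def f(w):
    x, y = w
    return x**2 - y**2

def grad_f(w):
    x, y = w
    return np.array([2 * x, -2 * y])

w = np.array([0.0, 0.0])   # start exactly at the saddle point
g = grad_f(w)
print(g)                   # prints a zero gradient: the slope vanishes here
# Any update w - alpha * g leaves w unchanged for every alpha,
# so gradient-based steps (including steepest descent) cannot move away.
```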

To overcome these problems, several optimization algorithms have been published.
Neural network frameworks like TensorFlow provide those optimization algorithms for your use. Faster convergence is one factor, but if an algorithm does not reach a desirable solution, it may not be selected for practical use. That is why advanced optimization algorithms like Adam, SGD, etc., have become more popular for non-convex optimization.
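
As a minimal sketch (assuming TensorFlow 2.x and a made-up two-layer model, not one from the course), selecting one of these built-in optimizers looks like this:

```python
import tensorflow as tf

# Hypothetical tiny binary classifier, just to show the optimizer selection.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # adaptive per-parameter steps
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Adam adapts the effective step size per parameter from running gradient statistics, which is much cheaper than re-running a line search at every iteration and tends to cope better with saddle regions.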