Why use Gradient Descent?

In classic linear regression courses such as econometrics, we use OLS rather than Gradient Descent. I just wanted to ask why we use Gradient Descent here instead of OLS or another approach, such as Lagrange multipliers.

The problem is that OLS is not a general method: it works for linear regression, but not for full neural networks or for classification problems, where the loss function is not based on Euclidean distance. Linear regression does have a closed-form solution, the Normal Equation, but computing it requires solving a system involving the d×d matrix XᵀX, which costs roughly O(d³) for d features (plus O(nd²) to form the matrix). With a sufficiently large number of parameters, Gradient Descent can actually be more efficient.
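Here is a minimal sketch of the two approaches side by side on synthetic data (the data, coefficients, learning rate, and iteration count are all illustrative choices, not anything from the course): the Normal Equation solves for the weights in one linear-algebra step, while Gradient Descent iterates toward the same minimizer of the mean-squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression data: y = 2 + 3x + noise.
# First column of X is all ones (the bias/intercept term).
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]
true_theta = np.array([2.0, 3.0])
y = X @ true_theta + rng.normal(0, 0.1, 100)

# Normal Equation: solve (X^T X) theta = X^T y in closed form.
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient Descent on the same mean-squared-error loss.
theta_gd = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = (2 / len(y)) * X.T @ (X @ theta_gd - y)  # gradient of MSE
    theta_gd -= lr * grad

print(theta_ols)  # both estimates land near [2, 3]
print(theta_gd)
```

The point is that both methods minimize the same convex loss, so they agree; the difference is purely computational, and the O(d³) solve becomes the bottleneck as the number of features grows.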

Gradient Descent is a general method that applies in all these cases. We are just getting started on Neural Networks here, and there is much more to learn: they come in lots of different architectures and can be used to address many different types of problems.
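To see the generality, consider logistic regression: its cross-entropy loss has no closed-form minimizer at all, so an iterative method like Gradient Descent is the standard tool. A minimal sketch on synthetic data (the coefficients, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic binary labels drawn from a logistic model.
X = np.c_[np.ones(200), rng.normal(0, 1, 200)]
true_w = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)

# No Normal Equation exists for this loss; iterate instead.
w = np.zeros(2)
lr = 0.5
for _ in range(3000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)  # gradient of mean cross-entropy
    w -= lr * grad

print(w)  # roughly recovers true_w, up to sampling noise
```

Notice the loop is identical in shape to the linear-regression case; only the gradient formula changes. That is exactly why gradient-based optimization generalizes to neural networks, where the gradient is supplied by backpropagation.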

I am not familiar with Lagrange Multipliers, but here’s a tutorial from Jason Brownlee’s website about how they might apply in ML contexts. But you can assume that Prof Ng, Yann LeCun, Geoff Hinton, and all their grad students over the years know a lot of mathematics, and there is a reason why the field in general uses gradient-based methods like Gradient Descent (and more sophisticated relatives such as Conjugate Gradient or Adam) to solve the optimization problems here. In other words, it is not just that they haven’t thought of the other methods you mention.