Why Gradient Descent is required

We can differentiate the cost function and find the parameters by solving the equations obtained by setting the partial derivative with respect to every parameter to zero, and thereby find where the cost function is minimized. I also think it's possible for there to be multiple places where the derivatives are zero, so we could check all such places and find the global minimum.

Why is gradient descent performed instead?

@prtata, that’s a good question. We all wrestle with it for a time.

If we could find a closed-form expression for the derivatives that we could solve to find where they are zero, we would do that. But such closed forms don't exist for anything realistic.
We need an iterative mechanism, and gradient descent (with its many variations you’ll see in course 2) is a good choice for doing that iteration. The general approach of finding a loss/cost function and then minimizing that cost/loss through gradient descent is used in many machine learning methods.
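To make the iterative idea concrete, here is a minimal sketch (the quadratic cost, learning rate, and variable names are my own illustrative choices, not anything from the course):

```python
# Minimal gradient descent sketch on a 1-D quadratic cost J(w) = (w - 3)**2.
# Instead of solving dJ/dw = 0 analytically, we repeatedly step downhill.

def grad(w):
    # dJ/dw for J(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0        # arbitrary initial guess
alpha = 0.1    # learning rate (illustrative choice)
for _ in range(100):
    w -= alpha * grad(w)  # step opposite the gradient

print(w)  # converges toward the minimizer w = 3
```

Of course, for this toy cost you could just solve 2(w - 3) = 0 by hand; the point is that the loop above needs nothing but first derivatives, so it still works when no closed-form solution exists.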


My two cents.
In school, when we want to find the minimum of a function whose gradient can be calculated, we normally don't apply GD. Instead, we find the minimum analytically by solving directly for where the gradient is 0. But in scenarios with complex calculations, where we have to treat it as an optimization problem, GD is really the faster way to solve it!
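One classic case where the analytic route does work is ordinary least squares: setting the gradient to zero gives the normal equation, which can be solved directly. A small sketch (the data and names here are my own toy example):

```python
import numpy as np

# Toy least-squares problem: find theta minimizing ||X @ theta - y||^2.
# Setting the gradient to zero gives the normal equation
#   X.T @ X @ theta = X.T @ y,
# which is solved directly, with no iteration at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta  # noiseless targets, so the fit is exact

theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta up to floating-point error
```

For a linear model this works because the gradient equations are themselves linear in the parameters. For a neural network they are not, and that is exactly where the closed-form route breaks down.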

Exactly: the "set the derivative to zero and solve" method doesn't help. It just makes things more complicated, because you now have another equation that can't be solved in closed form (as Gordon pointed out). So you need another "iterative approximation" method, like the multidimensional equivalent of Newton-Raphson. But if you think about it, that means you need the second derivatives of the cost. So it's just making the problem more complicated. Doing direct GD on the cost is more straightforward.
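To illustrate that point in one dimension (a toy cost of my own choosing, not from the course): running Newton-Raphson on J'(w) = 0 requires J'' at every step, while GD gets by with J' alone.

```python
# Toy cost J(w) = w**4 - 3*w**2, with a local minimizer at w = sqrt(3/2).
# Newton-Raphson on J'(w) = 0 needs the second derivative; GD does not.

def J1(w):  # first derivative: 4w^3 - 6w
    return 4 * w**3 - 6 * w

def J2(w):  # second derivative: 12w^2 - 6 (only Newton needs this)
    return 12 * w**2 - 6

w_newton = 2.0
for _ in range(20):
    w_newton -= J1(w_newton) / J2(w_newton)  # Newton step

w_gd = 2.0
for _ in range(200):
    w_gd -= 0.01 * J1(w_gd)  # plain gradient descent step

print(w_newton, w_gd)  # both approach sqrt(3/2) ≈ 1.2247
```

In n dimensions the second derivative becomes the n-by-n Hessian matrix, which is what makes the Newton route so expensive for networks with millions of parameters.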

The OP is correct that we have to worry about local minima, saddle points, and the like, but it turns out that the mathematics works in our favor here. There is a paper from Yann LeCun's group showing that for networks that are sufficiently complex, there is a range of good solutions that GD can find.