Exactly: the “set the derivative to zero and solve” method doesn’t help here. Setting the gradient to zero just gives you another equation that can’t be solved in closed form (as Gordon pointed out), so you would still need an iterative approximation method, such as the multidimensional equivalent of Newton-Raphson. But applying Newton-Raphson to the gradient requires the second derivatives of the cost (the Hessian), so you have only made the problem more complicated. Running gradient descent directly on the cost is more straightforward.
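
To make the comparison concrete, here is a minimal sketch on a toy problem: logistic regression with a single weight and no bias. The data, learning rate, step counts and function names are all made up for illustration and do not come from the course. Both routines are iterative. The only difference is that the Newton-Raphson version also needs the second derivative of the cost, which becomes a full n x n Hessian once you have n parameters.

```python
import math

# Hypothetical toy data, chosen so the classes are not perfectly separable
# and the cost therefore has a finite minimum.
X = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0]
Y = [0, 0, 1, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad(w):
    # dJ/dw for the mean cross-entropy cost with prediction sigmoid(w * x).
    # Setting this to zero has no closed-form solution for w.
    return sum((sigmoid(w * x) - y) * x for x, y in zip(X, Y)) / len(X)


def second_derivative(w):
    # d^2J/dw^2: only the Newton-Raphson route needs this.
    return sum(sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in X) / len(X)


def gradient_descent(w=0.0, lr=0.5, steps=2000):
    # Direct GD on the cost: first derivatives only.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def newton_raphson(w=0.0, steps=20):
    # Root-finding on grad(w) = 0: needs the second derivative at every step.
    for _ in range(steps):
        w -= grad(w) / second_derivative(w)
    return w


w_gd = gradient_descent()
w_newton = newton_raphson()
print("gradient descent:", w_gd)
print("newton-raphson:  ", w_newton)
```

On this convex one-parameter problem both methods land on the same weight, and Newton-Raphson gets there in far fewer steps. The catch is the cost of each step: with millions of parameters, computing and inverting the Hessian is what makes the "solve for zero gradient" route impractical.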

The OP is correct that we have to worry about local minima, saddle points and the like, but it turns out that the mathematics works in our favor here. There is a paper from Yann LeCun’s group which shows that, for networks that are sufficiently complex, there is a range of good solutions that GD can find.