Gradient Descent vs Newton's Method

Do practitioners ever use Newton’s method to minimize a cost function?
(Or does this method only work for ‘nice’ functions?)
I would imagine that Newton’s method would converge to the minimum in fewer steps than gradient descent, without the need to search for a good learning rate.

It’s an interesting question that comes up fairly often. If you took calculus, you may remember that to find an extremum you can set the derivative to zero and solve. For a realistic cost function, though, that equation has no closed-form solution, so you have to resort to something like Newton-Raphson in multiple dimensions. But think about what that means: now you need the second derivatives of the cost surface (the Hessian) in order to find the zeros of the first derivative. For a model with n parameters the Hessian is an n × n matrix, so computing and inverting it quickly becomes impractical as n grows. It just ends up making things more complicated and doesn’t really give you any advantage.
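As a rough illustration of the trade-off, here is a toy sketch (the quadratic cost and all the numbers are made up for this example, not taken from any course material). On a quadratic, Newton's method jumps to the minimum in a single step because the Hessian captures the curvature exactly, while gradient descent with a fixed learning rate needs many iterations; the catch is that the Newton step requires forming and solving a linear system with the Hessian:

```python
import numpy as np

# Hypothetical convex cost: f(w) = (w0 - 1)^2 + 10 * (w1 + 2)^2
# Minimum is at w = [1, -2].
def grad(w):
    return np.array([2 * (w[0] - 1), 20 * (w[1] + 2)])

# For a quadratic cost the Hessian is a constant matrix.
H = np.array([[2.0, 0.0],
              [0.0, 20.0]])

# Gradient descent: many small steps with a hand-picked learning rate.
w_gd = np.zeros(2)
for _ in range(100):
    w_gd = w_gd - 0.04 * grad(w_gd)

# Newton's method: one step, but it requires solving a system with the Hessian.
w_newton = np.zeros(2) - np.linalg.solve(H, grad(np.zeros(2)))

print(w_gd)      # close to [1, -2] after 100 steps
print(w_newton)  # exactly [1, -2] in a single step
```

For two parameters this is trivial, but that `np.linalg.solve(H, ...)` call is the part that scales as roughly O(n³) in the number of parameters, which is why the one-step elegance doesn't carry over to large models.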

Here’s a previous discussion of this point in the context of MLS. Here’s one from DLS. You can find more by searching for “Newton” or “Newton-Raphson”.


Not in my experience.