Why would the squared error function give us multiple local minima?

This is a convexity problem. As an illustration of a non-convex function, you can plot f(x) = (x*sin(x))**2, which has many local minima.
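If you would rather count the minima than plot them, here is a minimal sketch. The interval [-10, 10] and the grid step are my own arbitrary choices; the function is the one above. Since f(x) >= 0 everywhere and f(x) = 0 whenever sin(x) = 0, the minima sit at the multiples of pi.

```python
import math

def f(x):
    # The example function from the post
    return (x * math.sin(x)) ** 2

# Sample f on a fine grid over [-10, 10] (interval and step are arbitrary choices)
step = 0.001
xs = [-10 + i * step for i in range(20001)]
ys = [f(x) for x in xs]

# A grid point is a local minimum if it is lower than its left neighbour
# and not higher than its right neighbour
minima = [
    xs[i]
    for i in range(1, len(xs) - 1)
    if ys[i] < ys[i - 1] and ys[i] <= ys[i + 1]
]

print(len(minima))
print([round(m / math.pi, 2) for m in minima])  # positions in units of pi
```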

Applying gradient descent to such a function can be troublesome. It is still possible here, but that would be cheating a little: we already know where the minimum is, and the function has only one variable, not thousands.
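To see the trouble concretely, here is a rough sketch of plain gradient descent on f(x) = (x*sin(x))**2. The starting points, learning rate and step count are values I picked for illustration. The two runs settle in different minima, so the result depends entirely on where you start.

```python
import math

def grad_f(x):
    # Derivative of (x*sin(x))**2, by the chain and product rules
    return 2 * x * math.sin(x) * (math.sin(x) + x * math.cos(x))

def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain gradient descent with a fixed learning rate
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

a = gradient_descent(2.5)  # settles near pi
b = gradient_descent(5.5)  # settles near 2*pi

print(a / math.pi, b / math.pi)
```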

This was the motivation given in the course for using the logistic loss function instead, which leads to a convex minimization problem.
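Here is a small numerical check of that convexity claim, under assumptions of my own: a sigmoid model with a single weight w and one hypothetical training example (x = 1, y = 1). It estimates the second derivative of each loss with finite differences. A convex function never has negative curvature, so a negative value is enough to show non-convexity. With one example the squared error still has no extra minima, but losing convexity is what allows them to appear once many examples are summed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(w):
    # Squared error of the sigmoid prediction for the example x = 1, y = 1
    return (sigmoid(w) - 1.0) ** 2

def logistic_loss(w):
    # -log(sigmoid(w)) for the same example, written in a stable form
    return math.log1p(math.exp(-w))

def second_differences(fn, ws, h):
    # Central finite-difference estimate of the second derivative
    return [(fn(w + h) - 2 * fn(w) + fn(w - h)) / h ** 2 for w in ws]

h = 0.01
ws = [-6 + i * h for i in range(1201)]  # weights from -6 to 6

sq_curv = second_differences(squared_error, ws, h)
log_curv = second_differences(logistic_loss, ws, h)

print(min(sq_curv) < 0)   # squared error bends downward somewhere: not convex
print(min(log_curv) > 0)  # logistic loss curves upward on the whole grid
```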
