Why would the squared error function give us multiple local minima?
This is a convexity problem. For instance, you can plot the function f(x) = (x*sin(x))**2, which is not convex and has many local minima.
Applying gradient descent to such a function can be troublesome. It is still possible here, but that would be cheating a little, because we already know where the minimum is and the function has only one variable, not thousands.
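To see the dependence on the starting point, here is a minimal sketch of plain gradient descent on f(x) = (x*sin(x))**2. The learning rate, step count and starting points are my own choices for illustration, not values from the course. For this particular f every local minimum sits at a multiple of pi with f = 0, so the example shows non-convexity (several separate basins) rather than "bad" minima.

```python
import math

def f(x):
    return (x * math.sin(x)) ** 2

def grad_f(x):
    # d/dx (x*sin(x))**2 = 2 * (x*sin(x)) * (sin(x) + x*cos(x))
    return 2 * x * math.sin(x) * (math.sin(x) + x * math.cos(x))

def gradient_descent(x0, lr=0.001, steps=5000):
    # lr and steps are illustrative assumptions, small enough to stay stable here
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Three arbitrary starting points that lie in three different basins
for x0 in (2.5, 5.5, 8.5):
    x_final = gradient_descent(x0)
    print(f"start={x0:.1f} -> x={x_final:.4f}, f(x)={f(x_final):.2e}")
```

With these starts the runs should settle near pi, 2*pi and 3*pi respectively: same algorithm, same function, three different answers. With thousands of parameters you cannot just plot the function to check which basin you ended up in.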
This was the motivation given in the course for using the logistic loss function instead, which leads to a convex minimization problem.
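A rough numerical check of that claim, under assumptions of my own: a sigmoid hypothesis, one weight w, and a single training example (x = 1, y = 1). The helper names (`squared_error`, `log_loss`, `second_difference`) are hypothetical, and the second derivative is only estimated by finite differences on a small grid. A convex function never has negative curvature, so a single negative value is enough to show non-convexity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(w, x=1.0, y=1.0):
    # squared error applied to the sigmoid output
    return (sigmoid(w * x) - y) ** 2

def log_loss(w, x=1.0, y=1.0):
    # logistic (cross-entropy) loss for one example
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def second_difference(fn, w, h=1e-3):
    # central finite-difference estimate of the second derivative
    return (fn(w + h) - 2 * fn(w) + fn(w - h)) / h ** 2

grid = [i / 2 for i in range(-12, 13)]  # w from -6 to 6 in steps of 0.5
sq_curv = [second_difference(squared_error, w) for w in grid]
ll_curv = [second_difference(log_loss, w) for w in grid]

print("squared error, min curvature on grid:", min(sq_curv))
print("log loss,      min curvature on grid:", min(ll_curv))
```

On this grid the squared error should show negative curvature for sufficiently negative w, while the log loss stays positive everywhere. This one-example case only demonstrates the loss of convexity; it does not by itself exhibit several distinct local minima, which need more data points to appear.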