Cost function: why mean squared error instead of least squares?

I was surprised to see that Dr. Ng leads us to the mean squared error cost function instead of least squares. When learning about linear regression in a stats context as the “line of best fit,” I was taught the “residual sum of squares” and the “least squares” procedure.

Why is “mean squared error” used here instead of “least squares”? Would the w and b computed by the mean squared error approach be the same as the w and b computed by a least squares approach?

Computing w and b for least squares seems pretty straightforward, “formulaic,” and procedural… is it slower or faster than using gradient descent and the mean squared error?

Hi Joe @ybakos,

This specialization is preparing learners to work with deep neural networks, which in general can’t be solved by the least squares method. The gradient descent method, on the other hand, can be used on a neural network, and so we learn gradient descent :wink:


PS: Mean squared error is just one of the many objective functions you may use with the gradient descent method.
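To the question of whether the two approaches agree: for plain linear regression, yes. The mean squared error is just the residual sum of squares scaled by a constant (1/m, or 1/2m in the lecture notation), and a constant factor doesn’t change where the minimum is, so both procedures land on the same w and b. A quick NumPy sketch comparing the closed-form least squares solution against gradient descent on the MSE (toy data; the learning rate and iteration count are arbitrary choices for this example, not course values):

```python
import numpy as np

# Toy 1-D data: y ≈ 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=50)

# Least squares ("normal equation"): solve for [w, b] in closed form
X = np.column_stack([x, np.ones_like(x)])
w_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Gradient descent on the mean squared error cost J = (1/2m) * sum(err^2)
w, b = 0.0, 0.0
alpha = 0.01          # learning rate (hypothetical choice)
for _ in range(20000):
    err = (w * x + b) - y
    w -= alpha * (err * x).mean()   # partial derivative dJ/dw
    b -= alpha * err.mean()         # partial derivative dJ/db

print(w_ls, b_ls)  # closed-form least squares fit
print(w, b)        # gradient descent converges to essentially the same values
```

On speed: for small problems the closed-form solve is typically faster, but it doesn’t generalize to neural networks, which is why the course teaches gradient descent.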