I was surprised to see Dr. Ng leads us to the mean squared error cost function instead of least squares. When learning about linear regression in a stats context or “line of best fit,” I was taught the “residual sum of squares” and the “least squares” procedure.

Why is “mean squared error” used here instead of “least squares”? Would the w and b computed by the mean squared error approach be the same as the w and b computed by a least squares approach?

Computing w and b for least squares seems pretty straightforward, “formulaic,” and procedural… is it slower or faster than using gradient descent and the mean squared error?