In the cost function formula video, Prof. Ng says:

In machine learning different people will use different cost functions for different applications, but the squared error cost function is by far the most commonly used one for linear regression and for that matter, for all regression problems where it seems to give good results for many applications.

Why is the squared error the most popular cost function for regression? What are the other alternatives, and why are they not as popular?

There are other cost functions, such as Root Mean Squared Error (RMSE, just the square root of MSE) and Mean Absolute Error (MAE, the mean of abs(y_true - y_predicted)). MAE is less sensitive to large errors, whereas MSE squares each error term, so the larger the error, the larger the MSE.
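A quick sketch of that sensitivity difference, using hypothetical data where one prediction is far off (the values here are made up for illustration):

```python
import numpy as np

# Hypothetical data: the last prediction is a large error (an "outlier").
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 2.9, 10.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)     # squares the errors: the outlier dominates
mae = np.mean(np.abs(errors))  # linear in the errors: the outlier weighs less
rmse = np.sqrt(mse)            # same units as y; same minimizer as MSE

print(f"MSE  = {mse:.4f}")   # 9.0075 - almost all of it from the one 6.0 error
print(f"MAE  = {mae:.4f}")   # 1.5750
print(f"RMSE = {rmse:.4f}")  # 3.0012
```

The single error of 6.0 contributes 36 to the sum of squares but only 6 to the sum of absolute errors, which is exactly why MSE reacts so strongly to large mistakes.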

I am absolutely not the right person to talk about history… But I do think there is a historical reason why squared error is so popular. A method called “least squares” was invented as early as 200 years ago (see Wikipedia; there is even a survey paper from 1887), and the idea behind it is basically the combination of linear regression and squared loss that we are learning today.

One reason neural networks are popular today is that we have the computational power to process big data, and the same rationale applies to least squares, although in the completely opposite direction. The nice mathematical properties of least squares allowed people to calculate the weights even without computers: only basic arithmetic operations are needed. Remember, we are talking about more than 100 years ago.
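Concretely, the squared loss gives linear regression a closed-form solution, the normal equation w = (XᵀX)⁻¹Xᵀy, which reduces to sums, products, and solving a small linear system: the kind of arithmetic that is doable by hand. A minimal sketch on made-up data that follows y = 2x + 1 exactly:

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (no noise).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

X = np.column_stack([np.ones_like(x), x])  # add an intercept column
w = np.linalg.solve(X.T @ X, X.T @ y)      # normal equation: (X^T X) w = X^T y
print(w)  # intercept ~ 1, slope ~ 2
```

No iterative optimization is involved; with MAE there is no such closed form, and you would need an iterative method instead.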

I hope someone who knows the history better can tell the story properly… All in all, I think squared error is popular today because of its wide range of applications over the years, back when computers were not yet powerful or widespread.