# From mean squared error to quadratic equation

In Cost Function Intuition, Dr. Ng plots the training data on the left with a line of best fit. On the right, we plot J(w) for different values of w. What we get on the right is a parabola, where the vertex represents the best w that minimizes J(w).

Since I know how to find the vertex of a parabola given its equation (eg y = ax^2 + bx + c), in theory I could quickly determine the vertex, and therefore the best w. However, how can I determine what that formula is? Is there a handy way to go from the mean squared error formula to the quadratic one?

(My math is super rusty!)


Hi @ybakos
In the next few weeks you will learn how to find the best w and how to minimize the cost function (mean squared error) with an algorithm called gradient descent.

Thanks,
Abdelrahman


Yep, I am aware of gradient descent.

My question is, is there a way to find the equation of the parabola? If I had that, then I can quickly find the vertex of the parabola. At least, in a single-variable problem.

Sure, but the whole point here is that the “single variable” case is not interesting for us. The “variables” in our case are all the “parameters” of the model. That is to say all the weight and bias values. In any non-trivial problem there are a lot more than one of those, right?

Yep, I get that single-variable is the simplest situation and not applicable to n-dimensional problems.

I am specifically wondering how, in the single-variable case, we can get from the mean squared error to a quadratic formula of w and J(w).

You’re asking for the formula for J, right? Well, what is that? It’s the average of the squared error over all the training samples, right? You’ve got m samples and each one will be (x_i, y_i) where x_i is the one input value and y_i is the label (correct output) for that sample. So how do you convert that into a quadratic equation in w? It’s going to be a bit of a mess.

For the model prediction = b + W1*X + W2*X^2,

mean squared error = (1/m) * sum((prediction - y)^2)

By gradient descent you can reach the minimum of J and the best W1 and W2 with the update equations:

b  = b  - alpha/m * sum(prediction - y)
W1 = W1 - alpha/m * sum((prediction - y) * X)
W2 = W2 - alpha/m * sum((prediction - y) * X^2)
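
Here is a minimal pure-Python sketch of those updates on a made-up three-point dataset (the data, learning rate alpha, and iteration count are arbitrary choices for illustration, not values from the course):

```python
# Gradient descent for the model  prediction = b + w1*x + w2*x^2,
# minimizing the mean squared error over a tiny made-up dataset.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
m = len(data)
alpha = 0.04          # learning rate (arbitrary choice)
b, w1, w2 = 0.0, 0.0, 0.0

for _ in range(50_000):
    gb = gw1 = gw2 = 0.0
    for x, y in data:
        err = (b + w1 * x + w2 * x * x) - y   # prediction - y
        gb += err
        gw1 += err * x
        gw2 += err * x * x
    # the update equations from the post above
    b -= alpha / m * gb
    w1 -= alpha / m * gw1
    w2 -= alpha / m * gw2

cost = sum(((b + w1 * x + w2 * x * x) - y) ** 2 for x, y in data) / m
```

With three points and three parameters, a parabola through the data exists, so the cost approaches zero as the parameters converge.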

Well, I could say that we already have what you want under certain assumptions.

Check this out:

The MSE is (1/m) * SUM[(y - yhat)**2]

Take the quadratic form: ax**2 + bx + c

Let's replace x with (y - yhat):

We arrive at: a(y - yhat)**2 + b(y - yhat) + c

Let's assume a = 1, b = 0, c = 0. Under these assumptions we have:

1*(y - yhat)**2 + 0*(y - yhat) + 0 = (y - yhat)**2

Plug that into the average above:

(1/m) * SUM[(y - yhat)**2]

So we are back to our standard formula.

Oh, interesting!

But, to find the x component of a vertex of a parabola in the form ax^2 + bx + c, I can just compute -b / 2a.

If I may ask, is it not possible to determine concretely what a and b are, given only the training set and the ability to compute the mean squared error?

What I am really after is -b / 2a, of the standard form of the parabola, which would tell me the w that minimizes J(w).

In the simplest scenario, let’s say ax^2 + bx + c lives in the world of scalars. A NN built only with this turns out to be just a linear equation. You can easily get -b/2a in this scenario, where w would simply be a scalar.

Let’s take one step to another scenario, like an n x m image with 3 channels: you could then say that ‘a’ and ‘x’ are matrices, so here we would be talking about AX^2 + BX + C, where A, B, X are matrices. In this case we may not be looking at a parabola in a 2D space anymore, but potentially at a surface in an n-dimensional space with multiple local minima, and at that point the -b/2a formula might not make sense for finding the global minimum.

Hi Joe @ybakos,

Your parabola has to be a function of w in order for you to calculate the solution of w following that quadratic way. If you even attempt to write a neural network’s cost function in terms of w, you will see that in most cases it is not a quadratic equation of w.

However, there is a special case where it is a quadratic in w, which is simple linear regression (e.g. y = wx) together with the mean squared error cost. In this case you can calculate w analytically with a closed-form solution called “the normal equation” (please google this name to see what it looks like).

Next time, when you have a neural network that is not a simple linear regression, try to write out the cost function as a function of w to see if it is a parabola in w.

Raymond

Thanks @rmwkwok . I will look into this more.

The connection I am trying to make is this.

We have a cost function, as shown here:

And we can simplify it further by making b = 0.

As such, we can compute J(w) for any number of w. This is represented by the parabola on the right:

Since we can compute all the points on the parabola on the right using the cost function for different values of w, why / how can I represent the parabola on the right in standard quadratic form, or in vertex form, J(w) = a(w-h)^2 + k?

First, I want to emphasize that your model assumption of

is a linear regression problem, which is the special case I mentioned to be solvable analytically. In general, we cannot solve any neural network analytically.

I said it is important for you to write the cost function out in w to see it is a parabola in w, but then you ask me to show you that. I will give you the first step but you will need to expand things out yourself.

For the linear regression problem which is a special case that you can solve analytically, the first step of J is

J(w) = \frac{1}{m}\sum_{i=0}^{m-1} {(wx^{(i)}-y^{(i)})^2}

I believe you can expand the square, and rewrite it to the form you like. Please do it yourself.

Raymond


Yes, it’s just algebra at that point, but (as I commented earlier) it’s a bit of a mess, because you have one quadratic term for every training data sample. Just start with the general form that Raymond gave you and write out all the m terms. Then you just have to expand all those polynomial terms and then “gather” the coefficients to get the form you are looking for. Go for it.

If you try to do that in full generality (meaning with x_i and y_i instead of specific values), you’ll probably eventually end up with the Normal Equation. That is the general “closed form” solution for Linear Regression that Raymond mentioned above. But it’ll be a fair amount of work to get there.
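
A quick way to see the “gather the coefficients” step numerically (the sample values here are made up for illustration): expanding (wx^{(i)} - y^{(i)})^2 = {x^{(i)}}^2 w^2 - 2x^{(i)}y^{(i)} w + {y^{(i)}}^2 and averaging term by term gives J(w) = aw^2 + bw + c.

```python
# Gathering the coefficients of J(w) = a*w^2 + b*w + c from the data.
samples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # hypothetical (x, y) pairs
m = len(samples)

a = sum(x * x for x, y in samples) / m           # coefficient of w^2
b = sum(-2 * x * y for x, y in samples) / m      # coefficient of w
c = sum(y * y for x, y in samples) / m           # constant term

w_best = -b / (2 * a)   # vertex of the parabola
```

Once a, b, and c are in hand, the vertex formula gives the minimizing w directly.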

Thanks @rmwkwok and @paulinpaloalto . Yes, I see that this is only for linear regression and wouldn’t be applicable for more complex problems, eg neural nets.

I think I have enough info now to try working this out on my own. If I can come up with a little walkthrough / reasoning, I will post it here.

Thank you!

Hi @ybakos,

I see a lot of good answers but I think the best answer in my opinion is missing. I’ll add it here.

The answer to your question is yes and there is a nice formula for the solution. For completeness I’ll explain how it works not just in the single variable case, but in the multivariable case (so the single variable case follows from this). We have a matrix \mathbf{X} of examples (with a column of all 1’s added at the start to deal with the bias) and a vector \mathbf{y} of targets. Then the cost function can be written

J(w) = \frac{1}{m} ||Xw-y||^2 = \frac{1}{m}(Xw-y)^\top (Xw-y) = \frac{1}{m}(w^\top X^\top X w - 2y^\top Xw+y^\top y)

so that (using the rules of matrix differentiation which are not taught in this course) we have

\nabla_wJ(w) = \frac{1}{m}(2w^\top X^\top X - 2y^\top X)

In order to have a global minimum the gradient must be zero so we find that (after taking the transpose of the equation)

X^\top Xw = X^\top y

Under the assumption that the columns of X are linearly independent, it follows that X^\top X is invertible and therefore we have

w = (X^\top X)^{-1} X^\top y.

This is what is usually called the “normal equations” for the solution of linear regression. Even though there is a closed-form formula for the global minimum (which is not the case for most machine learning cost functions) it is still usually better to use gradient descent because computing X^\top X often causes more computational issues than it is worth (when there is a large dataset).
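
As a sanity check, here is a pure-Python sketch of the normal equation for one feature plus a bias, on a made-up three-point dataset; with the column of ones, X^T X is a 2x2 matrix we can solve by hand (all variable names here are my own):

```python
# Normal equation for y ~ b + w*x: solve (X^T X) [b, w] = X^T y,
# where X has a column of ones (bias) and a column of the xs.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 1.0]
m = len(xs)

# entries of X^T X and X^T y
s1, sx, sxx = m, sum(xs), sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# solve the 2x2 system by Cramer's rule
det = s1 * sxx - sx * sx
b = (sy * sxx - sx * sxy) / det   # intercept
w = (s1 * sxy - sx * sy) / det    # slope
```

For a large dataset you would use a library least-squares solver rather than forming X^T X explicitly, for the computational reasons mentioned above.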

I hope that provides some useful information!

Best,
Alex

Edit: Sorry, I didn’t notice but @rmwkwok did indeed mention the normal equations

Edit 2: Oops, also @paulinpaloalto mentioned this. In any case, I hope the extra level of detail is useful.

I appreciate the explanation. A bit over my head, lol. I’ll study it.


Ok, so I have worked out an answer to my question, and I’ll post a walkthrough here to help explain what I was hoping to achieve. I understand this is not broadly applicable, and I’m not even sure I am mathematically sound here.

This is an exploration, not a suggestion.

Let’s start with Dr. Ng’s example in Cost Function Intuition, but with a different training set.

Given the two-dimensional training data set, (1, 2), (2, 3) and (3, 1)

And given the constraint that our line of best fit will have a b or y-intercept of 0

And given the simplified cost function of mean squared error, with just parameter w (no b)

Determine the optimum w that minimizes J(w), by completing a quadratic equation and computing the vertex.

But first, let’s do what Dr. Ng does to build some intuition. Let’s consider the lines for three contrived values of w, and compute the cost, J(w).

Next, let’s visualize these three contrived values of w against their cost, J(w). It might look something like this:

“But wait!” I say, “If you show me a parabola, I know we can find its vertex, if we have enough information, such as the general quadratic form of the parabola, or the vertex form of the parabola.” In this contrived, simplified J(w) mean squared error situation, we can indeed come up with that form of the parabola.

We can expand m, x and y using the values in the training set, to express J(w) as a quadratic equation.

Given we have the general form, we identify a as 14/3 and b as -22/3. We can then compute the vertex with -b/2a:

This informs us that w ~= 0.79 should give us the approximate minimum of J(w). If we compute J(0.79), we get approximately 1.78.

Let’s graph the line and compute the mean squared error:

So, in this simplified scenario, where we compute the mean squared error with the constraint b = 0, we can indeed find the optimal w: expand the cost function with each value of x and y in the training set, simplify the terms into an approachable form (e.g. general quadratic or vertex form), and directly compute the vertex (e.g. -b/2a).
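
The walkthrough can also be checked with a few lines of Python (a sketch over the training set above, with b fixed at 0), reproducing the w ≈ 0.79 and J ≈ 1.78 values:

```python
# J(w) = a*w^2 + b*w + c for the training set (1,2), (2,3), (3,1)
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
m = len(data)

a = sum(x * x for x, y in data) / m          # 14/3
b = -2 * sum(x * y for x, y in data) / m     # -22/3
w_best = -b / (2 * a)                        # 11/14, about 0.786

cost_at_vertex = sum((w_best * x - y) ** 2 for x, y in data) / m
```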

Yes this is exactly correct. Good work! We can also see this from the perspective of the normal equations as follows:

With the b=0 assumption, we do not need the column of 1’s in our X matrix so

X = \begin{pmatrix}1 \\ 2 \\ 3\end{pmatrix}

and

y = \begin{pmatrix}2\\3\\1\end{pmatrix}.

The normal equations tell us the solution is

w=(X^\top X)^{-1}X^\top y = \left(\begin{pmatrix}1 & 2 & 3\end{pmatrix}\begin{pmatrix}1 \\ 2 \\ 3\end{pmatrix}\right)^{-1}\begin{pmatrix}1 & 2 & 3\end{pmatrix} \begin{pmatrix}2\\3\\1\end{pmatrix} = \frac{11}{14}.


Great example Joe @ybakos!

Below is equivalent to Alex’s and your work,

J(w) = \frac{1}{m}\sum_{i=0}^{m-1} {(wx^{(i)}-y^{(i)})^2} = aw^2 + bw + c

where a = \frac{1}{m}\sum_{i=0}^{m-1} {x^{(i)}}^2, b = -2\frac{1}{m}\sum_{i=0}^{m-1}x^{(i)}y^{(i)}, c = \frac{1}{m}\sum_{i=0}^{m-1} {y^{(i)}}^2

Plugging in your numbers, a = \frac{14}{3}, b=-\frac{22}{3}, and thus w_{opt} = -\frac{b}{2a} = \frac{11}{14}

In other words, w_{opt} = -\frac{b}{2a} = \frac{\sum_{i=0}^{m-1}x^{(i)}y^{(i)}}{\sum_{i=0}^{m-1} {x^{(i)}x^{(i)}}}
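
That closed form is easy to check numerically on the same dataset (a one-line sketch):

```python
# w_opt = sum(x*y) / sum(x*x) for the training set (1,2), (2,3), (3,1)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.0, 1.0]
w_opt = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)  # 11/14
```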

Joe, your idea is right. I particularly agree that:

Indeed, if we have the full curve (i.e. we know the cost value for all w), then we can get the optimal w directly, instead of having to “explore” for it, which is what gradient descent does. Knowing the cost to be a parabola, we need only 3 points to determine the full curve. When the cost is not a parabola, however, we may need so many more points that gradient descent becomes more effective (in terms of number of computations) at finding any local optimum.

Cheers,
Raymond
