A doubt on content cost function

Dear Mentor,

The normalization constant can be any number, and will be adjusted by the hyperparameter alpha.

May I know whether the normalization constant 1/2 (highlighted in yellow) in this lecture has any special meaning?

Thank you.

The only reason the 1/2 is there is so that when you compute the derivatives using the power rule, the 2 brought down from the exponent cancels with the 1/2, leaving a factor of 1.

Dear JJaassoonn,

When performing gradient descent, we look for the weights that minimize the cost function. In principle, the cost function J(x) and any monotonically increasing transformation of it, like (1/2) * J(x) (since 1/2 is a positive number), reach their minimum at the same value x = x_min.

The reason to add the (1/2) is simply convenience. When you take the derivative of the norm, since it is squared, you multiply by 2 (the derivative of x^2 is 2*x), which cancels with the (1/2) and gets rid of the numerical constant.
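
As a worked sketch of that cancellation, assume a generic squared-error cost for a single example. The prediction f(w) and the target y are placeholder symbols, not taken from the lecture:

```latex
J(w) = \frac{1}{2}\,\bigl(f(w) - y\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial w}
  = \frac{1}{2}\cdot 2\,\bigl(f(w) - y\bigr)\,\frac{\partial f}{\partial w}
  = \bigl(f(w) - y\bigr)\,\frac{\partial f}{\partial w}
```

Without the 1/2, the same derivative would simply carry an extra factor of 2.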

Dear Mr Antoni Munar Ara

Does it mean that the minimum costs of J(x) and (1/2)*J(x) are the same or very close to each other?

Thank you.

No, it means that the weights that give the minimum cost will be the same, regardless of whether you scale the cost by 1/2.
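
A small numerical check may make this concrete. This is only a sketch: the one-parameter cost below, with its minimum at w = 3, is a made-up example and not the cost from the lecture.

```python
def cost(w):
    # Hypothetical one-parameter cost with its minimum at w = 3.
    return (w - 3.0) ** 2 + 1.0

def half_cost(w):
    # The same cost scaled by 1/2.
    return 0.5 * cost(w)

# Candidate weights from -5.0 to 10.0 in steps of 0.1.
grid = [i / 10.0 for i in range(-50, 101)]

# Pick the grid point with the lowest value of each cost.
w_best_full = min(grid, key=cost)
w_best_half = min(grid, key=half_cost)

print(w_best_full, cost(w_best_full))       # 3.0 1.0
print(w_best_half, half_cost(w_best_half))  # 3.0 0.5
```

Both searches pick the same weight, while the minimum cost values differ by the factor 1/2.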

Dear Mr Tom Mosher,

J(w1) = min_cost_a
(1/2) * J(w2) = min_cost_b

w1 equals w2,
but min_cost_a does not equal min_cost_b.

May I know whether this is correct?

Thank you.

The exact value of the cost is not important.
Finding the weights that give the minimum cost is what matters.

Dear Mr Tom Mosher,

I am confused by this statement. For example:

Given 2 cost functions,

J(w1) = min_cost_a

(1/2) * J(w2) = min_cost_b

Suppose we perform gradient descent to learn the w1 and w2 that give the minimum cost, and the exact values of the cost are not important (so min_cost_a and min_cost_b may differ).

May I know how to get the weights w1 = w2 in this case?

I can understand w1 being close to w2 within some tolerance, but I would like to learn how to get w1 exactly equal to w2, i.e. w1 = w2.

Thank you.

Hi @JJaassoonn,

Consider a linear problem with squared error. Removing the scaling factor 1/2 only changes the cost function from the green line to the gold line, because the cost value at every w position is doubled.

So,

  • the scaling factor does not change the position of the optimal w,
  • but it does change the gradient values at each training step,
  • and so the sizes of the steps do change.
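
The three points above can be sketched in a few lines. The cost J(w) = (w - 3)^2, the starting point, and the learning rate are all made-up values chosen for illustration:

```python
def grad_full(w):
    # Gradient of the hypothetical cost J(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def grad_half(w):
    # Gradient of (1/2) * J(w) = (1/2) * (w - 3)^2.
    return w - 3.0

def run(grad, w0, alpha, steps):
    """Plain gradient descent; returns every w visited, including w0."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - alpha * grad(w)
        history.append(w)
    return history

# Same start and same learning rate for both costs.
full = run(grad_full, 0.0, 0.1, 5)
half = run(grad_half, 0.0, 0.1, 5)

# The first step is roughly 0.6 without the 1/2 and roughly 0.3 with it.
print("first step, J:      ", full[1] - full[0])
print("first step, (1/2)J: ", half[1] - half[0])
# After the same number of steps the two runs sit at different w values,
# although both are heading toward the same optimum w = 3.
print("after 5 steps:", full[-1], half[-1])
```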

Since gradient descent approaches the solution step by step, the step size will inevitably affect where it ends up. By now I believe I have explained the situation well enough that you can find the answer yourself.

However, if you would like to try, here is a question for you: if the weight update rule is w := w - \alpha \frac{\partial{J}}{\partial{w}}, what change can you make so that, before and after the removal of 1/2, the two training processes always end up at the same solution w value?

Cheers,
Raymond

Dear Mr Raymond,

I think the learning rate alpha applied to the gradient of the cost function (1/2)*J(w2) must be larger than the learning rate applied to the gradient of the cost function J(w1), so that its gradient descent steps keep up with those of J(w1) and both reach the same value w1 = w2.

May I know whether this is correct?

Thank you.

The two cost functions do not have to use exactly the same number of iterations or the same learning rate.

The point is to find the weights that give the minimum cost.

Hey @JJaassoonn,

If we remove the 1/2, the step sizes will be doubled, so in order for it to follow the same footsteps, the learning rate needs to be halved to reduce the step sizes back.

If you write the formula of both cases out, you will see that we are just passing the 1/2 between alpha and the gradient term.
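
Here is a minimal sketch of that equivalence, again on a made-up cost J(w) = (w - 3)^2. The function name `descend` and its parameters are hypothetical, introduced only for this illustration:

```python
def descend(cost_scale, alpha, w0=0.0, steps=20):
    """Gradient descent on cost_scale * (w - 3)^2, a hypothetical cost."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = cost_scale * 2.0 * (w - 3.0)  # d/dw of cost_scale * (w - 3)^2
        w = w - alpha * grad
        path.append(w)
    return path

alpha = 0.1
with_half = descend(cost_scale=0.5, alpha=alpha)         # cost (1/2) * J, rate alpha
without_half = descend(cost_scale=1.0, alpha=alpha / 2)  # cost J, rate alpha / 2

# Largest difference between the two trajectories over all steps.
max_gap = max(abs(a - b) for a, b in zip(with_half, without_half))
print(max_gap)  # zero, up to floating-point rounding
```

In both runs the product of the learning rate and the gradient is the same at every step, so the two trajectories coincide.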

Cheers,
Raymond

Dear Mr Raymond,

Thank you so much for your tutorial :smiley:.

Dear Mr Tom Mosher,

Thanks for your explanation.

You are welcome, @JJaassoonn!

Hi JJaassoonn,

Not necessarily. What it means is that the point x = x* at which J(x) and (1/2)*J(x) reach their minimum is the same, but the value of the minimum is different for J(x) and (1/2)*J(x) (in this case exactly one half, since it is multiplied by 0.5).

So, let x = x* be the point where J(x) reaches its minimum. Then J’(x) = (1/2) * J(x) (a new function, not the derivative) also has its minimum at x = x*. But the value of the function at the minimum can be different. We are interested in finding x*, not in the value of J(x) itself. If we can find x* using another function J’(x) for which it is easier to compute, then we take advantage of that.

In principle this trick works for any “monotonic transformation”. What is a monotonic transformation? If you have a function f(x), a new function g(x) derived from f(x) is said to be a monotonic (increasing) transformation if, for any x1 and x2 with f(x1) < f(x2), g(x1) < g(x2) is also true.
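
To illustrate, here is a rough sketch that applies a few increasing transformations to a made-up cost f(x) = (x - 2)^2 and looks for the minimizer on a grid. The particular transformations are my own choices for the example:

```python
import math

def f(x):
    # Hypothetical cost with its minimum at x = 2.
    return (x - 2.0) ** 2

# Each g below is strictly increasing in its argument v = f(x).
transforms = {
    "0.5 * f": lambda v: 0.5 * v,
    "f + 10": lambda v: v + 10.0,
    "exp(f)": math.exp,
    "log(1 + f)": math.log1p,
}

# Candidate x values from -3.0 to 7.0 in steps of 0.1.
grid = [i / 10.0 for i in range(-30, 71)]

# Minimizer of the original cost, then of each transformed cost g(f(x)).
x_star = min(grid, key=f)
minimizers = {
    name: min(grid, key=lambda x, g=g: g(f(x)))
    for name, g in transforms.items()
}

print(x_star)
print(minimizers)
```

Every transformed cost has its minimum at the same x* as f, even though the minimum values themselves are all different.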

Dear Mr Antoni Munar Ara,

Thank you so much for your guidance.