Does a high value of the cost function cause a bigger step downhill in gradient descent?

Hello folks,

I have a question about how a high value of the cost (loss) affects the parameter update in gradient descent. Suppose, for example, that in logistic regression the true output y^{(i)} is 0 and our model's prediction f_{w,b} is 1. In this case the binary cross-entropy loss penalizes the prediction with a very high value (approaching infinity). What happens next? Does this large loss cause a bigger step downhill (or in any other direction) towards the right value? How does the gradient of the loss function behave when the cost (loss) is this large?

Thank you very much,
Ivomar

Hi Ivomar,

The loss can be very high, but the gradient with respect to a weight isn’t proportional to the loss. It is proportional to the corresponding feature values and the probability errors, and in your example the probability error is just 1-0=1.

The gradient formula:
\frac{\partial{J}}{\partial{w_j}} = \frac{1}{m}\sum_{i}{(a^{(i)} - y^{(i)})}x_j^{(i)}
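
As a minimal sketch of the formula above (the function names, weights, and feature values are illustrative assumptions, not taken from the course), here is the gradient for a single sample whose true label is 0 but whose predicted probability is almost 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_w(X, y, w, b):
    """dJ/dw_j = (1/m) * sum_i (a_i - y_i) * x_ij, computed for all j at once."""
    a = sigmoid(X @ w + b)        # predicted probabilities, shape (m,)
    error = a - y                 # probability errors, each strictly between -1 and 1
    return X.T @ error / X.shape[0]

# One sample with true label 0 that the model predicts as almost 1.
X = np.array([[2.0, -1.0]])
y = np.array([0.0])
w = np.array([5.0, -5.0])         # hypothetical weights, chosen so that z = 15
b = 0.0

a = sigmoid(X @ w + b)
loss = -np.log(1.0 - a[0])        # cross-entropy loss for a sample with y = 0
grad = logistic_gradient_w(X, y, w, b)

print("loss:", loss)
print("gradient:", grad)
```

Under these assumptions the loss is large, yet the gradient is approximately the feature vector itself, because the error is approximately 1.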

Cheers!
Raymond

Yes, I think that when the cost is higher, the steps taken are larger, since the alpha value is kept the same. That is why, as we get nearer to the global minimum and the cost is low, the step taken is very small.
PS: Do let me know if I am wrong!

Hey Ardent,

To be more accurate, the step is proportional to the gradient (and \alpha), and the gradient is proportional to the probability error (and the feature values), not to the cost value. You might say that the higher the error, the higher the cost, but it is the error that decides the gradient, which in turn decides the step size; the cost value does not.

The formula again:
\frac{\partial{J}}{\partial{w_j}} = \frac{1}{m}\sum_{i}{(a^{(i)} - y^{(i)})}x_j^{(i)}
a^{(i)} is the predicted probability for sample i, y^{(i)} is the true label, and (a^{(i)} - y^{(i)}) is the error I have been talking about.

So, as we get near the global minimum, the cost is low, but the key point is that the probability errors are also low, so the gradient is small and the step size shrinks.
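
To illustrate, here is a rough sketch of gradient descent with a fixed \alpha on a made-up one-feature data set (the data, the learning rate, and the iteration count are all assumptions chosen for illustration). It records the length of each step, which is \alpha times the gradient norm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data; the labels overlap, so the cost has a finite minimum.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w = np.zeros(1)
b = 0.0
alpha = 0.5                       # learning rate, kept fixed throughout
step_sizes = []

for _ in range(2000):
    a = sigmoid(X @ w + b)        # predicted probabilities
    error = a - y                 # probability errors
    grad_w = X.T @ error / len(y) # dJ/dw
    grad_b = error.mean()         # dJ/db
    # Length of this update: alpha times the norm of the full gradient.
    step_sizes.append(alpha * np.sqrt(np.sum(grad_w ** 2) + grad_b ** 2))
    w -= alpha * grad_w
    b -= alpha * grad_b

print("first step:", step_sizes[0])
print("last step: ", step_sizes[-1])
```

The learning rate never changes here; the steps shrink only because the error-weighted sum in the gradient formula shrinks.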

Raymond

What’s the difference, for logistic regression, between a very high cost value (e.g., close to infinity) and a cost value of, say, 5? Shouldn’t the near-infinite cost trigger a much more aggressive move of gradient descent in the right direction than a cost value of 5?

Thanks again for your explanations,
Ivomar

Hi Ivomar,

(I am sticking with the course’s convention of using “loss” for the loss of a single sample, as opposed to “cost”, which refers to the total loss over all samples.)

We see a high loss value for a sample, because the probability error is high, not the other way around. On the other hand, gradient descent tells us that the step size depends on the gradient which depends on the probability error, so in this sense the probability error is really the center of the discussion.

Because we have a high error, our loss value is high, and because we have a high error, the gradient descent step is large. The idea I am trying to convey is that the loss value itself does not drive gradient descent.

Loss for a sample: -y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1- a^{(i)})
Gradient contributed by a sample: (a^{(i)} - y^{(i)})x^{(i)}
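
One way to see why the loss value itself drops out is the chain rule. The derivation below is a sketch assuming the usual sigmoid model, a = \sigma(z) with z = w \cdot x + b (the symbols z and \sigma are introduced here for illustration, and the sample index (i) is omitted for readability):

```latex
\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)},
\qquad
\frac{\partial a}{\partial z} = a(1-a),
\qquad
\frac{\partial z}{\partial w_j} = x_j

\frac{\partial L}{\partial w_j}
  = \frac{a-y}{a(1-a)} \cdot a(1-a) \cdot x_j
  = (a-y)\,x_j
```

The factor 1/(a(1-a)), which makes the slope of the loss blow up as a approaches 0 or 1, is cancelled exactly by the sigmoid derivative a(1-a), leaving only the bounded error times the feature value.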

So if y^{(i)}=0,
when a^{(i)}=0.9, the loss is about 2.3, and the gradient contributed is 0.9x^{(i)},
when a^{(i)}=0.1, the loss is about 0.1, and the gradient contributed is 0.1x^{(i)}.

The loss value is unbounded, but the magnitude of the probability error is bounded between 0 and 1. And indeed it is the error that determines the gradient descent step, not the loss value.
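
The same comparison can be extended to predictions that get closer and closer to 1. This is a small sketch using only the two formulas above; the list of a values is an arbitrary choice for illustration:

```python
import math

def sample_loss(a, y):
    """Binary cross-entropy loss for one sample."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

y = 0
rows = []
for a in [0.1, 0.9, 0.99, 0.9999, 0.999999]:
    # (prediction, loss, magnitude of the probability error)
    rows.append((a, sample_loss(a, y), abs(a - y)))

for a, loss_value, err in rows:
    print(f"a={a:<9} loss={loss_value:7.3f}  |a - y|={err:.6f}")
```

The loss column keeps growing as a approaches 1, while the error column never exceeds 1, so the gradient contribution never exceeds the feature values in magnitude.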

Cheers!
Raymond

Thank you a lot, that clarifies my doubts!

You are welcome Ivomar!

Thank you for this clarification. I was under the impression that cost and error are basically the same thing.

Hey Ardent, it’s great to hear that my answer helped. Happy learning!
Raymond