Regarding Logistic Regression Cost Function

Why doesn’t the sigmoid + log cost function break when predictions are exactly 0 or 1?
In binary classification, we often use the sigmoid activation function followed by a cross-entropy (log) loss. The loss includes terms like log(ŷ) and log(1 - ŷ). Mathematically, this becomes problematic when the predicted probability ŷ is exactly 0 or 1, since log(0) is undefined and tends toward negative infinity.
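For example, evaluating the raw formula directly in NumPy (just a quick check of the math, not any library's actual implementation) shows the problem:

```python
import numpy as np

def bce(y, y_hat):
    # Binary cross-entropy written exactly as in the math: -(y*log(ŷ) + (1-y)*log(1-ŷ))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1.0, 0.9))   # ~0.105, a normal loss value
print(bce(1.0, 0.0))   # inf  -- the log(0) term (NumPy only warns, it doesn't crash)
print(bce(1.0, 1.0))   # nan  -- 0 * -inf in the (1 - y) * log(1 - ŷ) term
```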
Yet, in practice, the training doesn’t crash or explode. How is this issue handled under the hood? Are there numerical tricks or smoothing techniques applied by libraries (like TensorFlow or PyTorch) to prevent instability?
I want to understand this since it was not mentioned in either of the course videos how this is handled. I suppose we should use some small epsilon or something similar when taking the log, etc…
Thanks

1 Like

The prediction can never be exactly those numbers because it is a probability prediction and always lies strictly between 0% and 100%. No machine learning model can give an output with total certainty.

Apparently, the softmax implementation in popular libraries also uses a small epsilon value to prevent such scenarios.
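Just to sketch the idea (an illustrative example, not the actual TensorFlow or PyTorch source; the eps value of 1e-7 is an arbitrary choice here): the predictions get clipped a tiny distance away from 0 and 1 before the log is taken.

```python
import numpy as np

def safe_bce(y, y_hat, eps=1e-7):
    # Clip predictions into [eps, 1 - eps] so log() never sees exactly 0 or 1.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

print(safe_bce(1.0, 1.0))   # ~1e-7  instead of nan
print(safe_bce(1.0, 0.0))   # ~16.1  instead of inf
```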

3 Likes

I see… thanks!
I was just wondering: we can have hundreds of thousands of features/training examples, and a single instance could hit this particular “trap” and “destroy” our model, making it unstable.
I never saw anyone mention it or ways to avoid it, though I know some libraries have built-in functions for that.

Thanks, Sir!

2 Likes

I’m moving this thread out of the “Introductions” forum, and into “AI Discussions”.

1 Like

It’s an interesting question with several levels to the answer:

From a purely mathematical standpoint, sigmoid(z) can never exactly equal 0 or 1. So you can’t end up with log(0) in the cross entropy loss function if you’re doing “pure math”.

But we don’t have the luxury of using “pure” math and the abstract beauty of \mathbb{R} here: we’re stuck with the limitations of finite floating point representations. In 32 bit or 64 bit floating point, the value of sigmoid can round to (“saturate” is the technical term) either 0 or 1 exactly. In that case, your loss function value can end up as either Inf or NaN.
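Here is a quick NumPy illustration of the saturation (the exact cutoff depends on the precision, but the effect is the same):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.float32(40.0)        # a large but perfectly plausible pre-activation value
y_hat = sigmoid(z)          # e^-40 is far below machine epsilon, so 1 + e^-40 rounds to 1.0
print(y_hat == 1.0)         # True: the sigmoid has saturated
print(np.log(1.0 - y_hat))  # -inf, so the loss ends up as Inf or NaN depending on the label
```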

But that doesn’t really wreck the convergence of back propagation, because we don’t actually use the final J or L value in that computation. We use the derivative of the loss function and you can see from the back prop formulas that the derivatives are fine if \hat{y} == 0 or 1.
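Concretely, for sigmoid plus cross entropy the factors that blow up cancel in the chain rule:

\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}), \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y

and \hat{y} - y is perfectly finite even when \hat{y} saturates to exactly 0 or 1.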

If you want to make sure that your J values remain interpretable, you can add a defense mechanism against the saturation, as described on this thread. But as pointed out in the previous paragraph, that is not necessary in order for back propagation to continue working even in the saturation case. You can use accuracy instead of J as the proxy for how your convergence is working.
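As a tiny illustration of that last point (made-up numbers, just to show the idea):

```python
import numpy as np

y     = np.array([1, 0, 1, 1])           # true labels
y_hat = np.array([1.0, 0.0, 0.97, 1.0])  # three predictions have saturated to exactly 0 or 1

# The raw cost is not interpretable here, because log(0) shows up:
with np.errstate(divide="ignore", invalid="ignore"):
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(J)  # nan

# ... but accuracy still works fine as a convergence proxy:
accuracy = np.mean((y_hat >= 0.5) == y)
print(accuracy)  # 1.0
```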

4 Likes

Thanks, Paul!
Yep, now it makes sense.

I appreciate your time and attention!

I remember I once abandoned a course that said “you don’t need to know” about some corner cases of matrix multiplication when calculating gradient descent. I had to figure it out myself and got dragged into an unfortunately complicated math site. It would be very beneficial if the course provided ELI5 links to help fill these knowledge gaps.

1 Like