Regarding Logistic Regression Cost Function

Why doesn’t the sigmoid + log cost function break when predictions are exactly 0 or 1?
In binary classification, we often use the sigmoid activation function followed by a cross-entropy (log) loss. The loss includes terms like log(ŷ) and log(1 - ŷ). Mathematically, this becomes problematic when the predicted probability ŷ is exactly 0 or 1, since log(0) is undefined and tends toward negative infinity.
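For example, evaluating the raw formula directly in NumPy (just a quick check of the math, not any library's actual implementation) shows the problem:

```python
import numpy as np

def bce(y, y_hat):
    # Binary cross-entropy written exactly as in the math: -(y*log(ŷ) + (1-y)*log(1-ŷ))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1.0, 0.9))   # ~0.105, a normal loss value
print(bce(1.0, 0.0))   # inf  -- the log(0) term (NumPy only warns, it doesn't crash)
print(bce(1.0, 1.0))   # nan  -- 0 * -inf in the (1 - y) * log(1 - ŷ) term
```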
Yet, in practice, the training doesn’t crash or explode. How is this issue handled under the hood? Are there numerical tricks or smoothing techniques applied by libraries (like TensorFlow or PyTorch) to prevent instability?
I want to understand this since it was not mentioned in either of the course videos how this is handled. I suppose we should use some small epsilon or something similar when taking the log, etc…
Thanks

1 Like

The prediction can never be exactly those numbers because it is a probability prediction and always lies strictly between 0% and 100%. No machine learning model can give an output with total certainty.

Apparently, the softmax implementation in popular libraries also uses a small epsilon value to prevent such scenarios.
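Just to sketch the idea (an illustrative example, not the actual TensorFlow or PyTorch source; the eps value of 1e-7 is an arbitrary choice here): the predictions get clipped a tiny distance away from 0 and 1 before the log is taken.

```python
import numpy as np

def safe_bce(y, y_hat, eps=1e-7):
    # Clip predictions into [eps, 1 - eps] so log() never sees exactly 0 or 1.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

print(safe_bce(1.0, 1.0))   # ~1e-7  instead of nan
print(safe_bce(1.0, 0.0))   # ~16.1  instead of inf
```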

3 Likes

I see… thanks!
I was just wondering: we can have hundreds of thousands of features/training examples, and a single instance could hit this particular “trap” and “destroy” our model, making it unstable.
I never saw anyone mention it or ways to avoid it, though I know some libraries have built-in functions for that.

Thanks, Sir!

2 Likes

I’m moving this thread out of the “Introductions” forum, and into “AI Discussions”.

1 Like

It’s an interesting question with several levels to the answer:

From a purely mathematical standpoint, sigmoid(z) can never exactly equal 0 or 1. So you can’t end up with log(0) in the cross entropy loss function if you’re doing “pure math”.

But we don’t have the luxury of using “pure” math and the abstract beauty of \mathbb{R} here: we’re stuck with the limitations of finite floating point representations. In 32 bit or 64 bit floating point, the value of sigmoid can round to (“saturate” is the technical term) either 0 or 1 exactly. In that case, your loss function value can end up as either Inf or NaN.
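Here is a quick NumPy illustration of the saturation (the exact cutoff depends on the precision, but the effect is the same):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.float32(40.0)        # a large but perfectly plausible pre-activation value
y_hat = sigmoid(z)          # e^-40 is far below machine epsilon, so 1 + e^-40 rounds to 1.0
print(y_hat == 1.0)         # True: the sigmoid has saturated
print(np.log(1.0 - y_hat))  # -inf, so the loss ends up as Inf or NaN depending on the label
```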

But that doesn’t really wreck the convergence of back propagation, because we don’t actually use the final J or L value in that computation. We use the derivative of the loss function and you can see from the back prop formulas that the derivatives are fine if \hat{y} == 0 or 1.
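Concretely, for sigmoid plus cross entropy the factors that blow up cancel in the chain rule:

\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}), \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y

and \hat{y} - y is perfectly finite even when \hat{y} saturates to exactly 0 or 1.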

If you want to make sure that your J values remain interpretable, you can add a defense mechanism against the saturation, as described on this thread. But as pointed out in the previous paragraph, that is not necessary in order for back propagation to continue working even in the saturation case. You can use accuracy instead of J as the proxy for how your convergence is working.
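As a tiny illustration of that last point (made-up numbers, just to show the idea):

```python
import numpy as np

y     = np.array([1, 0, 1, 1])           # true labels
y_hat = np.array([1.0, 0.0, 0.97, 1.0])  # three predictions have saturated to exactly 0 or 1

# The raw cost is not interpretable here, because log(0) shows up:
with np.errstate(divide="ignore", invalid="ignore"):
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(J)  # nan

# ... but accuracy still works fine as a convergence proxy:
accuracy = np.mean((y_hat >= 0.5) == y)
print(accuracy)  # 1.0
```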

4 Likes

Thanks, Paul!
Yep, now it makes sense.

I appreciate your time and attention!

I remember I once abandoned a course that said “you don’t need to know” about some corner cases of matrix multiplication when calculating gradient descent. I had to figure it out myself and got dragged into an unfortunately complicated math site. It would be very beneficial if the course provided ELI5 links to help fill these knowledge gaps.

1 Like