Logistic Regression cost function with rounded-off sigmoid calculations

I was wondering about an issue that might occur with the sigmoid function when using it to calculate the cost of a Logistic Regression algorithm. We know that mathematically it can never equal exactly 0 or 1; it only tends towards those values as the input becomes very large or very small.

But in computers, the calculation will eventually round off after enough decimal digits. So if we then compute the cost of these results by taking the log of the rounded-off 0s, or the log of 1 minus the rounded-off 1s, we get an error (the log of 0 is undefined).
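To make the issue concrete, here is a minimal NumPy sketch of what I mean (a toy example, not the assignment code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a = sigmoid(40.0)
print(a)                    # exactly 1.0 in float64

# Label is 0 but a has rounded to exactly 1: the cost term -log(1 - a) blows up.
print(-np.log(1 - a))       # inf (with a divide-by-zero warning)

# Label is 1 with a == 1: the "dead" term (1 - y) * log(1 - a) becomes 0 * -inf.
y = 1.0
print(-(y * np.log(a) + (1 - y) * np.log(1 - a)))   # nan
```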

It only really happens when the optimization is completely off (predicting close to 1 when the label is 0, and vice versa), but that could easily be the case in the initial iterations, before the optimization has made much progress.

In an implementation we could simply skip calculating the cost, since it's really only the gradients we need, but then we lose the ability to see whether the cost is decreasing at each iteration.

My question is: what is the convention for avoiding this type of situation?

I am not sure that “computers will eventually round off after enough decimal digits” is an accurate portrayal of what happens in these situations. Computers are limited by the floating-point precision available for very small numbers (whatever 32 or 64 bits can handle), and the same holds for extremely large numbers (in absolute value). If that number is the linear activation z, the derivative (slope) of the sigmoid activation will “saturate”, i.e. the slope approaches zero. Eventually Python/NumPy will complain and you will start seeing NaN values.
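As a quick illustration of that saturation (just a toy NumPy sketch):

```python
import numpy as np

z = 40.0
a = 1 / (1 + np.exp(-z))   # sigmoid saturates to exactly 1.0 in float64
slope = a * (1 - a)        # derivative of sigmoid in terms of its output
print(a, slope)            # 1.0 0.0 -> the slope has underflowed to zero
```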

This problem is typically avoided by “normalizing” the data so that it becomes unlikely that this situation arises. You will learn more about this in Course 2; your key phrase there will be “batch norm,” so hang on.
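As a rough sketch of what input normalization looks like (the shapes and the examples-in-columns convention here are my own assumptions):

```python
import numpy as np

# Hypothetical raw training data: 3 features, 100 examples, one example per column.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(3, 100))

# Standardize each feature: subtract its mean and divide by its standard deviation.
mu = X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, keepdims=True)
X_norm = (X - mu) / (sigma + 1e-8)   # small epsilon guards against a constant feature

# With standardized inputs, z = w.T @ X_norm + b tends to stay in a moderate range,
# so sigmoid(z) is far less likely to saturate to exactly 0 or 1.
```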

The second point: we cannot avoid calculating the cost, because it is the “objective function” (i.e. what we are trying to minimize) of the gradient descent algorithm. As you note, without it we cannot “see” how the cost is evolving during gradient descent. As @paulinpaloalto might say, “nothing good can come from that.”

Yeah, I meant that with enough digits the floating-point representation will eventually truncate or round off. Sigmoid of 100 in Python evaluates to exactly 1, for example.

Ok cool, so normalizing will reduce the likelihood of this occurring. Would randomizing the initial weights (instead of zeroing) also reduce the likelihood? I’ll be looking forward to Course 2 to see this in more depth.

Yeah, not including the cost does sound like driving with your eyes closed, probably not the best idea.

Thank you for your response, it was really helpful.

As you say, and for the reasons that Ken explained, it can actually happen that you get exactly 0 or 1 as the output of sigmoid in floating point. In float64, I think it only takes z > 35 or so to get exactly 1. There are a couple of ways you can write the code to detect that this has happened and avoid getting NaN or Inf as your cost value. Here’s a fairly recent thread on which we discussed that.
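For example, one simple strategy is to clip the activations away from exactly 0 and 1 before taking the logs. Here is an illustrative sketch (assuming Y has shape (1, m)), not necessarily the exact approach on that thread:

```python
import numpy as np

def safe_cost(A, Y, eps=1e-12):
    # Clip the activations away from exactly 0 and 1 so that log() stays finite.
    # This only changes the reported cost in the already-saturated cases.
    m = Y.shape[1]                        # assumes Y has shape (1, m)
    A_clipped = np.clip(A, eps, 1 - eps)
    return -np.sum(Y * np.log(A_clipped) + (1 - Y) * np.log(1 - A_clipped)) / m
```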

It’s an interesting thought to try random initialization instead of zero initialization. My guess is that it would probably not prevent the eventual saturation, since that’s a property of the shape of the cost surface, but it might change the number of iterations it takes to get there. But this is just a guess: if you have a case that saturates with zero initialization, try random and see if it makes a difference. Let us know what you discover if you run that experiment!

Perhaps the one other interesting point (which you made in your initial post) is that the progress of back propagation is not impeded by getting a NaN value for J: the actual J value is not really used, other than as an inexpensive-to-compute proxy for how well your convergence is working. The gradients are expressed as separate functions and still have valid values even when J is NaN. But you are driving blind with respect to convergence, as you said, so you would need to use another metric, e.g. prediction accuracy, but that is more work to compute. Or use one of the strategies on that other thread to avoid the NaN issue “artificially”.
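To make that concrete, here is a minimal sketch (the shapes and the function name are my own assumptions, not the assignment’s code): the gradients involve only A - Y, not the logs, so they remain finite even when the cost does not.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    """Hypothetical helper: X is (n, m), Y is (1, m), w is (n, 1), b is a scalar."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)          # shape (1, m)
    # The gradients depend only on (A - Y), not on log(A) or log(1 - A),
    # so they stay finite even when the cost has become inf or nan.
    dw = (X @ (A - Y).T) / m          # shape (n, 1)
    db = np.sum(A - Y) / m
    # Prediction accuracy: an alternative (but more expensive) progress metric.
    accuracy = np.mean((A > 0.5) == Y)
    return dw, db, accuracy
```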

Awesome, thanks for that link.

I did try an experiment with random weights vs. zero weights at initialization, and it seemed not to make a difference. Actually, it might have made things slightly worse.

The only way I got it not to give me NaN values was to decrease the learning rate, which makes sense, I think, as it was probably overshooting the minimum. With random initialization, even decreasing the learning rate didn’t stop it (but maybe I didn’t run enough tests).

However, in all cases it did converge towards a low cost, even after giving some NaN values at first. For now, I’ll leave this experiment with a low learning rate and zero initialization.

Thank you again for your explanations.

EDIT: Just for clarity, this issue was not occurring in the assignment; I was applying the same algorithm to a similar dataset with my local Python and NumPy installation and found that I was getting some NaN values.

Cool! Thanks for sharing your experimental results. So we all learned something from this discussion, which is how we hope it should work! Onward! :nerd_face: