As you say and for the reasons that Ken explained, it can actually happen that you get exactly 0 or 1 as the output of sigmoid in floating point. In *float64* I think it only takes roughly z ≥ 37 to get exactly 1. There are a couple of ways you can write the code to detect that this has happened and avoid getting NaN or Inf as your cost value. Here’s a fairly recent thread on which we discussed that.
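Here is a minimal sketch of both points: the saturation itself, and one of the common ways to guard the cost computation (clipping the activations away from exactly 0 and 1; the helper name `safe_cost` and the `eps` value are just illustrative, not from any particular assignment):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# In float64, once exp(-z) drops below about half of machine epsilon,
# 1 + exp(-z) rounds to exactly 1.0, so sigmoid returns exactly 1.0.
print(sigmoid(np.float64(40.0)) == 1.0)

# One guard strategy: clip A away from exactly 0 and 1 before the logs,
# so the cross-entropy cost stays finite even with saturated activations.
def safe_cost(A, Y, eps=1e-15):
    A = np.clip(A, eps, 1.0 - eps)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

A = np.array([[1.0, 0.0, 0.9]])   # saturated activations included
Y = np.array([[1.0, 0.0, 1.0]])
print(safe_cost(A, Y))            # finite, no NaN or Inf
```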

It’s an interesting thought to try random initialization instead of zero initialization. My guess is that it would probably not prevent the eventual saturation, since that’s a property of the shape of the cost surface, but it might change the number of iterations it takes you to get there. But this is just a guess: if you have a case that saturates with zero initialization, try random and see if it makes a difference. Let us know what you discover if you run that experiment!
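If you want to try it, here is one way the experiment could be sketched. Everything here is made up for illustration (the toy data, the learning rate, the init scale): we run plain logistic-regression gradient descent from each initialization and record the first iteration at which any activation hits exactly 0.0 or 1.0:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_saturation_iter(w_init, b, X, Y, lr=1.0, iters=100):
    """Run gradient descent; return the first iteration at which any
    activation is exactly 0.0 or 1.0, or None if that never happens."""
    w = w_init.copy()
    m = X.shape[1]
    for i in range(iters):
        A = sigmoid(w.T @ X + b)
        if np.any(A == 1.0) or np.any(A == 0.0):
            return i
        dZ = A - Y
        w -= lr * (X @ dZ.T) / m
        b -= lr * np.sum(dZ) / m
    return None

# Toy, well-separated data that drives the weights large quickly.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-10, 0.5, (2, 50)), rng.normal(10, 0.5, (2, 50))])
Y = np.hstack([np.zeros((1, 50)), np.ones((1, 50))])

zero_iter = first_saturation_iter(np.zeros((2, 1)), 0.0, X, Y)
rand_iter = first_saturation_iter(rng.normal(0, 0.01, (2, 1)), 0.0, X, Y)
print(zero_iter, rand_iter)  # compare how soon each init saturates
```

On data this separable, both initializations saturate almost immediately, which is consistent with the guess that random init changes the timing at most, not the outcome.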

Perhaps the one other interesting point (which you made in your initial post) is that the progress of back propagation is not impeded by getting a NaN value for J: the actual J value is not really used, other than as an inexpensive to compute proxy for how well your convergence is working. The gradients are expressed as functions and still have valid values even when J is NaN. But you are driving blind about convergence, as you said, so you would need to use another metric, e.g. prediction accuracy, but that is more work to compute. Or use one of the strategies on that other thread to avoid the NaN issue “artificially”.