The cost will become NaN if your \hat{y} value rounds to exactly 0 or 1. The cross-entropy cost contains \log(\hat{y}) and \log(1 - \hat{y}), so a saturated prediction produces \log(0) = -\infty, and when that gets multiplied by a zero label you end up with 0 \cdot (-\infty), which is NaN. Of course \hat{y} is the output of the sigmoid
function, so in pure math over \mathbb{R} it can never be exactly 0 or 1. But in floating point everything is an approximation, and we can end up with exactly 0 or 1. You have several approaches to deal with that:
- The first approach is to understand in more detail what is happening, e.g. instrument your code to track how close the values are getting to 0 or 1 (see the first sketch after this list). In 64-bit floating point, I believe z greater than about 36 is enough to give you sigmoid(z) = 1 exactly. If you are saturating, maybe you need a smaller learning rate or a smaller iteration count. Of course it also matters how accurate your predictions are: if the saturated values are confident and correct, the problem is mainly in the cost calculation rather than the model itself.
- You can also put a defense mechanism into your cost logic to protect against \hat{y} rounding to 0 or 1, e.g. by clipping the values away from the endpoints before taking the log (a sketch of that follows below). Here’s a thread which discusses that in more detail.
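To make the first point concrete, here is a minimal sketch in plain NumPy. It shows where float64 sigmoid saturates and the kind of instrumentation you could drop into a training loop; the names Z and A are just placeholders I'm assuming for your pre-activation and activation arrays, not anything from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Where does float64 saturate?  The output rounds to exactly 1.0 once z
# gets into the mid-30s (around z > 36 or so).
for z in [30.0, 35.0, 36.0, 37.0, 40.0]:
    a = sigmoid(z)
    print(f"z = {z:5.1f}  sigmoid(z) = {a:.17g}  exactly 1.0? {a == 1.0}")

# Instrumentation you could add to a training loop.  Z and A are
# hypothetical names for the pre-activation and activation arrays:
#   print("max |Z| =", np.max(np.abs(Z)),
#         " min A =", np.min(A), " max A =", np.max(A))
```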
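And here is a minimal sketch of the second idea: clip \hat{y} away from exactly 0 and 1 before taking the log. The function name safe_cost and the eps value are my own choices for illustration, not something from the course or the linked thread.

```python
import numpy as np

def safe_cost(Y, A, eps=1e-12):
    """Binary cross-entropy, with A clipped away from exactly 0 and 1."""
    A = np.clip(A, eps, 1.0 - eps)   # guard against log(0)
    m = Y.shape[-1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# A fully saturated but correct prediction: without the clip, the term
# (1 - Y) * log(1 - A) would evaluate 0 * log(0) = 0 * (-inf) = nan.
Y = np.array([[1.0, 0.0]])
A = np.array([[1.0, 0.0]])
print(safe_cost(Y, A))   # small finite number instead of nan
```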