Cost function problem

There is one further point worth making here:

Even if you don’t fix your cost function to handle this case, it does no harm to the actual back propagation process. It may not be a priori obvious, but if you take a careful look at the back propagation logic, you will see that the actual scalar J value is not used anywhere. All we need are the gradients (derivatives) of J w.r.t. the various parameters. Because of the nice way that the derivative of sigmoid and the derivative of cross entropy loss work together, the vector derivative at the output layer ends up being:

dZ^{[L]} = \displaystyle \frac {\partial L}{\partial Z^{[L]}} = A^{[L]} - Y

so you can see that nothing bad happens to the derivatives if any of the A^{[L]} values are exactly 0 or 1.
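To see that concretely, here is a small NumPy sketch (the array names and values are just made up for illustration) comparing the raw cross entropy derivative w.r.t. A^{[L]} with the combined sigmoid + cross entropy gradient when some activations are exactly 0 or 1:

```python
import numpy as np

# Toy output activations, some of which are exactly 0 or 1
A_L = np.array([[0.0, 1.0, 0.3, 0.8]])
Y   = np.array([[1.0, 0.0, 1.0, 0.0]])

# Raw derivative of cross entropy w.r.t. A^[L]: divides by 0 when A is exactly 0 or 1
with np.errstate(divide="ignore", invalid="ignore"):
    dA_L = -(np.divide(Y, A_L) - np.divide(1 - Y, 1 - A_L))
print(dA_L)   # [[-inf  inf  -3.333...  5.]] -- blows up at the endpoints

# Combined sigmoid + cross entropy gradient: always finite
dZ_L = A_L - Y
print(dZ_L)   # [[-1.   1.  -0.7  0.8]] -- perfectly well behaved
```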

In reality, the J value itself is really only useful as an inexpensive proxy for how well training is converging. You could also check that by computing the training accuracy periodically, but the code to do that is a bit more complicated (e.g. if you’re using regularization, you have to disable it when making the predictions to evaluate accuracy) and computationally more expensive.
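If you do want the reported J values to stay finite so that the cost curve is readable, one common trick (shown here only as a sketch, not the course's official code; the function name and epsilon value are just illustrative) is to clip the activations away from exactly 0 and 1 before taking the logs:

```python
import numpy as np

def compute_cost(A_L, Y, eps=1e-8):
    """Cross entropy cost with clipping so that log(0) never occurs.

    A_L -- output layer activations, shape (1, m)
    Y   -- true labels (0 or 1), shape (1, m)
    """
    m = Y.shape[1]
    A_safe = np.clip(A_L, eps, 1 - eps)   # keep values strictly inside (0, 1)
    return -np.sum(Y * np.log(A_safe) + (1 - Y) * np.log(1 - A_safe)) / m
```

Note that the clipping only affects the reported cost; the dZ^{[L]} gradient above is computed separately, so training itself is unchanged.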
