Cost function problem

I want to implement a neural network. My problem is that when I calculate the cost function, some of the terms come out as 0 * log(0). Do you know how I can handle this in Python?

This is an interesting question! Of course the \hat{y} values are the output of sigmoid, so they can never be exactly 0 or 1 mathematically. But here we are dealing with the finite limitations of floating point representations, not the abstract beauty of \mathbb{R}, so it can actually happen that the values “saturate” and end up rounding to be exactly 0 or 1.

There are several ways to handle that:

You can test your \hat{y} values for exact equality to 0 or 1 and then slightly perturb the values before the cost computation:

A[A == 0.] = 1e-10
A[A == 1.] = 1. - 1e-10
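
A variant of the same idea, offered only as a sketch and not as anything taken from the course notebook: np.clip bounds every value in one call, so outputs that are merely very close to 0 or 1 get the same treatment as exact ones. The names safe_cost and eps, and the (1, m) shapes, are assumptions made for illustration.

```python
import numpy as np

def safe_cost(A, Y, eps=1e-10):
    """Cross entropy cost with A clipped away from exact 0 and 1.

    A and Y are assumed to have shape (1, m); eps is an illustrative choice.
    """
    m = Y.shape[1]
    A = np.clip(A, eps, 1.0 - eps)  # returns a copy, the caller's A is untouched
    loss = -(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A))
    return np.sum(loss) / m

A = np.array([[0.0, 1.0, 0.5]])  # two saturated outputs and one ordinary one
Y = np.array([[1.0, 1.0, 0.0]])
cost = safe_cost(A, Y)
print(cost)  # finite, dominated by the badly wrong first sample
```

Clipping a copy keeps the original activations available for back propagation, which may or may not matter depending on how the rest of the code is organized.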

You can also use isneginf() and isnan() to replace any bad values after the fact, although that’s a bit more code since you need to catch the bad values while the cost is still the “loss” in vector form.

loss[np.isnan(loss) | np.isneginf(loss)] = 42.

You could replace the non-numeric values with 0., but the point is those cases represent a big error that should be punished pretty severely by the loss function. Of course the actual J value doesn’t really affect the gradients in any case: the derivatives are calculated separately.

You can look up the documentation for numpy isnan() and isneginf(). There are two cases to worry about:

If the \hat{y} is 1 and the y is 0, then you get 1 * -\infty for the (1 - Y) term, which is -\infty. But if \hat{y} is 1 and the y is 1, then you get 0 * -\infty for the (1 - Y) term and that is NaN (not a number). Of course you have the same cases in the opposite order for the Y term, when you hit \hat{y} = 0.
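
Both cases are easy to reproduce in NumPy. This is a minimal sketch with made-up values; np.errstate is used only to silence the warnings that NumPy would otherwise print for log(0) and 0 * inf.

```python
import numpy as np

A = np.array([1.0, 1.0])  # saturated sigmoid outputs
Y = np.array([0.0, 1.0])  # first sample is badly wrong, second is right

# Silence the divide-by-zero and invalid-value warnings for this demonstration.
with np.errstate(divide="ignore", invalid="ignore"):
    term = (1.0 - Y) * np.log(1.0 - A)  # the (1 - Y) term of the loss

print(term)  # first entry is -inf, second is nan

# Mask that catches both kinds of bad value.
bad = np.isnan(term) | np.isneginf(term)
print(bad)
```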

There is one further point worth making here:

Even if you don’t fix your cost function to handle this case, it does no harm to the actual back propagation process. It may not be a priori obvious, but if you take a careful look at the back propagation logic, you will see that the actual scalar J value is not used anywhere. All we need are the gradients (derivatives) of J w.r.t. the various parameters. Because of the nice way that the derivative of sigmoid and the derivative of cross entropy loss work together, the vector derivative at the output layer ends up being:

dZ^{[L]} = \displaystyle \frac {\partial L}{\partial Z^{[L]}} = A^{[L]} - Y

so you can see that nothing bad happens to the derivatives if any of the A^{[L]} values are exactly 0 or 1.
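
The claim can be checked numerically. The sketch below assumes the usual setup, A = sigmoid(Z) with cross entropy loss, and compares A - Y against a finite difference of the loss taken with respect to Z. The helper names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Ordinary values: A - Y matches a central finite difference of the loss in Z.
Z = np.array([-2.0, 0.5, 3.0])
Y = np.array([1.0, 0.0, 1.0])
h = 1e-6
numeric = (loss(Z + h, Y) - loss(Z - h, Y)) / (2 * h)
analytic = sigmoid(Z) - Y
print(np.max(np.abs(numeric - analytic)))  # tiny difference

# Saturated values: the same expression is still finite.
A_sat = np.array([0.0, 1.0])
Y_sat = np.array([1.0, 0.0])
grad_sat = A_sat - Y_sat
print(grad_sat)  # -1 and 1, both finite
```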

In reality, the J value itself is really only useful as an inexpensive proxy for how well your convergence is working. You could also check that by computing the training accuracy periodically, but the code to do that is a bit more complicated (e.g. if you’re using regularization, you have to disable that for the predictions to evaluate the accuracy) and computationally expensive.

Thanks for your answer. I used it, but my question is: why do you put 0 for -inf too? -inf happens when, for example, yhat = 0 and y_train = 1. I’m coding a logistic regression network with no hidden layer and I get a negative cost. Another problem, in the backward pass, is that db, dw, w and b are all updating but my accuracy does not get above 27%.

It’s a good point that maybe it would make more sense to use a big number for the -\infty case. That will happen automatically if you use the simpler method that I gave first, replacing exact 0 and 1 values in A with 1e-10 and 1. - 1e-10.

You can check that log(10^{-10}) \approx -23.

As to why your algorithm doesn’t work well, there are lots of things to check. If the cost is negative, that just means your code is wrong. The logs are negative, but each term is multiplied by -1 to take care of that, right?

For the 27% accuracy problem, maybe your back prop code is not correct or perhaps you’ve just chosen a problem for which Logistic Regression is simply not a good solution. Note that LR can only do linear separation between the “yes” and “no” answers and not all problems are that cleanly separable. One way to check whether it’s your code or your dataset would be to try your hand written code on a case where you know what the answers should be, e.g. the example in the notebook. Can you get the same results with the “cat/not cat” image dataset in this assignment if you use your new code?

I tested this. But I can’t use -1 for that because some of the costs are negative and some are positive. I normalize my data with this formula, and it seems to be OK:

X_train = (X_train - X_train.mean()) / (X_train.max() - X_train.min())

Would it be possible for you to look at my code on Colab and see whether there is something wrong with it?
The other problem I see with normalizing is that the accuracy drops to zero.

What do you mean some of your cost values are positive and some are negative? How you normalize the data should not have any effect on whether your cost values are positive or negative. Look at the formula again. The log of a number between 0 and 1 is always negative, right? So how could this number not always be positive:

-1 * \left( y^{(i)} * \log(\hat{y}^{(i)}) + (1 - y^{(i)}) * \log(1 - \hat{y}^{(i)}) \right)

I think this must mean that your sigmoid implementation is wrong. Check that the output is always between 0 and 1.

Did this work when you did it in the notebook? Have you compared your code to the code you wrote in the notebook?

The formula you mention is for the loss function. The cost function is

cost = (-1/m) * np.sum(loss)

This is my output for 20 epochs:

Cost after iteration 0: 756.351655
Cost after iteration 1: 138.983478
Cost after iteration 2: 52.309643
Cost after iteration 3: 31.594968
Cost after iteration 4: 21.663387
Cost after iteration 5: 15.557236
Cost after iteration 6: 11.262166
Cost after iteration 7: 7.971122
Cost after iteration 8: 5.295935
Cost after iteration 9: 3.026235
Cost after iteration 10: 1.037917
Cost after iteration 11: -0.747046
Cost after iteration 12: -2.380204
Cost after iteration 13: -3.897006
Cost after iteration 14: -5.322635
Cost after iteration 15: -6.675483
Cost after iteration 16: -7.969295
Cost after iteration 17: -9.214556
Cost after iteration 18: -10.419398
Cost after iteration 19: -11.590231
Cost after iteration 20: -12.732175

The formula I gave is the loss for one sample, right? That is the argument to np.sum in your version of the code.

You didn’t answer my question: how can that ever be negative?

Try adding this assertion after your forward propagation and before you compute the cost:

assert(np.all(A >= 0) and np.all(A <= 1))

I bet that fails in your code.

Yes. The negative values are not coming from the loss function; they are coming from the cost function.
I didn’t say that the loss function is negative. My code for the cost function is this.
Without normalizing:

{moderator edit - solution code removed}

output:
Cost after iteration 0: 368.093138
Cost after iteration 1: 144.517149
Cost after iteration 2: 91.156646
Cost after iteration 3: 63.049115
Cost after iteration 4: 45.477685
Cost after iteration 5: 33.594468
Cost after iteration 6: 25.128624
Cost after iteration 7: 18.837105
Cost after iteration 8: 13.983219
Cost after iteration 9: 10.109474
Cost after iteration 10: 6.922259
Cost after iteration 11: 4.227926
Cost after iteration 12: 1.895558
Cost after iteration 13: -0.165420
Cost after iteration 14: -2.019039
Cost after iteration 15: -3.711491
Cost after iteration 16: -5.276724
Cost after iteration 17: -6.740122
Cost after iteration 18: -8.120964
Cost after iteration 19: -9.434110

With normalizing:

{moderator edit - solution code removed}

output:
Cost after iteration 0: 736.414184
Cost after iteration 1: 183.849585
Cost after iteration 2: 60.068489
Cost after iteration 3: 34.541877
Cost after iteration 4: 23.181587
Cost after iteration 5: 16.458523
Cost after iteration 6: 11.838841
Cost after iteration 7: 8.354172
Cost after iteration 8: 5.552774
Cost after iteration 9: 3.195061
Cost after iteration 10: 1.141963
Cost after iteration 11: -0.692855
Cost after iteration 12: -2.365860
Cost after iteration 13: -3.915554
Cost after iteration 14: -5.369102
Cost after iteration 15: -6.746224
Cost after iteration 16: -8.061582
Cost after iteration 17: -9.326311
Cost after iteration 18: -10.549015
Cost after iteration 19: -11.736451

But when I normalize, I get 0 accuracy.

What about your A values? I bet they are wrong.

And why do you need X_train as an argument to the cost function? You can get m from the shape of Y_train, right?

I tested the code that you suggested and it didn’t report a problem.
This is my forward propagation code:

{moderator edit - solution code removed}

Yes, both of them give the same value. My X_train shape is (16, 384) and my y_train shape is (1, 384).

Hmmm, everything you show looks correct to me. So either that’s not the actual code you are running or the bug is in the higher level logic that invokes those lower level routines.

This is my whole code:

{moderator edit - solution code removed}

output:
Accuracy 0 22.656249999999996
Cost after iteration 0: 16.437509
Accuracy 1 24.739583333333332
Cost after iteration 1: -3.815853
Accuracy 2 25.520833333333332
Cost after iteration 2: -4.419817
Accuracy 3 27.083333333333332
Cost after iteration 3: -4.422150
Accuracy 4 27.34375
Cost after iteration 4: -4.280212
Accuracy 5 27.34375
Cost after iteration 5: -4.333344
Accuracy 6 27.34375
Cost after iteration 6: -4.331615
Accuracy 7 27.34375
Cost after iteration 7: -4.322271
Accuracy 8 27.34375
Cost after iteration 8: -4.366474
Accuracy 9 27.34375
Cost after iteration 9: -4.329811
Accuracy 10 27.34375
Cost after iteration 10: -4.357031
Accuracy 11 27.34375
Cost after iteration 11: -4.343117
Accuracy 12 27.34375
Cost after iteration 12: -4.361230
Accuracy 13 27.34375
Cost after iteration 13: -4.379249
Accuracy 14 27.604166666666664
Cost after iteration 14: -4.361586
Accuracy 15 27.604166666666664
Cost after iteration 15: -4.336942
Accuracy 16 27.604166666666664
Cost after iteration 16: -4.308309
Accuracy 17 27.604166666666664
Cost after iteration 17: -4.311706
Accuracy 18 27.604166666666664
Cost after iteration 18: -4.315046
Accuracy 19 27.604166666666664
Cost after iteration 19: -4.318340

And accuracy for test is:
Accuracy 0 21.874999999999996
Accuracy 1 21.874999999999996
Accuracy 2 21.874999999999996
Accuracy 3 21.874999999999996
Accuracy 4 20.833333333333332
Accuracy 5 20.833333333333332
Accuracy 6 20.833333333333332
Accuracy 7 20.833333333333332
Accuracy 8 19.791666666666664
Accuracy 9 19.791666666666664
Accuracy 10 22.916666666666664
Accuracy 11 23.958333333333332
Accuracy 12 23.958333333333332
Accuracy 13 22.916666666666664
Accuracy 14 23.958333333333332
Accuracy 15 26.041666666666664
Accuracy 16 26.041666666666664
Accuracy 17 26.041666666666664
Accuracy 18 26.041666666666664
Accuracy 19 27.083333333333332

And the shapes are like this:
X_train: (16, 384)
y_train: (1, 384)
X_val: (16, 96)
y_val: (1, 96)

Are you sure that all your y_train values are either 0. or 1.?

No. I just checked and the values are 0, 1 and 2.

Thank you for mentioning that. I replaced the 2 values with 1 and now I get 70% accuracy.

Great! The point here is that Logistic Regression is doing binary classification, which means that the labels can only be 0 or 1. Otherwise nothing makes sense (neither the sigmoid output nor the loss calculation).
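
To see concretely what a label of 2 does to the loss, here is a small sketch with made-up numbers. The function per_sample_loss and the guard on Y are hypothetical names for illustration, not code from this thread.

```python
import numpy as np

def per_sample_loss(a, y):
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# With a valid label the loss is positive.
print(per_sample_loss(0.9, 1.0))  # about 0.105

# With a label of 2 the same formula goes negative.
print(per_sample_loss(0.9, 2.0))  # about -2.09

# A hypothetical guard to run on the labels before training.
Y = np.array([[0, 1, 2, 1]])
labels_ok = np.isin(Y, [0, 1]).all()
print(labels_ok)  # False, because of the 2
```

With a label of 2, the (1 - y) factor becomes -1, so pushing the output toward 1 keeps lowering the loss without bound. That would be consistent with a cost that keeps falling below zero.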
