I want to implement a neural network. My problem is that when I calculate the cost function, some of my terms become 0 * log(0). Do you know how I can handle this in Python?

This is an interesting question! Of course the \hat{y} values are the output of sigmoid, so they can never be exactly 0 or 1 mathematically. But here we are dealing with the finite limitations of floating point representations, not the abstract beauty of \mathbb{R}, so it can actually happen that the values "saturate" and end up rounding to exactly 0 or 1.

There are several ways to handle that:

You can test your \hat{y} values for exact equality to 0 or 1 and then slightly perturb the values before the cost computation:

```
# Nudge any saturated sigmoid outputs away from exactly 0 and 1
A[A == 0.] = 1e-10
A[A == 1.] = 1. - 1e-10
```
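Equivalently, numpy's clip can do both replacements in one step, and it also catches any values that are smaller than the epsilon rather than exactly 0. A minimal sketch, using the same array name A:

```python
import numpy as np

eps = 1e-10
A = np.array([0., 0.3, 1.])        # sigmoid outputs, two of them saturated
A = np.clip(A, eps, 1. - eps)      # force every value into the open interval (0, 1)
```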

You can also use *isnan()* and *isinf()* to replace any bad values after the fact, although that's a bit more code, since you need to catch them while the cost is still the "loss" in vector form.

`loss[np.isnan(loss) | np.isneginf(loss)] = 42.`

You could replace the non-numeric values with 0., but the point is that those cases represent a big error that should be punished pretty severely by the loss function. Of course the actual J value doesn't really affect the gradients in any case: the derivatives are calculated separately.

You can look up the documentation for numpy *isnan()* and *isneginf()*. There are two cases to worry about:

If \hat{y} is 1 and y is 0, then the (1 - Y) term gives you 1 * log(0) = -\infty. But if \hat{y} is 1 and y is 1, then the (1 - Y) term gives you 0 * (-\infty), and that is NaN (not a number). You have the same two cases in the opposite order for the Y term, when you hit \hat{y} = 0.
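Both cases are easy to reproduce directly in numpy. A small demo (the seterr call just silences the warnings that numpy would otherwise print):

```python
import numpy as np

np.seterr(divide="ignore", invalid="ignore")  # silence the log(0) warnings for this demo

# yhat = 1, y = 0: the (1 - y) * log(1 - yhat) term is 1 * log(0) = -inf
term_inf = (1 - 0) * np.log(1 - 1.)

# yhat = 1, y = 1: the same term is 0 * log(0) = 0 * -inf = nan
term_nan = (1 - 1) * np.log(1 - 1.)
```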

There is one further point worth making here:

Even if you don't fix your cost function to handle this case, it does no harm to the actual back propagation process. It may not be *a priori* obvious, but if you take a careful look at the back propagation logic, you will see that the actual scalar J value is not used anywhere. All we need are the gradients (derivatives) of J w.r.t. the various parameters. Because of the nice way that the derivative of sigmoid and the derivative of cross entropy loss work together, the vector derivative at the output layer ends up being:

dZ^{[L]} = \displaystyle \frac {\partial L}{\partial Z^{[L]}} = A^{[L]} - Y

so you can see that nothing bad happens to the derivatives if any of the A^{[L]} values are exactly 0 or 1.
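A quick numeric check of that claim, treating A^{[L]} - Y as the combined sigmoid plus cross-entropy gradient at the output layer:

```python
import numpy as np

A = np.array([[0., 1., 0.7]])   # output activations, including saturated 0 and 1
Y = np.array([[1., 0., 1.]])    # labels
dZ = A - Y                      # gradient w.r.t. the output-layer pre-activation
# Every entry is finite, even where A is exactly 0 or 1
```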

In reality, the J value itself is really only useful as an inexpensive proxy for how well your convergence is working. You could also check that by computing the training accuracy periodically, but the code to do that is a bit more complicated (e.g. if you're using regularization, you have to disable that for the predictions to evaluate the accuracy) and computationally expensive.

Thanks for your answer. I used your approach, but my question is: why do you put 0 for -inf too? Because -inf happens when, for example, yhat = 0 and y_train = 1. Also, I'm coding a logistic neural network with no hidden layer, and I get a negative cost. Another problem in the backward part is that db, dw, w, and b are all updating, but my accuracy does not get above 27%.

It's a good point that maybe it would make more sense to use a big number for the -\infty case. That will happen automatically if you use the simpler method that I gave first:

You can check that log(10^{-10}) \approx -23.
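A one-line check in Python:

```python
import math

# ln(1e-10) = -10 * ln(10), which is about -23.03
print(math.log(1e-10))
```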

As to why your algorithm doesnâ€™t work well, there are lots of things to check. If the cost is negative, that just means your code is wrong. The logs are negative, but each term is multiplied by -1 to take care of that, right?

For the 27% accuracy problem, maybe your back prop code is not correct, or perhaps you've just chosen a problem for which Logistic Regression is simply not a good solution. Note that LR can only do linear separation between the "yes" and "no" answers, and not all problems are that cleanly separable. One way to check whether it's your code or your dataset would be to try your hand-written code on a case where you know what the answers should be, e.g. the example in the notebook. Can you get the same results with the "cat/not cat" image dataset in this assignment if you use your new code?

I tested this, but I can't use -1 for that, because some of the costs are negative and some are positive. I normalize my data with this formula, and it seems to be OK:

```
# Mean normalization: center on the mean, scale by the range
X_train = (X_train - X_train.mean()) / (X_train.max() - X_train.min())
```

Is it possible for you to look at my code on Colab and see if there is something wrong with it?

The other problem I see with normalizing is that the accuracy goes to zero.

What do you mean some of your cost values are positive and some are negative? How you normalize the data should not have any effect on whether your cost values are positive or negative. Look at the formula again. The log of a number between 0 and 1 is always negative, right? So how could this number not always be positive:

-1 * \left( y^{[i]} * log(\hat{y}^{[i]}) + (1 - y^{[i]}) * log(1 - \hat{y}^{[i]}) \right)

I think this must mean that your *sigmoid* implementation is wrong. Check that the output is always between 0 and 1.
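For reference, here is one numerically stable way to write sigmoid. This is a sketch, not the course's reference implementation, but its output is guaranteed to stay in [0, 1]:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: output is always in [0, 1]."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    # For non-negative z, exp(-z) cannot overflow
    out[pos] = 1. / (1. + np.exp(-z[pos]))
    # For negative z, rewrite as exp(z) / (1 + exp(z)) so exp never overflows
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1. + ez)
    return out

A = sigmoid(np.array([-1000., 0., 1000.]))
```

The split on the sign avoids ever computing exp of a large positive number, which would overflow to inf.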

Did this work when you did it in the notebook? Have you compared your code to the code you wrote in the notebook?

The formula you mention is the loss function. The cost function is

```
cost = (-1/m) * np.sum(loss)
```
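Putting the pieces together, a complete cost function might look like this. This is just a sketch: the compute_cost name is my own, and the np.clip guard is the epsilon fix discussed earlier in the thread:

```python
import numpy as np

def compute_cost(A, Y):
    """Cross-entropy cost for binary labels; A and Y have shape (1, m)."""
    m = Y.shape[1]
    eps = 1e-10                       # guard against log(0), as discussed above
    A = np.clip(A, eps, 1. - eps)
    loss = Y * np.log(A) + (1 - Y) * np.log(1 - A)
    return (-1. / m) * np.sum(loss)

cost = compute_cost(np.array([[0.9, 0.1, 0.8]]), np.array([[1., 0., 1.]]))
# cost is about 0.145, and with valid inputs it can never be negative
```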

This is my output for 20 epochs:

```
Cost after iteration 0: 756.351655
Cost after iteration 1: 138.983478
Cost after iteration 2: 52.309643
Cost after iteration 3: 31.594968
Cost after iteration 4: 21.663387
Cost after iteration 5: 15.557236
Cost after iteration 6: 11.262166
Cost after iteration 7: 7.971122
Cost after iteration 8: 5.295935
Cost after iteration 9: 3.026235
Cost after iteration 10: 1.037917
Cost after iteration 11: -0.747046
Cost after iteration 12: -2.380204
Cost after iteration 13: -3.897006
Cost after iteration 14: -5.322635
Cost after iteration 15: -6.675483
Cost after iteration 16: -7.969295
Cost after iteration 17: -9.214556
Cost after iteration 18: -10.419398
Cost after iteration 19: -11.590231
Cost after iteration 20: -12.732175
```

The formula I gave is the loss for one sample, right? It is the argument to np.sum in your version of the code.

You didnâ€™t answer my question: how can that ever be negative?

Try adding this assertion after your forward propagation and before you compute the cost:

`assert(np.all(A >= 0) and np.all(A <= 1))`

I bet that fails in your code.

Yes, the problem of getting a negative value is not with the loss function; it's with the cost function. I didn't say that the loss function is negative. My code for the cost function is this.

Without normalizing:

*{moderator edit - solution code removed}*

output:

```
Cost after iteration 0: 368.093138
Cost after iteration 1: 144.517149
Cost after iteration 2: 91.156646
Cost after iteration 3: 63.049115
Cost after iteration 4: 45.477685
Cost after iteration 5: 33.594468
Cost after iteration 6: 25.128624
Cost after iteration 7: 18.837105
Cost after iteration 8: 13.983219
Cost after iteration 9: 10.109474
Cost after iteration 10: 6.922259
Cost after iteration 11: 4.227926
Cost after iteration 12: 1.895558
Cost after iteration 13: -0.165420
Cost after iteration 14: -2.019039
Cost after iteration 15: -3.711491
Cost after iteration 16: -5.276724
Cost after iteration 17: -6.740122
Cost after iteration 18: -8.120964
Cost after iteration 19: -9.434110
```

With normalizing:

*{moderator edit - solution code removed}*

output:

```
Cost after iteration 0: 736.414184
Cost after iteration 1: 183.849585
Cost after iteration 2: 60.068489
Cost after iteration 3: 34.541877
Cost after iteration 4: 23.181587
Cost after iteration 5: 16.458523
Cost after iteration 6: 11.838841
Cost after iteration 7: 8.354172
Cost after iteration 8: 5.552774
Cost after iteration 9: 3.195061
Cost after iteration 10: 1.141963
Cost after iteration 11: -0.692855
Cost after iteration 12: -2.365860
Cost after iteration 13: -3.915554
Cost after iteration 14: -5.369102
Cost after iteration 15: -6.746224
Cost after iteration 16: -8.061582
Cost after iteration 17: -9.326311
Cost after iteration 18: -10.549015
Cost after iteration 19: -11.736451
```

But when I normalize, I get 0 accuracy.

What about your A values? I bet they are wrong.

And why do you need *X_train* as an argument to the cost function? You can get m from the shape of *Y_train*, right?

I tested the code you suggested; it didn't show a problem.

This is my forward propagation code:

*{moderator edit - solution code removed}*

Yes, both of them have the same value. My X_train shape is (16, 384), and my y_train is (1, 384).

Hmmm, everything you show looks correct to me. So either that's not the actual code you are running, or the bug is in the higher level logic that invokes those lower level routines.

This is my whole code:

*{moderator edit - solution code removed}*

output:

```
Accuracy 0 22.656249999999996
Cost after iteration 0: 16.437509
Accuracy 1 24.739583333333332
Cost after iteration 1: -3.815853
Accuracy 2 25.520833333333332
Cost after iteration 2: -4.419817
Accuracy 3 27.083333333333332
Cost after iteration 3: -4.422150
Accuracy 4 27.34375
Cost after iteration 4: -4.280212
Accuracy 5 27.34375
Cost after iteration 5: -4.333344
Accuracy 6 27.34375
Cost after iteration 6: -4.331615
Accuracy 7 27.34375
Cost after iteration 7: -4.322271
Accuracy 8 27.34375
Cost after iteration 8: -4.366474
Accuracy 9 27.34375
Cost after iteration 9: -4.329811
Accuracy 10 27.34375
Cost after iteration 10: -4.357031
Accuracy 11 27.34375
Cost after iteration 11: -4.343117
Accuracy 12 27.34375
Cost after iteration 12: -4.361230
Accuracy 13 27.34375
Cost after iteration 13: -4.379249
Accuracy 14 27.604166666666664
Cost after iteration 14: -4.361586
Accuracy 15 27.604166666666664
Cost after iteration 15: -4.336942
Accuracy 16 27.604166666666664
Cost after iteration 16: -4.308309
Accuracy 17 27.604166666666664
Cost after iteration 17: -4.311706
Accuracy 18 27.604166666666664
Cost after iteration 18: -4.315046
Accuracy 19 27.604166666666664
Cost after iteration 19: -4.318340
```

And accuracy for test is:

```
Accuracy 0 21.874999999999996
Accuracy 1 21.874999999999996
Accuracy 2 21.874999999999996
Accuracy 3 21.874999999999996
Accuracy 4 20.833333333333332
Accuracy 5 20.833333333333332
Accuracy 6 20.833333333333332
Accuracy 7 20.833333333333332
Accuracy 8 19.791666666666664
Accuracy 9 19.791666666666664
Accuracy 10 22.916666666666664
Accuracy 11 23.958333333333332
Accuracy 12 23.958333333333332
Accuracy 13 22.916666666666664
Accuracy 14 23.958333333333332
Accuracy 15 26.041666666666664
Accuracy 16 26.041666666666664
Accuracy 17 26.041666666666664
Accuracy 18 26.041666666666664
Accuracy 19 27.083333333333332
```

And the shapes are like this:

```
X_train: (16, 384)
y_train: (1, 384)
X_val: (16, 96)
y_val: (1, 96)
```

Are you sure that all your *y_train* values are either 0. or 1.?

No. I checked just now, and the values are 0, 1, and 2.

Thank you for mentioning that. I replaced the 2 values with 1, and now I get 70% accuracy.

Great! The point here is that Logistic Regression does binary classification, which means the labels can only be 0 or 1. Otherwise nothing makes sense (neither the sigmoid output nor the loss calculation).
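As a footnote, a cheap way to catch this kind of label problem before training is to inspect the distinct label values. A minimal sketch (the y_train values here are hypothetical):

```python
import numpy as np

y_train = np.array([[0, 1, 2, 1, 0]])        # hypothetical labels with a stray 2

labels = set(np.unique(y_train).tolist())
if not labels <= {0, 1}:
    print("not binary labels:", sorted(labels))   # flags the stray 2
```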