Backpropagation of the last sigmoid layer

I have a question. When we define dA[L] manually using the log loss, we write dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)), where AL is the output of the last layer (y_hat). Wouldn't it be better to add an epsilon (1e-10, for example) to AL to avoid division by zero if AL exactly equals 1, for better numerical stability? Or am I missing something?
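To make the idea concrete, here is a minimal sketch of the epsilon approach. The helper name dAL_log_loss and the placement of eps inside the denominators are my own illustrative choices, not how the course lab's imported functions handle it:

```python
import numpy as np

def dAL_log_loss(AL, Y, eps=1e-10):
    """Gradient of the log loss w.r.t. the final sigmoid activation AL.

    eps is added inside the denominators so that AL values that have
    rounded to exactly 0.0 or 1.0 in floating point do not trigger a
    division by zero. For non-saturated AL the extra term is negligible.
    """
    return -(np.divide(Y, AL + eps) - np.divide(1 - Y, 1 - AL + eps))

# Example: the middle prediction has saturated to exactly 1.0
AL = np.array([[0.2, 1.0, 0.7]])
Y  = np.array([[0.0, 1.0, 1.0]])
print(dAL_log_loss(AL, Y))   # finite values, no divide-by-zero warning
```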

In theory you are correct.

In practice, AL will not reach exactly 0 or 1 except in extreme limits. With trained weight values, this is unlikely to occur.

So this is not an issue you need to worry about very much.


When I tried to implement dropout regularization using the data given in the lab for Week 1 of Course 2, I got an error saying I was trying to divide by zero. When I debugged it, in both my own code and the lab, I saw that AL really does contain 1s. I don't know how this was handled in the lab's imported functions, but when I added epsilon I got the same results. So I'm not sure: was that luck, or was it correct?

[[9.99992040e-01 1.00000000e+00 9.99999988e-01 9.99999979e-01
9.99999983e-01 9.99521204e-01 9.65090199e-01 9.77719975e-01
9.99999351e-01 9.99999987e-01 1.00000000e+00 1.00000000e+00
9.99999998e-01 9.99999992e-01 9.98904808e-01 1.00000000e+00

This is a sample I saw in the lab.

You are not incorrect.

Sorry, but I cannot explore this dataset at the moment. Perhaps another mentor will be able to run some tests and reply here.


Yes, you can run into problems if the sigmoid values round to exactly 0 or exactly 1. Of course mathematically, they would never exactly equal 0 or 1, but in floating point you can run out of resolution.
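As a quick illustration of that resolution limit (a minimal sketch; the pre-activation value 40 is just chosen to show float64 behavior, it is not taken from the lab), the sigmoid of a moderately large input already rounds to exactly 1.0 in double precision, and clipping is one common way to keep the denominators finite:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = sigmoid(40.0)
print(a == 1.0)   # True: exp(-40) ~ 4e-18 is below float64 resolution near 1.0

# One common workaround: clip the activations away from the exact endpoints
AL = np.array([[sigmoid(40.0), sigmoid(-40.0), 0.7]])
AL_safe = np.clip(AL, 1e-10, 1 - 1e-10)
print(np.any(AL_safe == 1.0), np.any(AL_safe == 0.0))   # False False
```

With AL_safe in place of AL, both Y/AL and (1 - Y)/(1 - AL) stay finite.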

Here’s a thread which discusses the strategy you mentioned of perturbing the values slightly to make sure you avoid the exact 0 or 1 cases.


I get it now: it’s all about floating-point approximation, which can round the values to exactly 1 or 0 in the computer’s representation.
Thank you very much for this clarification.
