I have a question. When we define dA[L] manually using the log loss, we write dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)), where AL is the output of the last layer (y_hat). Wouldn't it be better to add a small epsilon (1e-10, for example) to AL to avoid division by zero when AL is exactly 1, for better numerical stability? Or am I missing something?
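For concreteness, here is a minimal sketch of what I mean (the helper name and the eps value are just illustrative, not from the course code):

import numpy as np

def init_backprop(AL, Y, eps=1e-10):
    # Hypothetical helper: compute dAL for the log (cross-entropy) loss,
    # adding a small eps to each denominator so that AL values that have
    # rounded to exactly 0 or 1 do not cause a division by zero.
    dAL = -(np.divide(Y, AL + eps) - np.divide(1 - Y, 1 - AL + eps))
    return dAL

# Example: one AL entry has rounded to exactly 1.0
AL = np.array([[0.3, 1.0, 0.999999]])
Y  = np.array([[0.0, 1.0, 1.0]])
print(init_backprop(AL, Y))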
In theory you are correct.
In practice, AL will not reach exactly 0 or 1 except in extreme cases. With trained weight values, this is unlikely to occur.
So this is not an issue you need to worry about very much.
When I tried to implement dropout regularization using the data given in the lab for Week 1 of Course 2, I got a divide-by-zero warning. When I debugged both my code and the lab, I saw that AL really does contain 1s. I don't know how this was handled in the lab's imported functions, but when I added epsilon I got the same results, so I'm not sure whether that was luck or whether it was the correct fix.
[[9.99992040e-01 1.00000000e+00 9.99999988e-01 9.99999979e-01
9.99999983e-01 9.99521204e-01 9.65090199e-01 9.77719975e-01
9.99999351e-01 9.99999987e-01 1.00000000e+00 1.00000000e+00
9.99999998e-01 9.99999992e-01 9.98904808e-01 1.00000000e+00
This is a sample I saw in the lab.
You are not incorrect.
Sorry, but I cannot explore this dataset at the moment. Perhaps another mentor will be able to run some tests and reply here.
Yes, you can run into problems if the sigmoid values round to exactly 0 or exactly 1. Of course mathematically, they would never exactly equal 0 or 1, but in floating point you can run out of resolution.
Here’s a thread that discusses the strategy you mentioned of perturbing the values slightly to make sure you avoid the exactly-0 or exactly-1 cases.
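As one possible illustration of that perturbation idea (this is a sketch, not the exact code used in the lab), you can clip the sigmoid outputs away from 0 and 1 before they are used in the loss or its gradient:

import numpy as np

def stabilize_probs(AL, eps=1e-10):
    # Pull predicted probabilities away from exactly 0 and exactly 1,
    # so that log(AL), log(1 - AL), and the divisions in dAL stay finite.
    return np.clip(AL, eps, 1 - eps)

AL = np.array([[9.99992040e-01, 1.00000000e+00, 0.0]])
print(stabilize_probs(AL))  # the 1.0 becomes 1 - 1e-10, the 0.0 becomes 1e-10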
I get it now: it’s all about floating-point approximation, which can round the values to exactly 1 or 0 in the computer’s representation.
Thank you very much for this clarification.