Getting wrong output for dZ2 using the backward test case

The function L_model_backward_test in Wk4A1 uses the following data:
A2 or AL = array([[1.78862847, 0.43650985]])
Y = array([[1, 0]])
(Please refer to this thread to avoid confusion in variable naming in the mentioned function: Why is there no A0 or X in the backward chain? Week4, Assignment1)

The correct code (one that passes the test case) returns the following value for dZ2 or dZL:
array([[-0.1261204, 0.42987761]]) [1st]

However, the Wk3 lecture on Backpropagation Intuition (optional) suggests the following equation to calculate dZ2 or dZL:

dZ^{[2]} = A^{[2]} - Y

implying that dZ2 should be:
array([[0.78862847, 0.43650985]]) [2nd]

Which of these two values is the correct value for dZ2? I am using the [2nd] value in my code, which fails the mentioned test case and also causes model-convergence issues, whereas the [1st] value works fine, makes the model converge, and gives the required 80% test accuracy.


Where do you manually need to compute dZ2? In L_model_backward, you manually compute dAL. Then you pass that to linear_activation_backward for the output layer.
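That flow can be sketched as follows. This is a minimal stand-in, not the assignment's actual helper: the relu branch is omitted, and the cache layout ((A_prev, W, b), Z) plus the cross-entropy dAL formula are assumptions about how the course code is organized. The numbers are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simplified stand-in for the assignment's helper (relu branch omitted;
# cache layout assumed as ((A_prev, W, b), Z)).
def linear_activation_backward(dA, cache, activation):
    (A_prev, W, b), Z = cache
    if activation == "sigmoid":
        s = sigmoid(Z)
        dZ = dA * s * (1 - s)            # general formula: dA * g'(Z)
    m = A_prev.shape[1]
    dW = (1.0 / m) * dZ @ A_prev.T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

# Output-layer step of L_model_backward: compute dAL manually, then hand off.
AL = np.array([[0.8, 0.4]])              # made-up activations in (0, 1)
Y = np.array([[1, 0]])
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

A1 = np.random.randn(3, 2)               # made-up cache contents
W2 = np.random.randn(1, 3)
b2 = np.random.randn(1, 1)
Z2 = np.random.randn(1, 2)
dA1, dW2, db2 = linear_activation_backward(dAL, ((A1, W2, b2), Z2), "sigmoid")
```

The point is only the call pattern: dAL is computed once in L_model_backward, and everything after that goes through the general-purpose helper.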

I am writing my code slightly differently from what is mentioned in the assignment. I skip the dA2 or dAL calculation and directly calculate dZ2 or dZL from the equation given by Prof. Ng:

dZ2 = A2 - Y

Now given that

A2 = [1.78862847, 0.43650985]
Y = [1, 0]

dZ2 = [0.78862847, 0.43650985] (simply subtracting 1 from 1.78862847 to get 0.78862847)
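For reference, the shortcut itself is standard algebra: substituting the sigmoid derivative into the general chain-rule formula (assuming cross-entropy loss) gives

```latex
\begin{aligned}
dZ^{[2]} &= dA^{[2]} \cdot g'(Z^{[2]})
          = -\left(\frac{Y}{A^{[2]}} - \frac{1-Y}{1-A^{[2]}}\right) \cdot A^{[2]}\left(1-A^{[2]}\right) \\
         &= -\left(Y\left(1-A^{[2]}\right) - (1-Y)\,A^{[2]}\right)
          = A^{[2]} - Y
\end{aligned}
```

where g'(Z^{[2]}) = A^{[2]}(1 - A^{[2]}) uses the sigmoid identity g'(Z) = g(Z)(1 - g(Z)).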

But this dZ2 leads to different values for dW2 and db2 than what is expected in the test case, and thus it fails the test case.

Now, I need to understand why the above equation does not work.

If you want to see the value of dZ2 returned by linear_activation_backward, just add a print statement in the elif ‘sigmoid’ clause in that function and you should see the following value for dZ2: [-0.1261204, 0.42987761]
It’s unclear to me what is causing the difference in the values, since dA2 * sigmoid_derivative is equal to A2 - Y (or is it not?).
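A quick numeric check (a sketch, not the course code; the made-up values of Z and the cross-entropy dAL formula are assumptions) confirms that the identity does hold, provided A really is sigmoid(Z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# When A really is sigmoid(Z), the identity dA * g'(Z) = A - Y holds.
Z = np.array([[0.5, -1.2]])              # made-up pre-activations
A = sigmoid(Z)                           # A = g(Z) by construction
Y = np.array([[1, 0]])

dA = -(np.divide(Y, A) - np.divide(1 - Y, 1 - A))  # cross-entropy dAL
dZ_general = dA * A * (1 - A)            # general formula: dA * g'(Z)
dZ_shortcut = A - Y                      # output-layer shortcut

print(np.allclose(dZ_general, dZ_shortcut))  # True
```

So the two formulas agree whenever the cached Z and the AL that dA was computed from are consistent with each other.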

But dA2 or dAL is a required input to the call to linear_activation_backward for the output layer. So your “rewrite” means not calling linear_activation_backward for the output layer? That is the point at which the dZ value for layer 2 would be computed, since this test case has only two layers, right?

Ah, ok, I tried some instrumentation and now I realize the issue:

Prof Ng’s formulas are just fine. The problem is that if you use his formulas at the output level and skip calling linear_activation_backward, then you hit the problem that the test case input values are just randomly generated. That means that they don’t satisfy all the same mathematical relationships that the real values generated by forward propagation would satisfy.

What happens when you call linear_activation_backward is that it is a general function that works for any layer, so it has to use the general formula for dZ^{[l]}:

dZ^{[l]} = dA^{[l]} \cdot g^{[l]'}(Z^{[l]})

But what you are doing by using the formula:

dZ^{[2]} = A^{[2]} - Y

is that you’ve “short-circuited” the calculation shown above in the general case with the special simplifications you get at the output layer. At the output layer, the derivative of the activation function is:

g'(Z) = g(Z) * (1 - g(Z)) = A * (1 - A)

because the activation is sigmoid. But if you look at the way that the test inputs are generated it is this (as we discussed on that other thread):

def L_model_backward_test(target):
    AL = np.random.randn(1, 2)
    Y = np.array([[1, 0]])

    A1 = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    Z1 = np.random.randn(3,2)
    linear_cache_activation_1 = ((A1, W1, b1), Z1)

    A2 = np.random.randn(3,2)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    Z2 = np.random.randn(1,2)
    linear_cache_activation_2 = ((A2, W2, b2), Z2)

    caches = (linear_cache_activation_1, linear_cache_activation_2)

Since AL and Z2 are just random values, it is not the case that sigmoid(Z2) = AL. You can clearly see that, since the values of AL are not between 0 and 1, which any output of sigmoid would be.
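You can reproduce the mismatch directly. The sketch below mimics the test's setup with independently random AL and Z2 (the seed is chosen here for reproducibility; the course test uses its own seeding, so the numbers will differ from the grader's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)                  # arbitrary seed for reproducibility
AL = np.random.randn(1, 2)         # random, NOT a sigmoid output
Z2 = np.random.randn(1, 2)         # random, unrelated to AL
Y = np.array([[1, 0]])

dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

s = sigmoid(Z2)                    # what linear_activation_backward uses
dZ_general = dAL * s * (1 - s)     # general formula: dA * g'(Z)
dZ_shortcut = AL - Y               # output-layer shortcut

print(np.allclose(dZ_general, dZ_shortcut))  # False: sigmoid(Z2) != AL
```

With real forward-propagation values, AL would equal sigmoid(Z2) and the two results would coincide; with the test's random inputs they cannot.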

So even if your code is correct, it may not pass the test cases unless you write the code using the general formulas that do not take into account any special relationships that are particular to the output layer.

There have been other instances of this, e.g. this one from Week 3.