Why is there no A0 or X in the backward chain? Week 4, Assignment 1

The Week 3 video titled “Backpropagation Intuition (Optional)” (https://www.coursera.org/learn/neural-networks-deep-learning/lecture/6dDj7/backpropagation-intuition-optional#) (timestamp 14:21) suggests that to calculate dW1 we definitely need A0 or X (refer to attached screenshot).

However, the L_model_backward_test function in public_tests.py for Wk4A1 expects correct output for dW1 based on just AL, Y, A2, W2, b2, Z2, A1, W1, b1, Z1 (refer to attached screenshot).

So the question is: Why does it not expect A0 or X as an input for calculation of dW1? Can we calculate dW1 even if we are not given X?

Thanks

1 Like

Hi @Sureto

X is not explicitly required as an input to the backward function if the forward propagation steps already used X to compute activations like A1. So dW1 can be calculated using the chain rule of backpropagation with the given intermediate activations.
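
Concretely, for layer 1 the chain rule gives the same formula as equation (8) in the assignment (written here in the thread's numpy style; A0 is just another name for X, which the forward pass has already computed and stored):

    dW1 = np.dot(dZ1, A0.T) / m    # A0 = X, saved during forward propagation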

Hope it helps!

2 Likes

In back propagation, everything happens in the opposite direction as forward propagation. As Alireza points out, the values of X are already “baked into” all the activations of the network through forward propagation, so it’s not that X is ignored or that it has no effect on what happens. It’s just that we don’t directly need it to compute any of the derivatives we need for the gradients of the various W and b values.

One other point to notice is that back propagation actually produces a gradient dA0, but we just discard it. The point is that we can’t modify the X values: they are the training data. So that gradient is just discarded. It’s been a long time since I have watched all these lectures, but I think I remember that Prof Ng makes some comment about that at some point in the Week 4 lectures. At each layer in back prop, the output includes dW^{[l]}, db^{[l]} and dA^{[l-1]}. It’s the dA^{[l-1]} that will be the input for the next (previous) layer gradient calculations. So when we do the last step and run that for l = 1 because we need dW^{[1]} and db^{[1]}, we get dA^{[0]} as a side effect, but it is not used.
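
In code terms, here is a minimal sketch of one backward step, using names similar to the assignment's linear_backward (the exact signatures in your notebook may differ slightly):

    import numpy as np

    def linear_backward(dZ, linear_cache):
        # linear_cache holds (A_prev, W, b), saved during forward propagation
        A_prev, W, b = linear_cache
        m = A_prev.shape[1]
        dW = np.dot(dZ, A_prev.T) / m               # gradient for this layer's weights
        db = np.sum(dZ, axis=1, keepdims=True) / m  # gradient for this layer's bias
        dA_prev = np.dot(W.T, dZ)                   # passed back to the previous layer;
                                                    # at layer 1 this is dA0, which is discarded
        return dA_prev, dW, db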

4 Likes

Yes, I like Paul’s explanation. I was looking back at this myself and trying to understand it, and he has put it more clearly than I could. The key thing is that in backprop, at the last step, we are obviously not altering or updating our input data, nor at that stage are we directly referring to it.

At that point, it is the ‘ghost in the machine’.

2 Likes

Hey, thanks for the reply.

Please refer to the following equation (from the attached screenshot):

dW^{[1]} = \frac{1}{m} dZ^{[1]} X^{T}

I have already calculated dZ1 and m in my code and now I want to calculate dW1. For that calculation, the equation calls for multiplication by the transpose of X, so I can write:

    dW1 = np.dot(dZ1, X.T) / m

But X doesn't exist in the cache for the test case.


Again, going by the equations given in Assignment 1 (see the screenshot above), we can clearly see from equation (8) that the computation of dW1 requires m, dZ1 and the transpose of A0. So the question still remains: where does A0 come from?

Is the test case somehow able to test the backward code without even using A0 in its cache?

Or is there an alternative equation, different from the one mentioned above, being used to calculate dW1?

Note that I understand dA0 isn't important and is discarded, because we are not updating the input data. But the main point is about the availability of A0 for the computation of dW1.

1 Like

Sorry, I obviously did not read your original post carefully enough. Well, we build everything in layers here. The cache value produced by forward prop at each layer looks like this:

((A, W, b), Z)

So it’s a 2-tuple, the first element of which is a 3-tuple (the “linear cache”) and the second element is the single value that is the “activation cache”. We wrote the code during forward propagation that creates that cache. Go back and take a look at how that logic works to refresh your memory. What you will find is that the linear cache at layer l is actually the values (A^{[l-1]}, W^{[l]}, b^{[l]}), which were the input parameters passed to linear_forward. So for l = 1, you are getting A^{[0]} = X.
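
For reference, here is a simplified sketch of how that cache gets built on the forward pass (the real notebook code uses separate sigmoid/relu helpers, which I have inlined here, but the stored tuple has the same structure):

    import numpy as np

    def linear_forward(A_prev, W, b):
        Z = np.dot(W, A_prev) + b
        linear_cache = (A_prev, W, b)      # A_prev is stored right here
        return Z, linear_cache

    def linear_activation_forward(A_prev, W, b, activation):
        Z, linear_cache = linear_forward(A_prev, W, b)
        A = np.maximum(0, Z) if activation == "relu" else 1 / (1 + np.exp(-Z))
        cache = (linear_cache, Z)          # ((A_prev, W, b), Z) -- the 2-tuple described above
        return A, cache

    # In L_model_forward, the very first call passes A_prev = X, so the
    # layer-1 cache ends up as ((X, W1, b1), Z1), i.e. it contains A0 = X.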

2 Likes

Yes, thank you. I now understand that linear_forward receives A0 (passed on as A_prev by linear_activation_forward in the assignment code) and stores it in the cache it returns. The function linear_activation_forward, in turn, receives A_prev from L_model_forward, which passes X as the first A_prev value.

However, it is still unclear to me what value of X is being used for the test case. I expected Python code like the following:

    np.random.seed(3)
    X = np.random.randn(3, 2)
I am trying to test my code using the test cases provided in the assignment. The L_model_backward_test function in public_tests.py expects correct values for, say, dW1 (refer to the following image).

However, I don't know what value of X is being used for that expectation, since the test case doesn't define it and put it in the cache (refer to the following image).

Is it possible to provide the random seed being used to initialize X?
Thanks

1 Like

Hello, @Sureto,

If you scroll a bit further down, you will see:

[screenshot of one of the test cases]

which is part of a list of 3 test cases. Each case has a key "input" that tells you the inputs to the L_model_backward function. Those inputs are defined in the body of L_model_backward_test().

To be specific, the following three variables are used:
[screenshot of the three variable definitions]

Cheers,
Raymond

1 Like

Hey Raymond, thanks for the reply.

But I still don't see where X is initialized in those test cases. X is essential to compute dW1, and only then is a comparison with expected_dW1 possible.

The only variables initialized and used in the function are:
AL
Y
A1, W1, b1, Z1, i.e. linear_cache_activation_1
A2, W2, b2, Z2, i.e. linear_cache_activation_2

Thanks
Sureto

1 Like

The cache values are created directly in the test. You don't have to actually use the variable name X in order to supply a value in the A position of the linear cache entry for layer 1. You can call it Fred or Barney if you want. It's just a value. These test cases are purely synthetic, right?

But you are right that they are a bit sloppy there in the names. Here are the relevant lines that create the cache entry for layer 1:

    A1 = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    Z1 = np.random.randn(3,2)
    linear_cache_activation_1 = ((A1, W1, b1), Z1)

So they called it A1 when it's really “playing the role” of X or A0, but as I pointed out above, the name is irrelevant. It's just a synthetic value. The L_model_backward code just gets the caches as a Python list of tuples; it doesn't care how you generated the values, and it literally doesn't know that A1, rather than A0, was the name used for that component when constructing the test case.
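
To make that concrete, here is a small sketch continuing from the test lines quoted above (assuming numpy is imported; the real test also builds and appends a layer-2 cache the same way, which is omitted here):

    # The test bundles the per-layer caches into a plain Python list before
    # handing them to L_model_backward:
    caches = [linear_cache_activation_1]

    # Inside the backward pass, the layer-1 cache is unpacked purely by
    # position, so whatever array sits in the first slot plays the role of
    # A0 = X -- the name "A1" used to build it never reaches this code:
    (A_prev, W, b), Z = caches[0]
    m = A_prev.shape[1]
    dZ1 = np.random.randn(3, 2)            # stand-in for the real dZ1
    dW1 = np.dot(dZ1, A_prev.T) / m        # the same dW1 = (1/m) * dZ1 . A0^T formula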

The variable names really are irrelevant from a technical perspective, but when you are writing “real” code that is going to live on and be used and maintained, you want the code to be clear. So you'd never use the wrong names in code you cared about. Since students may look at the test code, as you have, I will file a bug and suggest that they use more appropriate names there. But from a purely functional standpoint it doesn't matter, as discussed above.

2 Likes

Yep, it looks like the variable names are mismatched.

A1 is actually A0 (i.e. X)
A2 is actually A1

Thanks @paulinpaloalto for helping with it.

1 Like