Are we caching Z for backprop only for RELU?

In back propagation we calculate dW and db, and in the examples we see in this course we use the previous layer’s A values to do it, so… it is wise for us to cache them when we first compute them (so that we don’t need to calculate them again).
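Roughly what I have in mind (just a sketch of the formulas, with a name I made up, not the actual course code):

```python
import numpy as np

# Backprop for one layer, assuming dZ for this layer is already known and
# A_prev (the previous layer's activations) was cached during forward prop.
def layer_grads(dZ, A_prev, W):
    m = A_prev.shape[1]                               # number of examples
    dW = (1 / m) * np.dot(dZ, A_prev.T)               # needs the cached A_prev
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)                         # passed back to the previous layer
    return dA_prev, dW, db
```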

However, Andrew Ng often talks about caching Z instead of A.

Is this because Z is used for the calculation when RELU (leaky or not) is the activation function, and he has that in mind? Or is there another reason?

I can’t see why I need Z cached when I use sigmoid and tanh.

Hey @Tal_Alon,
You will find that both Z and A are cached for back-propagation purposes. Let's consider DLS C1 W4 A1 to understand this better.

In this assignment, first check out the linear_forward function. Here, you can find that A is stored in cache alongside W and b. Now, check out the linear_activation_forward function. In this function, you can see that the relu and sigmoid functions are used. You can check out their implementations in the dnn_utils.py file, and if you do so, you will find that Z is stored in cache. Now, coming back to the linear_activation_forward function, you can see that the former one (containing A, W and b) is the linear_cache and the latter one (containing Z) is the activation_cache, both of which are stored as a tuple in cache. So, here, we can see that during forward propagation both A and Z are stored in cache. Now, let's see where they are used in backward propagation.
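In simplified form (this is just a sketch of the caching pattern, not the exact assignment code), the forward side looks something like this:

```python
import numpy as np

def linear_forward(A_prev, W, b):
    Z = np.dot(W, A_prev) + b
    linear_cache = (A_prev, W, b)        # A of the previous layer, plus W and b
    return Z, linear_cache

def relu(Z):
    A = np.maximum(0, Z)
    activation_cache = Z                 # Z is stored here
    return A, activation_cache

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    activation_cache = Z                 # Z is stored here as well
    return A, activation_cache

def linear_activation_forward(A_prev, W, b, activation):
    Z, linear_cache = linear_forward(A_prev, W, b)
    A, activation_cache = relu(Z) if activation == "relu" else sigmoid(Z)
    cache = (linear_cache, activation_cache)   # both caches kept as one tuple per layer
    return A, cache
```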

Here, check out the linear_backward function, and you will find the use of the variables stored in linear_cache. And if you check out the linear_activation_backward function, you will find that two functions are used, relu_backward and sigmoid_backward. Once again, you can check out their implementations in the dnn_utils.py file, and you will find the use of Z, which was stored in activation_cache during forward propagation.
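Again as a simplified sketch (not the exact assignment code), the backward side uses those two caches like this:

```python
import numpy as np

def relu_backward(dA, activation_cache):
    Z = activation_cache
    return dA * (Z > 0)                  # g'(Z) for ReLU: 1 where Z > 0, else 0

def sigmoid_backward(dA, activation_cache):
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))             # A is recomputed from the cached Z
    return dA * s * (1 - s)

def linear_backward(dZ, linear_cache):
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = (1 / m) * np.dot(dZ, A_prev.T)  # uses the cached A of the previous layer
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    backward = relu_backward if activation == "relu" else sigmoid_backward
    dZ = backward(dA, activation_cache)
    return linear_backward(dZ, linear_cache)
```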

Let me know if this helps.

Cheers,
Elemento

Thanks for the comprehensive answer, Elemento. I didn't get to the assignment yet, and when implementing other NNs I made do with only A (I used tanh for the inner layers).

I can see how storing both Z and A in cache can help with generalization of the code, especially when other activation functions are used.

When I get to the assignment I’ll post here again.

Thanks again.

I finished the assignment not long ago; it was not hard, and it made the whole process easy to understand.

And you described it very well, Elemento. As I suspected, the cached Z itself is only used in the relu_backward function (it is also used in sigmoid_backward, but only to recompute A).

Summing it up: as I see it, the answer to my question is yes, we have to cache Z for the back propagation calculation of ReLU (and leaky ReLU), but we only need to cache A for the calculations of sigmoid and tanh. Since A can be calculated from Z, we generally just cache Z, which gives the most general form of the code.

The code would run slightly faster for tanh/sigmoid-only networks, though, if we cached A directly.
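For example, for tanh the two options look like this (just a sketch, not course code):

```python
import numpy as np

def tanh_backward_from_Z(dA, Z):
    A = np.tanh(Z)                 # tanh has to be recomputed from the cached Z
    return dA * (1 - A ** 2)

def tanh_backward_from_A(dA, A):
    return dA * (1 - A ** 2)       # with A cached, no extra tanh evaluation is needed
```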

Yes, that’s true, but I think your next point is the relevant one:

We are trying to write general code here that works in all cases. The general formula that we need to implement here and the reason we cache Z is this:

dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]})

If we had A = g(Z) cached instead, that would not turn out to be sufficient in all cases, as you said. We just happen to get lucky that the derivatives of tanh and sigmoid are functions of A, but that is not true for all activation functions. Although now that you mention it, you can sort of "fake it" with ReLU as well by using A. But the point about generality still stands …
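To spell that out, with a = g(z):

sigmoid: g'(z) = a(1 - a)
tanh: g'(z) = 1 - a^2
ReLU: g'(z) = 1 if z > 0, else 0, and since a = max(0, z), the condition z > 0 is the same as a > 0, which is the "fake it" trick.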

How about this as an alternative: caching things isn’t really that expensive, since it just costs a bit of memory. We could cache both Z and A in the “activation cache” and then we would have a choice down in the “activation backward” routine. Note that the A value that we have in the linear cache is the wrong one: it’s for the previous layer. Prof Ng’s team has chosen one way to do it, but you can do it your own way when it’s your turn to write the code “for real”. :nerd_face:
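In code, that alternative would look roughly like this (just a sketch, not the assignment's actual code):

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation):
    Z = np.dot(W, A_prev) + b
    linear_cache = (A_prev, W, b)
    A = np.maximum(0, Z) if activation == "relu" else 1 / (1 + np.exp(-Z))
    activation_cache = (Z, A)        # cache both; it only costs a bit of memory
    return A, (linear_cache, activation_cache)
```

Then the "activation backward" routine can unpack (Z, A) and use whichever value is more convenient for the activation function at hand.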

Although it is probably worth saying at this point that in the "real world", no one really builds all this machinery themselves in Python. For real applications of all this, people use one of the "frameworks" which have built everything for you. We will learn about TensorFlow in Course 2 of this series, but there are other platforms as well, e.g. PyTorch and many others.
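Just to give a flavor of that (a minimal Keras sketch, not something we need yet in this course; X_train and Y_train are placeholders here):

```python
import tensorflow as tf

# Define a small network; the framework takes care of forward propagation,
# caching whatever intermediate values it needs, and back propagation.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, Y_train, epochs=10)
```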


Got it. Thank you Paul.