How does the function backward_propagation relate to L_model_backward?

Hello,

The assignment "Optimization_methods" loads a function "backward_propagation".

I am confused about its input parameters and how it compares to the function "L_model_backward" from previous assignments.

In Exercise 1, this demo code is shown:

# Backward propagation.

grads = backward_propagation(a, caches, parameters)

Question: are the function’s input parameters correct?

Later on, in def model, the function is called like this:

grads = backward_propagation(minibatch_X, minibatch_Y, caches)

From previous assignments backward propagation was implemented in this function:

grads = L_model_backward(AL, Y, caches)

Question: I assumed L_model_backward and backward_propagation would do the same thing. But given that the first input parameter is different, I doubt it. Can you please explain in which regard backward_propagation is different from L_model_backward?

Well, you can implement the same logical function in lots of different ways. The APIs here are not the same. In the Optimization Assignment, they don’t really need the full generality that they did back in Course 1 Week 4. Here they are just writing the code to be simple and exactly what they need for the specific case at hand. E.g. notice that forward_propagation and backward_propagation in this assignment are hard-wired to a 3 layer network. L_model_backward is designed to handle any number of layers.

Hi Paul,

I understand that one can implement the functions in different ways. What I don’t get yet is why the input parameters are chosen so differently.

backward_propagation uses "minibatch_X" as its first input parameter, which is on the "left side", the input side of the model.

L_model_backward uses "AL" as its first input parameter, which is on the "right side", the output side of the model. From what we learned in the courses, this makes more sense to me.

I wonder, what was the rationale of implementing backward_propagation with these input parameters?

Why am I interested in this? Well, to deepen my learning, I tried to use the model with a different dataset, and to avoid wasting Coursera's resources I copied the code to my local machine. As I don't have the opt_utils_v1a library, I tried replacing the functions with the ones we wrote in the previous assignments.

It is great that you are doing that type of experimentation. It is a great way to both make sure you grasp the material and to extend your knowledge. You always learn something interesting!

You could simply download and use the opt_utils_v1a library: it's just another file in the same folder as the notebook itself. Click "File → Open" from the notebook and have a look around. There is a topic about that on the FAQ Thread. You can download that file, or open it and copy/paste the functions you want into a file on your local computer.

But back to the comparison of the two APIs …

As I mentioned before, the point of L_model_backward is that it handles the completely general case: any number of layers, either sigmoid or relu as the activation at each layer, and so forth. Notice that it calls down through several layers of subroutines to accomplish its task: linear_activation_backward, which in turn calls linear_backward, relu_backward and sigmoid_backward, right? That's a lot of logic.
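To make that structure concrete, here is a rough sketch of those subroutines (not the verbatim course code, just the shape of it), using the C1 W4 cache layout where each layer's cache is ((A_prev, W, b), Z):

```python
import numpy as np

# Rough sketch of the C1 W4-style subroutines that L_model_backward calls;
# each layer's cache is assumed to be ((A_prev, W, b), Z) as in that assignment.

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0                         # relu'(Z) is 1 where Z > 0, else 0
    return dZ

def sigmoid_backward(dA, Z):
    s = 1. / (1. + np.exp(-Z))
    return dA * s * (1 - s)                # sigmoid'(Z) = s * (1 - s)

def linear_backward(dZ, linear_cache):
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = (1. / m) * np.dot(dZ, A_prev.T)   # the general formula quoted below
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    linear_cache, Z = cache
    dZ = relu_backward(dA, Z) if activation == "relu" else sigmoid_backward(dA, Z)
    return linear_backward(dZ, linear_cache)
```

Each piece is small, but you need the whole stack of them, plus the layer loop in L_model_backward itself, to handle the general case.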

Now look at the code for backward_propagation: it does not call any subroutines. All the logic that is spread across those layers of routines is "hard-coded" custom code, written out in one function, and it only handles a 3 layer network with relu as the hidden layer activation. Notice how they write out the derivative of relu in the required places.

Also note that because forward_propagation is written in the same style, it can pack the A, W and b values for all layers into one flat list for the cache, instead of the nested (layered) approach to the caches taken in the C1 W4 general code.

Finally, notice that they do not include a0 in the cache value in the backward_propagation case, but you need it for the calculation of dW^{[1]}, the gradient of W^{[1]} at the first layer. The general formula is:

dW^{[l]} = \displaystyle \frac {1}{m} dZ^{[l]} \cdot A^{[l-1]T}

So for layer 1, that is A^{[0]}, but by notational convention that is just X, right? In the case of the general solution in L_model_backward, that A^{[0]} value in the above formula is the A_prev value that is returned in the linear_cache for layer 1 of forward propagation.
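Here is a minimal sketch of what that hard-coded style looks like (not the exact opt_utils_v1a code), assuming relu on the two hidden layers, sigmoid on the output layer and the cross-entropy cost, so that the output-layer term simplifies to A3 - Y. Note where X appears directly in the layer 1 gradient, which is exactly why it is a parameter:

```python
import numpy as np

# Sketch only: a hard-coded 3-layer backward pass in the style of the
# assignment's backward_propagation. The cache is assumed to be one flat
# tuple packed by the matching forward pass.
def backward_propagation_3layer(X, Y, cache):
    m = X.shape[1]
    (A1, W1, b1, A2, W2, b2, A3, W3, b3) = cache

    dZ3 = (1. / m) * (A3 - Y)              # sigmoid + cross-entropy; 1/m folded in once here
    dW3 = np.dot(dZ3, A2.T)
    db3 = np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = dA2 * np.int64(A2 > 0)           # relu derivative written out inline
    dW2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * np.int64(A1 > 0)
    dW1 = np.dot(dZ1, X.T)                 # A^[0] is just X, passed in as a parameter
    db1 = np.sum(dZ1, axis=1, keepdims=True)

    return {"dW3": dW3, "db3": db3, "dW2": dW2, "db2": db2,
            "dW1": dW1, "db1": db1}
```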

So the inclusion of X as a parameter is a choice that they made instead of putting the A0 value into their cache list. They could have done it either way, but given the general “hard-coded” approach they are taking here it makes more sense just to use X rather than doing the A0 renaming dance. That would just make the code a bit more confusing. Just my opinion of course. If you are going to use the code for your own purposes, you can do it the other way.

Well, actually, they probably did it that way for symmetry. They need both the X value and the AL value for the computation, right? In L_model_backward, you pass AL as a parameter and get X from the caches. Here they make the symmetric choice: they pass X as a parameter and get AL from the caches. The alternative would have been to include 4 values of A^{[l]} in the cache, but only 3 values of the W^{[l]} and b^{[l]} coefficients. Maybe they thought that the lack of symmetry there would have been aesthetically offensive or just confusing. These are the choices we have to make. :nerd_face:

It is worth considering using the whole suite of general functions from Course 1 Week 4. It's more work now, but if you think you might want to use this code for other experiments later, the conversion will pay off: when the next experiment involves a 4 or 5 layer network or a different activation function, you won't have to rewrite everything.
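Purely as a sketch of what that conversion could look like, here is a mini-batch training loop built on the general helpers. The helper names (L_model_forward, compute_cost, L_model_backward, update_parameters from C1 W4, and random_mini_batches from this assignment) are passed in as arguments, so you would supply your own copies of those functions:

```python
# Sketch only: mini-batch gradient descent assembled from the general
# C1 W4 helpers, passed in explicitly so nothing here is hidden.
def train_general(X, Y, parameters, learning_rate, num_epochs, mini_batch_size,
                  L_model_forward, compute_cost, L_model_backward,
                  update_parameters, random_mini_batches):
    for epoch in range(num_epochs):
        for minibatch_X, minibatch_Y in random_mini_batches(X, Y, mini_batch_size):
            AL, caches = L_model_forward(minibatch_X, parameters)   # works for any depth
            cost = compute_cost(AL, minibatch_Y)                    # track or print as you like
            grads = L_model_backward(AL, minibatch_Y, caches)       # AL passed in, X comes via the caches
            parameters = update_parameters(parameters, grads, learning_rate)
    return parameters
```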

Wow. That’s really insightful.

Paul, I want to thank you for pointing me to the source code for the imported functions, for showing me a clever way of implementing "relu backward", and most of all for your detailed explanation of how L_model_backward and backward_propagation compare. It's really helpful for me!

I think I will now try to add all the code needed to get the Adam optimizer on top of the general solution. Keep your fingers crossed :blush:

That’s great! Adding the Adam logic to the “general” code sounds like the best way to go. Then you’ve got a powerful toolkit that you can use in lots of cases. Let us know how it works! :nerd_face:
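In case it helps you get started, here is a minimal, self-contained sketch of one Adam step over the parameters and grads dictionaries that the general code produces (keys "W1", "b1", ... and "dW1", "db1", ...). These are the standard Adam formulas; the assignment's own update_parameters_with_adam may differ in details such as argument order:

```python
import numpy as np

# Sketch only: one Adam update over dicts keyed "W1", "b1", ... (parameters)
# and "dW1", "db1", ... (grads). v and s hold the running first and second
# moment estimates and should be initialized to zero arrays with the same
# shapes as the parameters; t is the 1-based step count for bias correction.
def adam_step(parameters, grads, v, s, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2                                    # number of layers
    for l in range(1, L + 1):
        for p in ("W", "b"):
            key = p + str(l)
            g = grads["d" + key]
            v[key] = beta1 * v[key] + (1 - beta1) * g           # first moment estimate
            s[key] = beta2 * s[key] + (1 - beta2) * (g ** 2)    # second moment estimate
            v_hat = v[key] / (1 - beta1 ** t)                   # bias correction
            s_hat = s[key] / (1 - beta2 ** t)
            parameters[key] = parameters[key] - learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return parameters, v, s
```

Initialize v and s once before the training loop and increment t by one on every update, then this slots in where update_parameters was called in the loop above.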

Thank you for your detailed answer