How does the function backward_propagation relate to L_model_backward?

It's great that you are doing that type of experimentation: it's a good way both to make sure you grasp the material and to extend your knowledge. You always learn something interesting!

You could just download and use the opt_utils_v1a library: it's just another file in the same folder as the notebook itself. Click “File → Open” from the notebook and have a look around; there is a topic about that on the FAQ Thread. You can download that file, or open it and copy/paste the functions you want into a file on your local computer.

But back to the comparison of the two APIs …

As I mentioned before, the point of L_model_backward is that it handles the completely general case: any number of layers, either sigmoid or relu as the activation at each layer, and so forth. Notice that it works through three layers of functions to accomplish its task: it calls linear_activation_backward, which in turn calls linear_backward and either relu_backward or sigmoid_backward. That's a lot of logic.
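To make that call structure concrete, here is a minimal NumPy sketch of the layered design. The function names follow the C1 W4 assignment, but the signatures and the cache layout here are simplified assumptions, not the course's exact code:

```python
import numpy as np

def relu_backward(dA, Z):
    # relu derivative: pass the gradient through only where Z > 0
    return dA * (Z > 0)

def sigmoid_backward(dA, Z):
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def linear_backward(dZ, A_prev, W):
    # gradients of the linear step Z = W A_prev + b
    m = A_prev.shape[1]
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    # one layer of backprop: activation step, then linear step
    A_prev, W, Z = cache          # simplified flat cache for illustration
    if activation == "relu":
        dZ = relu_backward(dA, Z)
    else:
        dZ = sigmoid_backward(dA, Z)
    return linear_backward(dZ, A_prev, W)
```

An L_model_backward would then just loop over the layers, calling linear_activation_backward once per layer with the right cache entry, which is where the nested cache structure comes from.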

Now look at the code for backward_propagation: it does not call any subroutines. All the logic expressed in those layers of routines is “hard-coded” custom code, written out in one function, and it only handles a 3 layer network with relu as the hidden layer activation. Notice how they write out the derivative of relu inline in the required places.

Also note that because forward_propagation is written in the same style, it can pack the A, W and b values for all layers into one flat list for the cache, instead of the nested (layered) approach to the caches taken in the C1 W4 general code.

Finally, notice that they do not include a0 in the cache in the backward_propagation case, but you need it to compute dW^{[1]}, the gradient of W^{[1]} at the first layer. The general formula is:

dW^{[l]} = \displaystyle \frac {1}{m} dZ^{[l]} \cdot A^{[l-1]T}

So for layer 1, that is A^{[0]}, but by notational convention that is just X, right? In the case of the general solution in L_model_backward, that A^{[0]} value in the above formula is the A_prev value that is returned in the linear_cache for layer 1 of forward propagation.
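To see both points at once (the relu derivative written out inline, and the layer-1 gradient using X directly because A^{[0]} = X), here is a minimal sketch of the hard-coded style for the first layer. The names (dz2, W2, a1) and the single-function scope are illustrative assumptions, not the actual opt_utils code:

```python
import numpy as np

def first_layer_backward(dz2, W2, a1, X):
    """Hard-coded backward step for layer 1 with a relu hidden layer.

    Illustrative sketch: the relu derivative is written out inline as
    (a1 > 0) instead of calling a relu_backward subroutine, and the
    layer-1 weight gradient uses X directly, since A^[0] = X.
    """
    m = X.shape[1]
    da1 = W2.T @ dz2
    dz1 = da1 * (a1 > 0)           # relu derivative, hard-coded inline
    dW1 = (1 / m) * dz1 @ X.T      # dW^[1] = (1/m) dZ^[1] A^[0]T, with A^[0] = X
    db1 = (1 / m) * np.sum(dz1, axis=1, keepdims=True)
    return dW1, db1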

So the inclusion of X as a parameter is a choice that they made instead of putting the A0 value into their cache list. They could have done it either way, but given the general “hard-coded” approach they are taking here it makes more sense just to use X rather than doing the A0 renaming dance. That would just make the code a bit more confusing. Just my opinion of course. If you are going to use the code for your own purposes, you can do it the other way.

Well, actually, they probably did it that way for symmetry. They need both the X value and the AL value for the computation, right? In L_model_backward, you pass AL as a parameter and get X from the caches. Here they make the symmetric choice: they pass X as a parameter and get AL from the caches. The alternative would have been to include 4 values of A^{[l]} in the cache, but only 3 values each of the W^{[l]} and b^{[l]} coefficients. Maybe they thought that the lack of symmetry there would have been aesthetically offensive or just confusing. These are the choices we have to make. :nerd_face:

It is also worth considering converting to the whole suite of general functions from Course 1 Week 4. It's more work now, but if you think you might later want to use this code for other experiments, it may pay off: when the next experiment involves a 4 or 5 layer network or a different activation function, you won't have to rewrite everything.
