It is great that you are doing that type of experimentation. It is a great way to both make sure you grasp the material and to extend your knowledge. You always learn something interesting!

You could just download and use the *opt_utils_v1a* library: it’s simply another file sitting in the same folder as the notebook itself. Click “File → Open” from the notebook and have a look around; there is a topic about that on the FAQ Thread. You can download the file, or open it and copy/paste the functions you want into a file on your local computer.

But back to the comparison of the two APIs …

As I mentioned before, the point of *L_model_backward* is that it handles the completely general case: any number of layers, with either *sigmoid* or *relu* as the activation at each layer, and so forth. Notice that it calls three layers of other functions (subroutines) in order to accomplish its task: *linear_activation_backward*, which in turn calls *linear_backward*, *relu_backward*, and *sigmoid_backward*, right? That’s a lot of logic.
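Just to make that layered structure concrete, here is a sketch of how those subroutines fit together. The function names follow the course API, but these simplified bodies are my own assumptions, not the assignment’s actual code:

```python
import numpy as np

def relu_backward(dA, Z):
    # Derivative of relu: pass the gradient through where Z > 0, zero elsewhere
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, Z):
    # Derivative of sigmoid: s * (1 - s)
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def linear_backward(dZ, linear_cache):
    # The purely linear part of one layer's backward step
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    dW = (1 / m) * dZ @ A_prev.T
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    # Dispatch on the activation, then do the shared linear step
    linear_cache, Z = cache
    if activation == "relu":
        dZ = relu_backward(dA, Z)
    else:
        dZ = sigmoid_backward(dA, Z)
    return linear_backward(dZ, linear_cache)
```

The key point is that *L_model_backward* only has to loop over the layers and call *linear_activation_backward* with the right activation string, which is exactly what makes it general.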

Now look at the code for *backward_propagation*: it does not call *any* subroutines. All the logic expressed in those layers of routines is “hard-coded” as custom code written out in one function, and it only handles a 3-layer network with *relu* as the hidden-layer activation. Notice how they write out the derivative of *relu* inline in the required places. Also note that because *forward_propagation* is written in the same style, it can pack the A, W, and b values for all layers into one flat list for the cache, instead of the nested (layered) cache structure used in the C1 W4 general code. Finally, notice that they do not include A^{[0]} in the cache in the *backward_propagation* case, even though you need it for the calculation of dW^{[1]}, the gradient of W^{[1]} at the first layer. The general formula is:
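Here is a sketch of what that “hard-coded” style looks like for a 3-layer net (relu, relu, sigmoid). The flat cache layout and the signature are my assumptions about the style, not a copy of the course file; note the relu derivative written out inline and X standing in for A^{[0]}:

```python
import numpy as np

def backward_propagation(X, Y, cache):
    # Everything written out longhand for exactly 3 layers: no subroutines.
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y                        # sigmoid + cross-entropy shortcut
    dW3 = (1 / m) * dZ3 @ A2.T
    db3 = (1 / m) * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = W3.T @ dZ3
    dZ2 = dA2 * (Z2 > 0)                # relu derivative written inline
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (Z1 > 0)                # relu derivative again
    dW1 = (1 / m) * dZ1 @ X.T           # X plays the role of A^[0] here
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    return {"dW1": dW1, "db1": db1,
            "dW2": dW2, "db2": db2,
            "dW3": dW3, "db3": db3}
```

Every line here is a special case of what *linear_activation_backward* does in general, just unrolled for this one fixed architecture.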

dW^{[l]} = \displaystyle \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T}

So for layer 1, that is A^{[0]}, but by notational convention that is just X, right? In the case of the general solution in *L_model_backward*, that A^{[0]} value in the above formula is the *A_prev* value that is returned in the *linear_cache* for layer 1 of forward propagation.
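To make the layer-1 case concrete, here is a tiny numpy sketch (the dimensions are made up for illustration) showing that applying the general formula at l = 1 just uses X directly as A_prev:

```python
import numpy as np

# Layer-1 instance of dW^[l] = (1/m) dZ^[l] @ A^[l-1].T, with A^[0] = X.
np.random.seed(1)
m = 4                          # number of training examples (made up)
X = np.random.randn(3, m)      # A^[0]: 3 input features, m examples
dZ1 = np.random.randn(5, m)    # gradient at layer 1 (5 hidden units, made up)
dW1 = (1 / m) * dZ1 @ X.T      # X plays the role of A_prev at layer 1
```

So whether X arrives as a function parameter or as the *A_prev* entry in the layer-1 *linear_cache*, the arithmetic is identical.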

So the inclusion of X as a parameter is a choice that they made instead of putting the A0 value into their cache list. They could have done it either way, but given the general “hard-coded” approach they are taking here it makes more sense just to use X rather than doing the A0 renaming dance. That would just make the code a bit more confusing. Just my opinion of course. If you are going to use the code for your own purposes, you can do it the other way.

Well, actually, they probably did it that way for symmetry. They need both the X value and the AL value for the computation, right? In *L_model_backward*, you pass AL as a parameter and get X from the caches. Here they make the symmetric choice: they pass X as a parameter and get AL from the caches. The alternative would have been to include 4 values of A^{[l]} in the cache, but 3 values of all the W^{[l]} and b^{[l]} coefficients. Maybe they thought that the lack of symmetry there would have been aesthetically offensive or just confusing. These are the choices we have to make.

It is worth considering converting to the whole suite of general functions from Course 1 Week 4. That is more work now, but it may well pay off: if the next experiment you want to do involves a 4 or 5 layer network or a different activation function, you won’t have to rewrite everything.