I was reading the optional articles in Week 4! I've run into this question constantly while implementing neural networks in numpy, and I'm curious: why is the gradient of z in the vectorized version calculated as:
\frac{∂J}{∂z} = \frac{∂J}{∂a} \frac{∂a}{∂z}
and not in the reverse order, like this?

\frac{∂J}{∂z} = \frac{∂a}{∂z} \frac{∂J}{∂a}
It's always presented as the former, but I've never really understood why.

Does it have to do with the fact that we take a matrix multiplication? And if so, why?

I don't get why the positions of \frac{∂J}{∂a} and \frac{∂a}{∂z} are reversed in the final derivation. Given that the order of matrix multiplication matters, this confused me a bit.

That is not part of the course, and understanding it is not required. I've skimmed some of that material, but it was several years ago, so I won't be able to answer without spending an hour refreshing my memory, and I'm not feeling like that would be a good use of my time right at the moment, sorry. Jonas goes through everything in quite a bit of detail. If you really want to understand it, I suggest you go through it all again and make sure you understand in detail the notation he is using and how he is representing everything. Note that matrix multiplication is not commutative in general, but the dot product of two 1D vectors is commutative. I'm not sure whether that is relevant in this context, since I have not studied his notation.
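The commutativity point can be made concrete with a small numpy sketch (mine, not from the course materials). For an elementwise activation g, the chain-rule factor ∂a/∂z acts coordinate-by-coordinate, so in the vectorized code it becomes an elementwise (Hadamard) product with dA, and Hadamard products are commutative, so the written order does not change the result:

```python
import numpy as np

# Sketch: for an elementwise activation, dJ/dZ is a Hadamard product of
# dJ/dA and g'(Z), and Hadamard products commute.
np.random.seed(0)
Z = np.random.randn(4, 3)        # pre-activations for one layer
dA = np.random.randn(4, 3)       # dJ/dA flowing back from the next layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

g_prime = sigmoid(Z) * (1 - sigmoid(Z))   # da/dz, evaluated elementwise

dZ_forward = dA * g_prime    # "dJ/da first" order
dZ_reverse = g_prime * dA    # "da/dz first" order

print(np.allclose(dZ_forward, dZ_reverse))  # True: the two orders agree
```

The order only starts to matter when the Jacobian ∂a/∂z is a full matrix (e.g. softmax), which is where the careful bookkeeping in the optional derivations comes in.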

A couple of high level points:

Prof Ng just gives you the formulas for back propagation, so you can just use them as is.
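For reference, those formulas for a single dense layer can be sketched in numpy along these lines (this is my own sketch following the DLS shape convention, where columns of A_prev are the m examples; the bias is omitted in the forward pass to keep it short):

```python
import numpy as np

# Hedged sketch of the vectorized back-prop formulas for one dense layer.
np.random.seed(1)
n_prev, n_l, m = 3, 2, 5
W = np.random.randn(n_l, n_prev)
A_prev = np.random.randn(n_prev, m)
Z = W @ A_prev                  # forward pass (bias omitted for brevity)
A = np.tanh(Z)                  # any elementwise activation
dA = np.random.randn(n_l, m)    # gradient arriving from the layer above

dZ = dA * (1 - A**2)            # elementwise: dJ/dZ = dJ/dA * g'(Z)
dW = (dZ @ A_prev.T) / m        # dJ/dW
db = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = W.T @ dZ              # gradient passed to the previous layer

print(dW.shape, db.shape, dA_prev.shape)
```

Note how the matrix products in dW and dA_prev do fix an order (dZ @ A_prev.T, not the reverse); only the elementwise dZ step is order-free.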

By the end of DLS C2, we will switch to using TensorFlow and won’t have to worry about implementing back propagation anymore. The platform takes care of that for us.

We are not required to understand calculus in order to succeed here. If you want to go through the full derivations, the other approach besides the material on Jonas’s website is to check out the links on this thread.

Given your background and level of experience, it is interesting to hear you express this. That said (and I have no problem with your feelings), I know I still need to improve my own math skills a bit. Unfortunately, math has never been my strong suit.

Still, especially having completed the convolutions course, I can recognize that given the depth of the networks involved, and now that we are starting to work with skip connections, working out the derivatives for back-prop is a non-trivial task even for someone very well versed in the math.

However, based on what I've read so far outside of class on AutoGrad/Gradient Tape, here is a question you might be able to provide some insight on, as it is still unclear to me:

When PyTorch/TF computes the gradients for back-prop, is it working out some sort of heuristic? [A number of things I've read gave me that impression.] Or is it flat-out pure CAS? [That is more what I'd expect, and if it isn't just CAS, why isn't it?]

Sorry, I don’t know what CAS stands for there. You can find more details by reading the relevant docpages for Torch or TF. Those are just the top level intros and you can find more links from there.

What I understand to be the case is that for functions they implement, e.g. all the activation functions, they include code that directly implements the derivative. In other words, they do the actual calculus in closed form and then write the code for that function; e.g. you've seen what the derivative of sigmoid is by now in DLS C1. For everything else, they run the algorithm for automatic differentiation, which applies the chain rule mechanically to the recorded graph of operations. It plays the same role as the Gradient Checking logic we implemented in DLS C2 W1, although Gradient Checking does "numeric differentiation" with finite differences, whereas autodiff propagates exact derivatives. Either way, they've thought a lot harder about it and gone out of their way to get good performance without sacrificing accuracy.
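To illustrate the closed-form side and the Gradient Checking idea side by side, here is a small sketch (my own, not course code) comparing the analytic derivative of sigmoid with a centered finite difference, which is exactly the trick the Gradient Checking exercise uses:

```python
import numpy as np

# Closed-form derivative of sigmoid vs. a centered finite difference.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)          # the closed-form derivative from DLS C1

z = np.linspace(-3, 3, 7)
eps = 1e-5
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The two agree to many decimal places; the tiny gap is the
# finite-difference truncation/rounding error.
print(np.max(np.abs(numeric - sigmoid_prime(z))))
```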

The high level point is what I said above: you just don't need to worry about it. They handle it for you. If you want to be an implementer of a framework like that, then you need to do the same work they did in order to build it. But if you are simply using one of the frameworks, you don't really have to worry about it.

Oh hey, that's cool. CAS = "Computer Algebra System", like what is in Wolfram Alpha, or the high-end TI calculators for some time now, etc. [And I'm sure Matlab, but it has been some time since I had the cash to pay/play, and probably Octave, but I have not ventured that far yet.]

It is not the easiest thing in the world to design one of these systems, but they are tried and true even entirely apart from NNs. And yes, they can handle differentiation, symbolic integration, etc. (it is all just rules, no?)
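The "it is all just rules" point can be made concrete with a toy symbolic differentiator in pure Python (my own illustration, not how any real CAS or framework is implemented). Expressions are the variable 'x', a number, or a tuple ('+', a, b) / ('*', a, b):

```python
# Toy rule-based symbolic differentiation: each operator gets one rule.
def diff(expr):
    if expr == 'x':
        return 1
    if isinstance(expr, (int, float)):
        return 0
    op, a, b = expr
    if op == '+':               # sum rule
        return ('+', diff(a), diff(b))
    if op == '*':               # product rule
        return ('+', ('*', diff(a), b), ('*', a, diff(b)))
    raise ValueError(op)

def evaluate(expr, x):
    if expr == 'x':
        return x
    if isinstance(expr, (int, float)):
        return expr
    op, a, b = expr
    va, vb = evaluate(a, x), evaluate(b, x)
    return va + vb if op == '+' else va * vb

# d/dx (x*x + 3x) = 2x + 3, so at x = 2 the derivative is 7
e = ('+', ('*', 'x', 'x'), ('*', 3, 'x'))
print(evaluate(diff(e), 2))  # 7
```

Real CASs add many more rules plus simplification, but the recursive rule-application structure is the same idea.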

But again, some of the things I'd read made it sound more like TF/PyTorch work out the back-prop as an optimization problem (thus the "heuristics"), rather than just a "math" one.

In any case, if I come across a future link that makes this clearer, we can discuss it in another thread.

1) What are the gradients, and how do you calculate them?

2) How do you apply the gradients once you've calculated them?
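The two steps above can be sketched on a toy problem (my own example, assuming plain gradient descent on a one-parameter least-squares fit J(w) = mean((w·x − y)²)):

```python
import numpy as np

# Step 1: calculate the gradient. Step 2: apply it via the update rule.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])     # generated with the true w = 2
w = 0.0
lr = 0.1

for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # step 1: dJ/dw
    w -= lr * grad                        # step 2: gradient descent update

print(round(w, 3))  # converges to 2.0
```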

There are no "heuristics" in 1): that's just the autodiff or numerical difference calculation, or the hand-implemented derivative functions for the "canned" functions they provide (sigmoid, ReLU, et alia).

The optimization algorithm can then do other things with the gradients, e.g. computing exponentially weighted averages as in Adam and Adagrad and so forth.
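For instance, the exponentially weighted average that Adam keeps for the gradients (its "first moment") looks roughly like this sketch of mine, with the usual default beta = 0.9:

```python
import numpy as np

# Exponentially weighted average of a stream of noisy gradients:
# the optimizer's extra bookkeeping on top of the raw calculus.
np.random.seed(2)
beta = 0.9
v = 0.0
grads = np.random.randn(50) + 1.0    # noisy gradients around a mean of 1.0

for g in grads:
    v = beta * v + (1 - beta) * g    # running average smooths the noise

print(round(v, 2))  # a smoothed estimate of the gradients' underlying mean
```

The gradients themselves come out of step 1 unchanged; only what is done with them afterwards differs between SGD, Adam, Adagrad, and friends.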

But the pure calculus part is cut and dried and I’d be surprised if they used the CAS method you describe. But you can read the TF docs and see if you can find more about that.

But the top level point is worth reiterating: the torch and TF teams have already figured this out. You don’t need to worry about it, unless you are just curious and want to understand how they did it.

If step (1) is justified, then I think step (2) is already a proof for step (3) because step (2) is the only way to recover step (1). What do you think?