Explanation for derived gradients for LSTM back-prop?

Hello, here is my question from the original discussion board:

This looks like just a notation issue. That is to say: what do you think Prof Ng means by dtanh? And what does he mean by db_a? Also note that we’re doing partial derivatives here, not univariate ones.


Thanks for the reply!
So are you saying that what he wrote means the same thing as what I wrote out more verbosely?

I’m a bit confused because I thought he said that notation is shorthand for the derivative w.r.t. the loss, not the output (i.e. dtanh := \displaystyle \frac {\partial \tanh}{\partial L}). Am I mistaken?

Yes, you have it “upside down”. When he says dfoo, what he means is \displaystyle \frac {\partial L}{\partial foo} or perhaps \displaystyle \frac {\partial J}{\partial foo} depending on the context. And sometimes when it’s just a Chain Rule factor at a given layer, the numerator contains something other than L or J. That’s part of the problem: there is some built-in ambiguity in his notational conventions. He then refers to it as “the gradient of foo”, but that’s really not quite right either: it’s the gradient of J (or whatever) w.r.t. foo.
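
If it helps to see that convention in code, here is a tiny toy example (my own sketch, not anything from the lectures or the assignment) where every “dfoo” variable holds \displaystyle \frac {\partial L}{\partial foo}, built up with the Chain Rule and checked against a finite difference:

```python
import numpy as np

# Toy illustration of the "dfoo" convention (not the assignment's code):
# every variable named "dfoo" holds dL/dfoo.
z = 0.3
y = 1.0

a = np.tanh(z)             # forward: a = tanh(z)
L = 0.5 * (a - y) ** 2     # simple squared-error loss

da = a - y                 # "da" means dL/da
dz = da * (1.0 - a ** 2)   # "dz" means dL/dz = dL/da * da/dz  (Chain Rule)

# sanity check: finite-difference estimate of dL/dz
eps = 1e-6
dz_numeric = (0.5 * (np.tanh(z + eps) - y) ** 2
              - 0.5 * (np.tanh(z - eps) - y) ** 2) / (2 * eps)
print(dz, dz_numeric)      # the two values agree to high precision
```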

For convenience, he also makes a few other shortcuts here. E.g. in the notation we use here, the gradient of an object has the same shape as the object itself, which makes the parameter update process simpler. If you really go “full math”, it turns out that the gradient ends up having the shape of the transpose of the base object.

So I salute your desire to understand in more detail what is going on here, but these courses are specifically designed not to require even univariate calculus as a prerequisite, so there’s no way he can show the “full math” in this context. The math here is material most people haven’t seen unless they were a math, physics, or EE major. Here’s a thread with pointers to derivations and info about matrix calculus if you want to go deeper.
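
To make the shape point concrete, here is a minimal numpy sketch (hypothetical shapes and names, not the assignment's code) showing that with this convention dW comes out with the same shape as W, so the update step needs no transposes:

```python
import numpy as np

# Minimal sketch of the "gradient has the same shape as the parameter"
# convention (hypothetical shapes, not the assignment's code).
n_a, n_x, m = 5, 3, 10                  # hidden size, input size, batch size
W = np.random.randn(n_a, n_x)           # parameter
x = np.random.randn(n_x, m)             # a batch of inputs
z = W @ x                               # forward: shape (n_a, m)

dz = np.random.randn(n_a, m)            # upstream gradient dL/dz, same shape as z
dW = (dz @ x.T) / m                     # dL/dW, averaged over the batch

assert dW.shape == W.shape              # (n_a, n_x) -- same shape as W

learning_rate = 0.01
W -= learning_rate * dW                 # elementwise update, no transposes needed
```

With the “full math” layout you’d instead end up with the transposed shape and have to flip it back before every update, which is exactly the bookkeeping this convention avoids.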