Formal explanation of change of order in chain rule

When applying the chain rule in backward propagation of gradients, we know that \frac{\partial J}{\partial \mathbf{A}^{[ 1 ]}} = \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}} \; \frac{\partial \mathbf{Z}^{[ 2 ]}}{\partial \mathbf{A}^{[ 1 ]}}. We also know that \frac{\partial \mathbf{Z}^{[ 2 ]}}{\partial \mathbf{A}^{[ 1 ]}} = \frac{\partial }{\partial \mathbf{A}^{[ 1 ]}} \left( \mathbf{W}^{[ 2 ]} \mathbf{A}^{[ 1 ]} + \mathbf{b}^{[ 2 ]} \right) = \mathbf{W}^{[ 2 ]}. However, instead of \frac{\partial J}{\partial \mathbf{A}^{[ 1 ]}} = \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}} \mathbf{W}^{[ 2 ]}, which is not a valid operation, we have \frac{\partial J}{\partial \mathbf{A}^{[ 1 ]}} = {\mathbf{W}^{[ 2 ]}}^\top \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}}, which is valid, but the order of the factors in the chain rule (matrix multiplication is not commutative) has been reversed and \mathbf{W}^{[ 2 ]} has been transposed. I understand that this makes the product computable and makes the dimensions of the gradient match those of \mathbf{A}^{[ 1 ]}, as they should.

Are there any formal justifications, apart from this dimensions argument, for why this happens? I would really appreciate it if anyone could help.
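
To be concrete, here is a quick numerical check of the formula I am asking about (just a sketch with a made-up scalar cost J = 0.5 \, \| \mathbf{Z}^{[ 2 ]} \|^2 and arbitrary sizes, so that \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}} = \mathbf{Z}^{[ 2 ]} is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 3, 4, 5                      # layer sizes and number of examples (arbitrary)
W2 = rng.standard_normal((n2, n1))
b2 = rng.standard_normal((n2, 1))
A1 = rng.standard_normal((n1, m))

def cost(A):
    Z2 = W2 @ A + b2
    return 0.5 * np.sum(Z2 ** 2)         # made-up cost: J = 0.5 * ||Z2||^2

Z2 = W2 @ A1 + b2
dJ_dZ2 = Z2                              # exact gradient for this particular cost
dJ_dA1 = W2.T @ dJ_dZ2                   # the formula in question

# Central-difference approximation of dJ/dA1, entry by entry
eps = 1e-6
numeric = np.zeros_like(A1)
for i in range(n1):
    for j in range(m):
        Ap, Am = A1.copy(), A1.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        numeric[i, j] = (cost(Ap) - cost(Am)) / (2 * eps)

print(np.allclose(numeric, dJ_dA1, atol=1e-6))   # True
```

The central-difference gradient matches {\mathbf{W}^{[ 2 ]}}^\top \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}}, so the formula certainly holds; what I am after is the formal reason behind it.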

There are several layers to the answer here. The top-layer answer is that Prof Ng has specifically designed this course not to require knowledge even of univariate calculus, let alone multivariate or matrix calculus. So the good news is you don't need to know calculus. The bad news is that means you have to just take his word for the formulas.

The next layer down is that if you have the math background, Prof Ng has slightly simplified things here. In the fully mathematical treatment of all this, the "gradients" would be the transpose of what Prof Ng shows. With Prof Ng's simplification, we have the rule that the shape of the gradient is the same as the shape of the underlying object; in math, it would be the transpose of that. Of course we also have this mathematical identity:

(A \cdot B)^T = B^T \cdot A^T

Given that he’s not really showing the full derivations, the simplification just makes things work more smoothly in terms of how we apply the gradients. Here’s a thread with links to the derivations and general info about matrix calculus.
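
Here is a sketch of how that identity does the work for a single training example (column vectors \mathbf{a}^{[ 1 ]} and \mathbf{z}^{[ 2 ]} = \mathbf{W}^{[ 2 ]} \mathbf{a}^{[ 1 ]} + \mathbf{b}^{[ 2 ]}, using the mathematical convention that the derivative of a scalar with respect to a column vector is a row vector):

\frac{\partial J}{\partial \mathbf{a}^{[ 1 ]}} = \frac{\partial J}{\partial \mathbf{z}^{[ 2 ]}} \; \frac{\partial \mathbf{z}^{[ 2 ]}}{\partial \mathbf{a}^{[ 1 ]}} = \frac{\partial J}{\partial \mathbf{z}^{[ 2 ]}} \; \mathbf{W}^{[ 2 ]} \quad \Longrightarrow \quad \left( \frac{\partial J}{\partial \mathbf{a}^{[ 1 ]}} \right)^\top = {\mathbf{W}^{[ 2 ]}}^\top \left( \frac{\partial J}{\partial \mathbf{z}^{[ 2 ]}} \right)^\top

Transposing both sides with the identity above converts the row-shaped mathematical gradients into the column-shaped gradients of the course, which is exactly where the reversed order and the transpose of \mathbf{W}^{[ 2 ]} come from.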


Thank you Paul. My question is not about practicing the contents of this course, which, as you said, does not require a deep understanding of matrix calculus. I am specifically interested in that part, and I have read a number of matrix calculus resources these past days trying to find the answer to this question, but what I read everywhere is that the chain rule in matrix calculus is non-commutative, which, like other parts of linear algebra, is to be expected. I am just trying to see if anybody here knows the reason for this one question, which is the only part I could not derive myself when calculating the gradients.

Hello @ajallooe,

That is because the “scalar chain rule” does not apply here.

This post has some references to Wikipedia, and also an example.

In the example, steps 1 to 3 set up the example; step 4 shows the expected result we want to derive; step 5 assumes the scalar chain rule applies but turns out to be unable to produce the answer; step 6 does it the right way, which does not follow the scalar chain rule.
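
As a quick shape-level illustration of that step-5-versus-step-6 situation (just a sketch with made-up sizes, not the actual numbers from that post):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 3, 4, 5                    # made-up layer sizes and number of examples
W2 = rng.standard_normal((n2, n1))     # shape (4, 3)
dJ_dZ2 = rng.standard_normal((n2, m))  # shape (4, 5), stands in for dJ/dZ2

try:
    dJ_dZ2 @ W2                        # literal scalar-style chain rule: (4, 5) @ (4, 3)
except ValueError as err:
    print("scalar-style chain rule cannot even be formed:", err)

dJ_dA1 = W2.T @ dJ_dZ2                 # (3, 4) @ (4, 5) -> (3, 5), the shape of A1
print(dJ_dA1.shape)                    # (3, 5)
```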

If you have accepted that the scalar chain rule does not apply but were looking for a different version of the chain rule that does apply, I am not familiar with that topic. However, as I also quoted in that post:

Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable.

I think this is the key condition you can rely on to verify any rule you come up with for the situation of your interest. The Wikipedia page lists many identities, but if it does not cover what you are interested in, you might fall back on this condition and try to establish the identities yourself. My example in that post was just trying to verify it that way.
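
For what it is worth, that component-wise route does recover the formula asked about here (a sketch, with i indexing rows of \mathbf{A}^{[ 1 ]} and j indexing the examples):

Z^{[ 2 ]}_{kl} = \sum_{p} W^{[ 2 ]}_{kp} A^{[ 1 ]}_{pl} + b^{[ 2 ]}_{k} \quad \Longrightarrow \quad \frac{\partial Z^{[ 2 ]}_{kl}}{\partial A^{[ 1 ]}_{ij}} = W^{[ 2 ]}_{ki} \, \delta_{lj}

\frac{\partial J}{\partial A^{[ 1 ]}_{ij}} = \sum_{k, l} \frac{\partial J}{\partial Z^{[ 2 ]}_{kl}} \frac{\partial Z^{[ 2 ]}_{kl}}{\partial A^{[ 1 ]}_{ij}} = \sum_{k} \frac{\partial J}{\partial Z^{[ 2 ]}_{kj}} W^{[ 2 ]}_{ki} = \left( {\mathbf{W}^{[ 2 ]}}^\top \frac{\partial J}{\partial \mathbf{Z}^{[ 2 ]}} \right)_{ij}

Only the ordinary scalar chain rule is used here, applied entry by entry; collecting the entries back into a matrix is what produces the transpose and the reversed order.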

Cheers,
Raymond

Thank you, Raymond. This is what I was looking for, and I was reaching the same conclusion, thinking it through on my own time as well.
I am writing this because it may be of use to someone else (I am the kind of person who is uneasy trusting the math if I don't know how it works, and somebody else here may be like me):
The thing is, I am coming from a non-DL ML background, and there, first, whatever we take the derivative of using matrix calculus is a scalar, and second, the things we differentiate with respect to are vectors, not matrices. That is why the regular "extended" differentiation rules work.
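For example (a small illustration of my own, not from the course): with the least-squares cost J(\mathbf{w}) = \| \mathbf{X} \mathbf{w} - \mathbf{y} \|^2 and \mathbf{w} a column vector, every derivative involved is at most a matrix, and the familiar rules give

\nabla_{\mathbf{w}} J = 2 \, \mathbf{X}^\top \left( \mathbf{X} \mathbf{w} - \mathbf{y} \right),

a gradient with the same size as \mathbf{w}, with no higher-order tensors ever appearing.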
In DL, however, it does not always work that way. In truth, the derivative of a non-scalar-valued function with respect to a matrix is a tensor of order > 2 (i.e., a 3+-dimensional array), not a matrix.

What makes this hard to see is that there is a nice property in "regular" DL: we have a dimension, the one indexing the different examples, and the derivative of anything at one example with respect to anything at a different example (in any other matrix) is zero. That is, if \mathbf{a} depends on \mathbf{B}, with \mathbf{a} of size 1 \times m and \mathbf{B} of size n \times m, where both second dimensions (of size m) index the different examples in the data, then \frac{\partial \mathbf{a}}{\partial \mathbf{B}} is actually a tensor of order 3 (actually 4, of size 1 \times m \times n \times m under the numerator convention for matrix calculus, but you can discard the singleton dimension). However, this tensor is "diagonal" in the two dimensions of size m: if \mathbf{D} = \frac{\partial \mathbf{a}}{\partial \mathbf{B}}, then D_{i, j, k} = 0 whenever i \neq k. This means \mathbf{D} is effectively of order 2 (we are really working with a sequence of per-example vectors, rather than full-blown matrices in all their expressive power).

What is done in "regular" DL is that these results are presented not in the correct format of order-3+ tensors but as "effective" matrices (the convention that the size of a gradient must match that of the matrix/vector it is taken with respect to is also not correct unless we use these "effective" matrices).

This works everywhere except in the case I asked about. Why? Because the same is true in the chain rule: if \mathbf{f} is a vector-valued function of size 1 \times m (as \mathcal{L} is), then for a matrix \mathbf{A} of size n \times m, \frac{\partial \mathbf{f}}{\partial \mathbf{A}} is a (Jacobian) tensor of size 1 \times m \times n \times m, and you cannot take the simplified "effective" matrix and apply the chain rule to it: it has the first non-singleton dimension of size m removed. If you carry out the chain rule with the full tensors, it works (provided you also use the appropriate tensor extension of the dot product). You can then "effective"-ize the resulting tensors (which contain a diagonal sub-tensor) and arrive at the same result you would get from {\mathbf{W}^{[ i ]}}^\top \frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{[ i ]}}.
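
Here is a small numerical sketch of that last paragraph (made-up sizes, with NumPy's einsum standing in for the tensor extension of the dot product): the full Jacobians are built as order-3/order-4 tensors, the chain rule is carried out by contracting over both of \mathbf{Z}'s indices, and taking the diagonal of the result reproduces {\mathbf{W}}^\top \frac{\partial \mathcal{L}}{\partial \mathbf{Z}}.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 3, 4, 5                      # made-up layer sizes and number of examples
W = rng.standard_normal((n2, n1))        # W^[2]
g = rng.standard_normal((n2, m))         # the "effective" matrix of dL/dZ^[2]
eye = np.eye(m)

# Full Jacobian of the 1 x m loss vector L w.r.t. Z (singleton dimension dropped):
# G[l, k, j] = dL_l / dZ_{kj} = delta(l, j) * g[k, j]  -- "diagonal" in the example dims
G = np.einsum('lj,kj->lkj', eye, g)

# Full Jacobian of Z = W A + b w.r.t. A:
# T[k, j, i, q] = dZ_{kj} / dA_{iq} = W[k, i] * delta(j, q)
T = np.einsum('ki,jq->kjiq', W, eye)

# Chain rule with the full tensors: contract over both of Z's indices (k, j)
H = np.einsum('lkj,kjiq->liq', G, T)     # dL/dA as an order-3 tensor, shape (m, n1, m)

# H is diagonal in its two example dimensions; "effective"-izing = taking that diagonal
dA_effective = np.einsum('lil->il', H)   # shape (n1, m), the shape of A^[1]

print(np.allclose(dA_effective, W.T @ g))   # True
```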

Mohammad