Hello @yujin_lee2,
As @paulinpaloalto suggested, we often resort to checking the matrices’ indices to make sure all the symbols are in the right order. As he also pointed out, we can’t always do matrix calculus the way we do scalar calculus. As Wikipedia summarizes very well:
Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued functions of matrices.
We can do a simple exercise to verify that we indeed can’t use the chain rule like that. Let’s say we define matrices Z, W, A and a scalar L this way,
Obviously,
Can our scalar “chain rule” recover the same result?
The answer is no. But the following will do:
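As a numerical check of this point, here is a minimal sketch in plain Python. It assumes Z = WA with 2x2 matrices and L equal to the sum of all entries of Z; these definitions, the sample numbers, and the helper names (`matmul`, `transpose`, `loss`, `numerical_grad_W`) are my own illustrative assumptions and may differ from the ones in the exercise above. It compares a naive scalar-style product against the index-respecting form dL/dW = (dL/dZ) Aᵀ, using finite differences as the reference.

```python
# Assumed setup (illustrative only): Z = W A with 2x2 matrices,
# L = sum of all entries of Z.

def matmul(X, Y):
    """Plain matrix product on nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Transpose of a nested-list matrix."""
    return [list(row) for row in zip(*X)]

def loss(W, A):
    """L = sum of all entries of Z = W A."""
    return sum(sum(row) for row in matmul(W, A))

def numerical_grad_W(W, A, eps=1e-6):
    """Central finite-difference estimate of dL/dW, one entry at a time."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            Wp = [row[:] for row in W]
            Wm = [row[:] for row in W]
            Wp[i][j] += eps
            Wm[i][j] -= eps
            grad[i][j] = (loss(Wp, A) - loss(Wm, A)) / (2 * eps)
    return grad

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[5.0, 6.0], [7.0, 8.0]]

# dL/dZ_ij = 1 for every entry, because L simply sums the entries of Z.
dL_dZ = [[1.0, 1.0], [1.0, 1.0]]

# Naive scalar-style guess: pretend dZ/dW "is" A and multiply as written.
naive = matmul(dL_dZ, A)
# Index-respecting matrix form: dL/dW = dL/dZ @ A^T.
correct = matmul(dL_dZ, transpose(A))
# Independent reference from finite differences.
reference = numerical_grad_W(W, A)

print(naive)    # [[12.0, 14.0], [12.0, 14.0]]
print(correct)  # [[11.0, 15.0], [11.0, 15.0]]
```

Under these assumptions, `reference` agrees with `correct` up to rounding error, while `naive` does not.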
Lastly, in my step (5), I said “this is wrong” because a matrix-by-matrix derivative like that won’t result in a 2x2 matrix. As the same Wikipedia page says:
Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable.
So the correct object there is not a 2x2 matrix but a tensor that organizes all 16 partial derivatives (each of the 4 entries of one matrix with respect to each of the 4 entries of the other).
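To make the count of 16 concrete, here is a small sketch in plain Python. It again assumes Z = WA with 2x2 matrices and L equal to the sum of the entries of Z, which is my own illustrative choice and not necessarily the setup used above; `dZ_dW_entry` is a hypothetical helper name. Under that assumption, each entry of the derivative is ∂Z_ij/∂W_kl = δ_ik · A_lj, and the result is a 2x2x2x2 array.

```python
# Assumed setup (illustrative only): Z = W A with 2x2 matrices.
n = 2
A = [[5.0, 6.0], [7.0, 8.0]]

def dZ_dW_entry(i, j, k, l):
    """dZ_ij / dW_kl for Z = W A: nonzero only when i == k, then equal to A[l][j]."""
    return A[l][j] if i == k else 0.0

# 4-index tensor, indexed as dZ_dW[i][j][k][l].
dZ_dW = [[[[dZ_dW_entry(i, j, k, l) for l in range(n)]
           for k in range(n)]
          for j in range(n)]
         for i in range(n)]

# Count the stored partial derivatives.
num_entries = sum(len(dZ_dW[i][j][k])
                  for i in range(n) for j in range(n) for k in range(n))
print(num_entries)  # 16

# Contracting the tensor with dL/dZ over (i, j) gives a 2x2 dL/dW.
# Here dL/dZ is all ones, assuming L is the sum of the entries of Z.
dL_dZ = [[1.0] * n for _ in range(n)]
dL_dW = [[sum(dL_dZ[i][j] * dZ_dW[i][j][k][l]
              for i in range(n) for j in range(n))
          for l in range(n)]
         for k in range(n)]
print(dL_dW)  # [[11.0, 15.0], [11.0, 15.0]]
```

The tensor has four indices, and it is only after summing over the indices of Z that we get back something with the shape of W.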
I think the Wikipedia page has a lot of useful examples if you want to dig deeper.
Cheers,
Raymond