Backpropagation through time derivation question

Hi, I have a question about deriving backpropagation through time: I do not get how to determine the order of the matrices after taking the derivative. For example, what rule makes dWax have dtanh in the front, while dxt has dtanh at the back? To be honest, I can figure out the order for most of them if I write out the dimensions, since I know what dWax is going to look like, for example. However, I cannot do that for dWaa because it is a square matrix. Thank you so much to whoever answers; it’s been a long 3 days reading about matrix calculus haha.
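For concreteness, here is a minimal NumPy sketch of the shape bookkeeping I mean, assuming the usual cell a_next = tanh(Wax @ xt + Waa @ a_prev + ba), with Wax of shape (n_a, n_x), Waa of shape (n_a, n_a), xt of shape (n_x, m), and a_prev of shape (n_a, m); the variable names and sizes below are just my own convention for illustration:

```python
import numpy as np

# Arbitrary small sizes, only for checking dimensions
n_a, n_x, m = 5, 3, 10

Wax = np.random.randn(n_a, n_x)     # (n_a, n_x)
Waa = np.random.randn(n_a, n_a)     # (n_a, n_a) -- square!
xt = np.random.randn(n_x, m)        # (n_x, m)
a_prev = np.random.randn(n_a, m)    # (n_a, m)
dtanh = np.random.randn(n_a, m)     # same shape as a_next: (n_a, m)

# The orderings in question:
dWax = dtanh @ xt.T                 # (n_a, m) @ (m, n_x) -> (n_a, n_x), matches Wax
dxt = Wax.T @ dtanh                 # (n_x, n_a) @ (n_a, m) -> (n_x, m), matches xt

# For dWaa the shapes alone cannot decide the order:
print((dtanh @ a_prev.T).shape)     # (n_a, n_a)
print((a_prev @ dtanh.T).shape)     # also (n_a, n_a) -- both match Waa
```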

Speaking of which, when there is a formula or something whose derivation you don’t understand and it bothers you, would you guys move on after a while?

Hello @Eugene_Ku

Matrix differentiation is different from scalar differentiation, and the chain rule that we know works perfectly only in the scalar case. This post gave an example of when it breaks.

The most brute-force way is to break a matrix equation down into a list of scalar equations, do the differentiation, and assemble them back into a matrix equation. This can easily be done with some simple low-rank matrices (like in the linked post).
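To make that concrete, here is a small worked instance (my own notation: z for the pre-activation W_{aa} a_{t-1} + W_{ax} x_t + b_a, dz for ∂L/∂z, with i indexing hidden units and k indexing samples). Writing the scalar equations and re-assembling them fixes the ordering for dW_{aa} even though both orderings would have the right shape:

```latex
z_{ik} = \sum_j (W_{aa})_{ij}\,(a_{t-1})_{jk} + \sum_j (W_{ax})_{ij}\,(x_t)_{jk} + (b_a)_i

\frac{\partial L}{\partial (W_{aa})_{ij}}
  = \sum_k \frac{\partial L}{\partial z_{ik}}\,
           \frac{\partial z_{ik}}{\partial (W_{aa})_{ij}}
  = \sum_k dz_{ik}\,(a_{t-1})_{jk}
  = \left( dz\, a_{t-1}^{\top} \right)_{ij}
\quad\Longrightarrow\quad
dW_{aa} = dz\, a_{t-1}^{\top}
```

The same element-wise bookkeeping gives dW_{ax} = dz x_t^T and dx_t = W_{ax}^T dz, which is exactly the "dtanh in front vs. dtanh at the back" pattern in the question.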

Does it have to be? Is there any way we can make it not be a square matrix? If not, would you mind showing us how you used the dimensions to verify them?

I vote yes, and come back to it later on if it is/becomes very important.

Cheers,
Raymond


Hi Raymond, good to see you again. On your comment,

> Does it have to be? Is there any way we can make it not be a square matrix? If not, would you mind showing us how you used the dimensions to verify them?

Edit: oops, I thought I got it, but I don’t. So, back to dWaa: the dimension-checking method would not work because either order results in the same dimensions, right?
This is how I dimension-check, btw (I used z instead of tanh):


This method would work for all except dWaa :frowning:
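One way around it that I find handy (just a sketch, my own workaround rather than anything from the course): a finite-difference check on a tiny random example distinguishes the two orderings even though their shapes agree.

```python
import numpy as np

np.random.seed(0)
n_a, n_x, m = 4, 3, 2
Wax = np.random.randn(n_a, n_x)
Waa = np.random.randn(n_a, n_a)
ba = np.random.randn(n_a, 1)
xt = np.random.randn(n_x, m)
a_prev = np.random.randn(n_a, m)
da_next = np.random.randn(n_a, m)   # pretend upstream gradient dL/da_next

def loss(Waa_):
    # scalar proxy loss whose gradient w.r.t. a_next is exactly da_next
    a_next = np.tanh(Wax @ xt + Waa_ @ a_prev + ba)
    return np.sum(da_next * a_next)

# analytic candidates for dWaa
a_next = np.tanh(Wax @ xt + Waa @ a_prev + ba)
dtanh = da_next * (1 - a_next ** 2)
cand1 = dtanh @ a_prev.T    # dtanh in front
cand2 = a_prev @ dtanh.T    # dtanh at the back (same shape!)

# central-difference numerical gradient, one entry of Waa at a time
eps = 1e-6
num = np.zeros_like(Waa)
for i in range(n_a):
    for j in range(n_a):
        E = np.zeros_like(Waa)
        E[i, j] = eps
        num[i, j] = (loss(Waa + E) - loss(Waa - E)) / (2 * eps)

print(np.allclose(num, cand1, atol=1e-5))   # True: dWaa = dtanh @ a_prev.T
print(np.allclose(num, cand2, atol=1e-5))   # False in general
```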

Thank you so much, as always, Raymond! You are like a savior showing up when I thought I was all alone in a sandstorm, hah.

Hello @Eugene_Ku,

Yes, I just checked too. Then I think the luck is that we have dWax as our reference - dWaa should look like dWax :stuck_out_tongue: . I know this is a dirty way, but… it is a way. As for why dWax should look like that, my reason is as I said - matrix differentiation isn’t like scalar differentiation, so the ordering is not arbitrary.

But yes, you are right, dimension checking won’t be helpful for dWaa.

Not everyone is willing to be in a sandstorm, but if the ticket to the dream is there, what choice do you and I have? :wink: Let’s go!

Cheers,
Raymond


Ahaha, that’s totally right!

@rmwkwok Hi Raymond, I decided to make a post about this topic! Matrix Calculus For Deep Learning: Taking Derivatives of Matrices Through Time | by Eugene Ku | Aug, 2023 | Medium
Let me know what you think!

Hey @Eugene_Ku,

Wonderful! That is on my to-do list for today. :raised_hands:

Raymond


Hello @Eugene_Ku,

Interesting notes! :wink:

“0 if i != a” is very intuitive, because it is just saying that the a-th sample in X has no impact on the i-th sample in Y, which is, of course, true.

F_{abcd} is a very useful notation! It tells you immediately that it is 4-dimensional, and we can play with the subscripts in our arguments.
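If it helps anyone reading along, here is one tiny concrete instance of such a 4-index object (my own example with samples as rows, not necessarily the exact setup in the article):

```latex
Y_{ab} = \sum_c X_{ac}\,W_{cb}
\quad\Longrightarrow\quad
F_{abcd} = \frac{\partial Y_{ab}}{\partial X_{cd}} = \delta_{ac}\,W_{db}
```

which is 0 whenever a != c, i.e. sample c of X has no effect on sample a of Y.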

Cheers,
Raymond

PS: The notes could be improved if they are intended for a beginner-level audience; otherwise, they are interesting notes with a lot of helpful discussion. Thanks for sharing, @Eugene_Ku!


Thank you! I’ll keep that in mind. I might make one for the multivariate chain rule as well, but thanks for reading and for the feedback :blush: