Hello, I ran into an issue while deriving the LSTM gradients.
Is the formula given in the assignment wrong?
Any input will be appreciated.
Could you tell me which part you found suspicious, please?
Yes, I tried to derive the LSTM gradients (even though I don’t have much experience with matrix calculus) and noticed that the formula for dgammau given in section “3.2 - LSTM Backward Pass”, “gates gradients” subsection,
actually comes out equal to the following sum of derivative chains:
(dJ/da)(da/dc)(dc/dGAMMAu)(dGAMMAu/dgammau) + (dJ/dc)(dc/dGAMMAu)(dGAMMAu/dgammau)
which equals 2 dJ/dgammau, because, according to the chain rule,
dJ/dgammau = (dJ/da)(da/dc)(dc/dGAMMAu)(dGAMMAu/dgammau) = (dJ/dc)(dc/dGAMMAu)(dGAMMAu/dgammau).
I.e., if I’m not mistaken, the given formula is twice the actual derivative. I suppose it would still work, but I don’t understand why it’s done that way.
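
A quick numerical check may help settle this. The sketch below is not the assignment's code. It assumes the standard cell equations c = carry + GAMMAu * c_tilde and a = GAMMAo * tanh(c), a made-up linear loss J = sum(w_a * a) + sum(w_c * c), and that the dJ/dc factor in the formula means only the part of the gradient that reaches c without passing through a (for example, from the next time step). The names cell, toy_loss, compare_gradients, carry, w_a and w_c are hypothetical. It compares each derivative chain against a finite-difference estimate of dJ/dgammau.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cell(gamma_u, c_tilde, carry, Gamma_o):
    """Elementwise cell step; `carry` stands in for GAMMAf * c_prev (assumed)."""
    Gamma_u = sigmoid(gamma_u)        # GAMMAu = sigmoid(gammau)
    c = carry + Gamma_u * c_tilde     # cell state
    a = Gamma_o * np.tanh(c)          # hidden state
    return Gamma_u, c, a


def toy_loss(gamma_u, c_tilde, carry, Gamma_o, w_a, w_c):
    """Hypothetical loss: J depends on a directly and on c directly."""
    _, c, a = cell(gamma_u, c_tilde, carry, Gamma_o)
    return float(np.sum(w_a * a) + np.sum(w_c * c))


def compare_gradients(n=4, seed=0, eps=1e-6):
    rng = np.random.default_rng(seed)
    gamma_u, c_tilde, carry, w_a, w_c = rng.normal(size=(5, n))
    Gamma_o = sigmoid(rng.normal(size=n))
    Gamma_u, c, _ = cell(gamma_u, c_tilde, carry, Gamma_o)

    # For this loss, dJ/da = w_a and the direct (not via a) dJ/dc = w_c.
    d_sigmoid = Gamma_u * (1.0 - Gamma_u)                 # dGAMMAu/dgammau
    via_a = w_a * Gamma_o * (1.0 - np.tanh(c) ** 2) * c_tilde * d_sigmoid
    via_c = w_c * c_tilde * d_sigmoid

    # Central finite differences on gammau, one component at a time.
    numeric = np.zeros(n)
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        plus = toy_loss(gamma_u + step, c_tilde, carry, Gamma_o, w_a, w_c)
        minus = toy_loss(gamma_u - step, c_tilde, carry, Gamma_o, w_a, w_c)
        numeric[i] = (plus - minus) / (2.0 * eps)
    return via_a, via_c, numeric


if __name__ == "__main__":
    via_a, via_c, numeric = compare_gradients()
    print("chain through a:", via_a)
    print("chain through c:", via_c)
    print("sum of chains:  ", via_a + via_c)
    print("finite diff:    ", numeric)
```

Under these assumptions the two chains are not equal to each other, and it is their sum that matches the finite-difference gradient. So whether the formula double-counts comes down to what the dJ/dc factor is taken to mean: the total derivative with respect to c, or only the contribution that bypasses a.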