Can you tell me which part you found suspicious, please?
Yes. I tried to derive the LSTM gradients (even though I don’t have much experience with matrix calculus) and noticed that the formula for dgammau given in section “3.2 - LSTM Backward Pass”, “gates gradients” subsection,
actually comes out equal to the following sum of derivative chains:
(dJ/da)(da/dc)(dc/dGAMMAu)(dGAMMAu/dgammau) + (dJ/dc)(dc/dGAMMAu)(dGAMMAu/dgammau)
which equals 2dJ/dgammau
because
dJ/dgammau =
(dJ/da)(da/dc)(dc/dGAMMAu)(dGAMMAu/dgammau) = (dJ/dc)(dc/dGAMMAu)(dGAMMAu/dgammau) according to the chain rule.
I.e., if I’m not mistaken, the given formula is twice the actual derivative. That would still work, I suppose; I just don’t see why it’s done that way.
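For reference, the sum of derivative chains quoted above can be written out in LaTeX. This is only a sketch of the notation: it assumes the usual LSTM forward equations (a = Γo ∗ tanh(c), c = Γf ∗ c_prev + Γu ∗ c̃, Γu = σ(γu)), which are not quoted in this thread and may differ from the assignment's exact notation. Time-step superscripts are dropped and all products are elementwise.

```latex
% Assumed forward equations (standard LSTM, not quoted from the assignment):
%   a = \Gamma_o * \tanh(c), \quad
%   c = \Gamma_f * c_{prev} + \Gamma_u * \tilde{c}, \quad
%   \Gamma_u = \sigma(\gamma_u)

% The two chains from the post, written as partial derivatives:
\frac{\partial J}{\partial a}\,
\frac{\partial a}{\partial c}\,
\frac{\partial c}{\partial \Gamma_u}\,
\frac{\partial \Gamma_u}{\partial \gamma_u}
\;+\;
\frac{\partial J}{\partial c}\,
\frac{\partial c}{\partial \Gamma_u}\,
\frac{\partial \Gamma_u}{\partial \gamma_u}

% Local derivatives under the assumed forward equations:
\frac{\partial a}{\partial c} = \Gamma_o * \bigl(1 - \tanh^2(c)\bigr), \qquad
\frac{\partial c}{\partial \Gamma_u} = \tilde{c}, \qquad
\frac{\partial \Gamma_u}{\partial \gamma_u} = \Gamma_u * (1 - \Gamma_u)

% Substituting them into the sum, with the common factors pulled out:
\left( \frac{\partial J}{\partial c}
     + \frac{\partial J}{\partial a} * \Gamma_o * \bigl(1 - \tanh^2(c)\bigr) \right)
* \tilde{c} * \Gamma_u * (1 - \Gamma_u)
```

Here ∂J/∂a and ∂J/∂c stand for the two gradients that appear as the leading factors of the two chains in the post.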

