I want to get a deeper understanding of GRUs and LSTMs, since from the course I mostly came away with the feeling of “here’s something complicated”.
My first question is what is actually proved about GRUs and LSTMs. My suspicion is that they just work empirically, so I wonder what the empirical results on them are. I found one paper discussing them a bit, but it focused on the 1-d versions, which sort of defeats the purpose for me, because it feels like the whole point of gated RNNs is to forget some things and remember others.
Some thoughts so far. The idea of gated RNNs seems kind of like a “resnet”, where you have a connection through the RNN cells without activation functions, but with gating instead. Does this seem like a good viewpoint?
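To make the analogy concrete, here is a minimal numpy sketch of the two update rules I have in mind. The residual step is a plain skip connection; the gated step is the state update at the heart of a GRU (and of the LSTM cell state). Nothing here is learned, it just shows the shapes of the two ideas.

```python
import numpy as np

# ResNet view: the identity path carries the state straight through and a
# layer only adds a correction on top of it.
def resnet_step(h):
    return h + np.tanh(h)            # tanh stands in for an arbitrary layer

# Gated view (the core of a GRU / the LSTM cell state): a gate z in (0, 1)
# interpolates between keeping the old state and writing a new candidate,
# so the "skip connection" is multiplicative and learned, not fixed.
def gated_step(h, z, candidate):
    return z * h + (1.0 - z) * candidate

h = np.array([1.0, -0.5])
# With z near 1 the old state survives almost unchanged ("remember");
# with z near 0 it gets overwritten ("forget").
print(gated_step(h, z=0.9, candidate=np.zeros(2)))
```

So the resnet comparison seems reasonable to me, with the difference that the “skip” strength is itself input-dependent.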
Another approach you can use to grok concepts is to ask whether the object you have is “as simple as possible”. Another thing you can do is relate concepts, so you might ask what the difference between an LSTM and a GRU is.
In terms of “as simple as possible”: if you look at the difference between LSTMs and GRUs, there is a question of why the forget gate isn’t just 1 − the input gate, as in a GRU. That would allow fewer gates.
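The two cell-state updates can be written side by side; the coupled form is the GRU-style simplification I mean (I believe this variant has actually been studied under the name “coupled input and forget gates”). The concrete numbers here are just made up for illustration.

```python
import numpy as np

c = np.array([0.5, -1.0])   # previous cell state (illustrative values)
g = np.array([0.2, 0.8])    # tanh candidate ("update")
i = np.array([0.9, 0.1])    # input gate
f = np.array([0.3, 0.7])    # forget gate

# Standard LSTM: forget and input gates are learned independently,
# so the cell can simultaneously keep a lot AND write a lot.
standard = f * c + i * g

# Coupled variant (as in a GRU): forget is tied to 1 - input,
# one gate fewer, and keeping and writing trade off against each other.
coupled = (1.0 - i) * c + i * g

print(standard, coupled)
```

The independent gates buy extra flexibility at the cost of parameters, which is exactly the “is it as simple as possible” question.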
You could try to understand a GRU as, in some sense, merging the cell state and the hidden state. It also seems to combine the two tanh layers into one. With an LSTM you first generate an “update” for the cell state with a tanh, then after updating you generate a new hidden state and an output. Whereas with a GRU, the output conceptually comes from the “input stage of the LSTM”. This makes it feel like the GRU is in some sense “behind” the LSTM, since it uses the old cell state rather than the updated cell state to generate its output.
I do wonder what h is actually for in an LSTM: it gets generated from the cell state with a tanh layer, used in calculations, and then immediately discarded.
I wonder if you could progressively simplify an LSTM towards a GRU and see what is actually useful. One step might be to not calculate h, but feed in c instead. Another step is to generate the output from the input tanh, rather than having a separate tanh.
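As a sketch, the two simplification steps would give a hypothetical intermediate cell like this (names and structure are my own invention, not a published variant): the gates read the raw cell state `c` directly instead of a derived `h`, and the output is taken from the input tanh before the update.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def simplified_lstm_step(x, c, W, U, b):
    # Hypothetical intermediate between an LSTM and a GRU.
    # W: (3H, D), U: (3H, H), b: (3H,) -- gates stacked as i, f, g
    i, f, g = np.split(W @ x + U @ c + b, 3)   # gates read c directly, no h
    i, f = sigmoid(i), sigmoid(f)
    g = np.tanh(g)                             # the input tanh
    c_new = f * c + i * g                      # still independent forget/input
    out = g                                    # output from the input tanh,
    return out, c_new                          # no output gate, no second tanh
```

From here, tying `f = 1 - i` and folding `out` back into the recurrent state would essentially recover a GRU (minus the reset gate), which might make each simplification testable in isolation.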
Hope this isn’t too hard to follow. Also, I’m not really sure whether this theoretical understanding is actually meaningful in NLP. I get the impression you would just try a different model and see how well it works, and if I understand correctly GRUs are preferred. Do other people agree with these ideas, and can they think of anything else along similar lines?