Grokking LSTMs and GRUs: some questions (Week 1 and 2)

I want to get a deeper understanding of GRUs and LSTMs, since from the course I mostly come away with the feeling of “here’s something complicated”.

My first question is what is actually proved about GRUs and LSTMs. My suspicion is that they just work empirically, so I wonder what empirical results exist for them. I found one paper discussing them a bit, but it focused on the 1-d versions, which sort of defeats the purpose for me, because it feels like the whole point of gated RNNs is to forget some things and remember other things.

Some thoughts so far. The idea of gated RNNs seems to be kind of like a “resnet”, where you have a connection through the RNN cells that avoids activation functions and uses gating instead. Does this seem like a good viewpoint?
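
To make the analogy concrete, here is the comparison in equations (standard textbook notation rather than the course's, with $\odot$ meaning element-wise multiplication):

```latex
\begin{align}
  \text{ResNet block:} \quad    & x_{l+1} = x_l + F(x_l) \\
  \text{LSTM cell state:} \quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\end{align}
```

When the forget gate $f_t$ is near 1 and the input gate $i_t$ is near 0, the cell state passes through essentially unchanged, like the identity branch of a residual block; the gates let the network learn when to keep that path open and when to overwrite it.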

Another approach you can use to grok concepts is to try to understand whether the object you have is “as simple as possible”. Another thing you can do is relate concepts, so you might ask what the difference between an LSTM and a GRU is.

In terms of “as simple as possible”: if you look at the difference between LSTMs and GRUs, there is a question about why the forget gate isn’t simply 1 minus the input gate, like in a GRU. That would allow fewer gates.
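
To spell that out (standard notation, and conventions vary between write-ups): tying the forget gate to the input gate, $f_t = 1 - i_t$, is a known LSTM variant, sometimes called a coupled input-forget gate, and it makes the cell update look exactly like the GRU's convex combination:

```latex
\begin{align}
  \text{coupled-gate LSTM:} \quad & c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tilde{c}_t \\
  \text{GRU hidden state:} \quad  & h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}
```

With the gates tied, the state is only overwritten to the extent that it is forgotten, and one gate's worth of parameters disappears.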

You could try to understand a GRU as, in some sense, merging the cell state and the hidden state. You also seem to combine the two tanh layers into one. With an LSTM you first generate an “update” for the cell state with a tanh, then after updating you generate a new hidden state and an output. Whereas with a GRU, you seem to take the output conceptually from the “input side” of the LSTM. This makes it feel like the GRU is in some sense “behind” the LSTM, since it uses the “old” cell state rather than the updated cell state to generate the output.
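
To ground that comparison, here is a minimal NumPy sketch of one step of each cell, using the standard formulations (the parameter dictionary p and names such as Wf, Uf are my own placeholders, not anything from the course):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: the cell state c is updated first,
    then the hidden state / output h is read out of the updated c."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])        # input (update) gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])        # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate update
    c = f * c_prev + i * c_tilde      # gated, additive update of the cell state
    h = o * np.tanh(c)                # output read from the *updated* cell state
    return h, c

def gru_step(x, h_prev, p):
    """One GRU step: cell state and hidden state are merged into h,
    and the candidate is built from the reset-gated *old* state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])          # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])          # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate
    h = (1 - z) * h_prev + z * h_tilde  # one gate plays both forget and input roles
    return h
```

In the LSTM, h is read out of the freshly updated c; in the GRU, the candidate is built from the reset-gated old state, which is the sense in which the GRU looks “behind” the LSTM.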

I do wonder what h is actually for in an LSTM: it gets generated from the previous cell state with a tanh layer, used in the gate calculations, and then immediately discarded.

I wonder if you could progressively simplify an LSTM towards a GRU and see what is actually useful. One step might be to not calculate h at all, but feed c into the gates instead. Another step is to generate the output from the input/candidate tanh, rather than having a separate output tanh.
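
As a purely hypothetical sketch of those two steps (not a published variant I know of, just the simplification described above written out, reusing the placeholder parameter names from the earlier sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simplified_lstm_step(x, c_prev, p):
    """Step 1: drop h and feed the previous cell state into every gate.
    Step 2: take the output from the candidate tanh instead of a
    separate tanh(c) read-out."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ c_prev + p["bf"])        # forget gate, conditioned on c
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ c_prev + p["bi"])        # input gate, conditioned on c
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ c_prev + p["bo"])        # output gate, conditioned on c
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ c_prev + p["bc"])  # candidate update
    c = f * c_prev + i * c_tilde   # same additive cell update as before
    y = o * c_tilde                # output taken from the candidate, not from tanh(c)
    return y, c
```

Whether either step actually hurts accuracy is exactly the kind of thing you would have to test empirically.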

Hope this isn’t too hard to follow. Also, I’m not really sure whether this theoretical understanding is actually meaningful in NLP. I get the impression you would just try a different model and see how well it works, and if I understand it correctly GRUs are preferred. Do other people agree with these ideas, and can they think of anything else along similar lines?
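
That “just try both” workflow is easy to set up in practice. Here is a hedged sketch using Keras for a next-word model (vocab_size and the layer sizes are arbitrary placeholders):

```python
import tensorflow as tf

def build_model(rnn_layer, vocab_size=10000, units=128):
    """Build the same next-word model with an interchangeable recurrent layer."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, units),
        rnn_layer(units),  # LSTM or GRU; everything else stays identical
        tf.keras.layers.Dense(vocab_size, activation="softmax"),
    ])

lstm_model = build_model(tf.keras.layers.LSTM)
gru_model = build_model(tf.keras.layers.GRU)

# Train both on the same data and compare validation loss.
for m in (lstm_model, gru_model):
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The GRU has one fewer gate and so fewer parameters per unit, which makes it cheaper to train; whether it actually matches the LSTM seems to depend on the task, which is presumably why people compare empirically rather than arguing from first principles.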

@talwrii I am not sure I am fully qualified to answer your question, but I will try.

For one, based on everything I’ve read, the concept of ‘grokking’ makes me feel a bit nervous-- it is something that might happen, sometimes, in exceptional circumstances, and then we don’t know why.

If I had to take a guess, I would say it comes down to whether the ‘shape’ of the problem happens to fit.

But, point being, it is not something you can depend on.

As to the GRU/LSTM case, I am not 100% sure of their origin (it could come from physics), but when you talk about gates, typically you’re talking about circuits-- the basis of all modern computing (you can do everything with just a NAND, right?).

So, that is the way I like to think of it. We’re building little ‘circuits’ and then chaining them along.
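
If it helps, the “chaining” is literally a loop that reuses the same cell “circuit” at every time step, passing its state along. A minimal sketch, assuming a step function with the same signature as the lstm_step sketch earlier in the thread:

```python
import numpy as np

def run_sequence(cell_step, xs, h0, c0, params):
    """Apply the same little gate 'circuit' at every time step,
    feeding its state forward; only the state links the steps together."""
    h, c = h0, c0
    outputs = []
    for x in xs:  # one pass through the circuit per time step
        h, c = cell_step(x, h, c, params)
        outputs.append(h)
    return np.stack(outputs), (h, c)
```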

Seeing it this way also, I think, makes what is happening ‘less abstract’.

*Your question would also be easier to answer if you came up with a drawing of what you mean/are trying to say :grin:

Hi @talwrii,

The idea behind the LSTM is indeed similar to what you described for the GRU.

The only difference would be that the LSTM uses separate forget and update gates, each with its own activation, when processing an input.

GRUs and LSTMs really show their significance when we have a very long sequence of data: predicting the next word, or translating a chunk of a long sequence, can be addressed using a GRU or an LSTM because they carry information across many time steps.

I am sharing a link about your query on how the LSTM is similar to the GRU,

Feel free to ask or give any feedback.

Regards
DP