Not understanding the structure of GRU

I understand the structure of the LSTM and what it is for, but not the GRU. In the LSTM, the ‘a’ cells and ‘c’ cells perform different functions in the network.
But in the GRU, the ‘a’ cell and the ‘c’ cell are combined into one cell.
Suppose the current input is x<10> and the output is y<10>. If c<10> still contains the information from c<0>, then the output y<10> depends only on c<0> and x<10>, and the information from x<1> to x<9> is all lost.
I’m wondering whether this will result in poor fitting ability, because the information from x<1> to x<9> cannot propagate to the later layers.

I think the LSTM can overcome this problem because the ‘c’ cells cache long-term information and the ‘a’ cells cache short-term information. But the GRU’s output has to choose between using long-term and short-term information.

Besides, I’m also wondering how LSTM and GRU overcome the vanishing gradient problem. Professor Ng only said that these two structures help long-term information propagate to the later layers, but why does that solve the vanishing gradient problem?


Were you able to find answers to your questions?

Would love to know the answer to this as well. I have the exact same questions!

Hello Yifu and Ajinkya,

Here’s a quick link from this post: Understanding GRU - #3 by piyush23, which can give you a broad idea on how GRU solves the vanishing gradient issue while using an RNN architecture.

DLS mentor Kic has posted a link in one of his replies:

The post aims at solving the vanishing gradient problem which comes with a standard recurrent neural network.

To solve the vanishing gradient problem of a standard RNN, a GRU uses a so-called update gate and reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago without washing it out over time, or to remove information which is irrelevant to the prediction.
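To make the gate description concrete, here is a minimal NumPy sketch of a single GRU step. It assumes the course's notation, where c<t> doubles as a<t> and the update is c<t> = Γu * c̃<t> + (1 - Γu) * c<t-1>. The function name `gru_step`, the weight names, the sizes, and the hand-picked bias `bu_mixed` are all made up for illustration; they are not from the lectures or from any library. The point it tries to show is that the update gate is a vector, so each unit of c decides on its own whether to keep its old value or take a new one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wu, Wr, Wc, bu, br, bc):
    """One GRU step (hypothetical helper, course-style notation)."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate, one value per unit
    gamma_r = sigmoid(Wr @ concat + br)   # reset gate, one value per unit
    # Candidate memory, computed from the reset-gated previous state and x<t>
    gated = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(Wc @ gated + bc)
    # Element-wise blend: gamma_u near 0 keeps c_prev, near 1 takes c_tilde
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 4, 3  # assumed sizes, for illustration only
Wu, Wr, Wc = (0.1 * rng.standard_normal((n_c, n_c + n_x)) for _ in range(3))
br = np.zeros(n_c)
bc = np.zeros(n_c)
c_prev = rng.standard_normal(n_c)
x_t = rng.standard_normal(n_x)

# Hand-picked bias (not learned): units 0-1 get gate ~0, units 2-3 get gate ~1
bu_mixed = np.array([-20.0, -20.0, 20.0, 20.0])
c_mixed = gru_step(c_prev, x_t, Wu, Wr, Wc, bu_mixed, br, bc)

print("c_prev :", c_prev)
print("c_mixed:", c_mixed)  # first two entries match c_prev, last two are new
```

In this sketch, units 0 and 1 carry their old values forward untouched while units 2 and 3 are overwritten from the current input, all within the same step. So c<10> does not have to be either "all of c<0>" or "all recent input": some units can still hold information from long ago while others have been updated by x<1> to x<9>. The same update line also suggests why the gradient survives: the derivative of c<t> with respect to c<t-1> contains the term (1 - Γu) directly, so for units whose gate stays near 0 that factor is close to 1 at every step, instead of being repeatedly squashed as in a plain RNN.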