I understand the structure of the LSTM and its goal, but not the GRU. In the LSTM, the ‘a’ cells and ‘c’ cells perform different functions in the network.
But in the GRU, the ‘a’ cell and ‘c’ cell are combined into a single cell.
Suppose the current input is x<10> and the output is y<10>. If c<10> still contains the information of c<0>, then the output y<10> depends only on c<0> and x<10>, and the information from x<1> to x<9> is all lost.
I’m wondering whether this results in poor fitting ability, because the information from x<1> to x<9> cannot propagate to later time steps.
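To make the scenario above concrete, here is a minimal NumPy sketch. It assumes the simplified GRU (no relevance gate), and it sets the update gate by hand instead of computing it with a sigmoid from c<t-1> and x<t> as a real GRU would. The names `gru_step` and `run`, the sizes, and the random weights are all made up for illustration.

```python
import numpy as np

def gru_step(c_prev, x, gate_u, W_c, b_c):
    """One step of a simplified GRU; gate_u is supplied by hand (assumption)."""
    concat = np.concatenate([c_prev, x])
    c_tilde = np.tanh(W_c @ concat + b_c)  # candidate built from c<t-1> and x<t>
    # gate_u = 0 keeps the old memory, gate_u = 1 overwrites it with the candidate
    return gate_u * c_tilde + (1.0 - gate_u) * c_prev

def run(c0, xs, gates, W_c, b_c):
    """Apply gru_step over the whole sequence and return the final memory."""
    c = c0
    for x, g in zip(xs, gates):
        c = gru_step(c, x, g, W_c, b_c)
    return c

rng = np.random.default_rng(0)
n_c, n_x, T = 3, 2, 10                 # hypothetical sizes, 10 time steps
W_c = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros(n_c)
c0 = rng.normal(size=n_c)
xs = rng.normal(size=(T, n_x))         # one input sequence x<1>..x<10>
xs_alt = rng.normal(size=(T, n_x))     # a different input sequence

closed = np.zeros(T)                   # update gate held at 0 for every step
c10_closed = run(c0, xs, closed, W_c, b_c)
c10_closed_alt = run(c0, xs_alt, closed, W_c, b_c)

half = np.full(T, 0.5)                 # update gate partly open at every step
c10_half = run(c0, xs, half, W_c, b_c)
c10_half_alt = run(c0, xs_alt, half, W_c, b_c)

print(np.allclose(c10_closed, c0))              # is c<10> just c<0>?
print(np.allclose(c10_closed, c10_closed_alt))  # do the inputs matter at all?
print(np.allclose(c10_half, c10_half_alt))      # do they matter with an open gate?
```

With the gate held at 0, c<10> is a copy of c<0> whatever the inputs were, which is the case I describe above. With the gate partly open, the two input sequences should lead to different c<10>, so the loss only happens when the gate stays closed at every step and for every unit.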
I think the LSTM can overcome this problem because the ‘c’ cells cache long-term information and the ‘a’ cells cache short-term information. But the GRU’s output has to choose between using long-term and short-term information.
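For reference, these are the memory and output updates as I understand them from the lectures. This is a sketch: I am assuming the simplified GRU without the relevance gate, and ⊙ stands for element-wise multiplication.

```latex
\begin{align*}
\text{GRU (simplified):}\quad
  c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle}
                           + (1 - \Gamma_u) \odot c^{\langle t-1 \rangle}, &
  a^{\langle t \rangle} &= c^{\langle t \rangle} \\
\text{LSTM:}\quad
  c^{\langle t \rangle} &= \Gamma_u \odot \tilde{c}^{\langle t \rangle}
                           + \Gamma_f \odot c^{\langle t-1 \rangle}, &
  a^{\langle t \rangle} &= \Gamma_o \odot \tanh\!\left(c^{\langle t \rangle}\right)
\end{align*}
```

Each gate Γ is a vector with entries between 0 and 1, so the choice between keeping the old value and taking the new candidate is made separately for every unit of c.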
Besides, I’m also wondering how the LSTM and GRU overcome the vanishing gradient problem. Professor Ng only said that these two structures help long-term information propagate to later time steps, but why does that solve the vanishing gradient problem?