I read the papers by Cho et al. (2014) and Chung et al. (2014). In Dr. Ng’s tutorial, he wrote “C_t = (Gamma_u)(C_tilde_t) + (1 - Gamma_u)(C_(t-1)).” This equation follows Chung’s paper. But in Cho’s paper (the original GRU paper), it would be C_t = (1 - Gamma_u)(C_tilde_t) + (Gamma_u)(C_(t-1)). Can anyone who has read the papers confirm this, or check the papers to see whether my understanding is correct?

Cho et al. 2014 “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.”

That’s an interesting point. I also went through the paper, and I see some inconsistencies there.

First, to clarify all the flows inside a GRU cell, I drew the whole picture, including the flows.

This is pretty much consistent with Andrew’s talk.

Then, the problem is this portion, i.e., the output from the update gate. Since Andrew uses different notation, we need to translate in our heads…

I also added an illustration of the hidden activation function proposed in the paper.

- The update gate z (=\Gamma_u in Andrew’s talk) selects whether the hidden state is to be updated with a new hidden state \tilde{h} (=\tilde{c}^{<t>} in Andrew’s talk).
- The reset gate r (=\Gamma_r in Andrew’s talk) decides whether the previous hidden state is ignored.

In this paper, the role of r is clearly stated.

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only

This is consistent with Andrew’s talk. See the “tanh” term. If \Gamma_r = 0, then the term \Gamma_r * c^{<t-1>} inside the “tanh” becomes 0, so only x^{<t>} is used to compute the candidate hidden state at time t.
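To make the reset-gate behaviour concrete, here is a minimal NumPy sketch. It assumes the candidate has the form tanh(W x + U (r * h_prev)) with biases omitted; the function name, shapes, and random weights are all hypothetical and only serve the illustration.

```python
import numpy as np

def candidate_state(x, h_prev, r, W, U):
    """Candidate state tanh(W x + U (r * h_prev)); biases omitted for brevity."""
    return np.tanh(W @ x + U @ (r * h_prev))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # hypothetical sizes, chosen only for the demo
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))
x = rng.normal(size=n_in)

# Two different "previous hidden states" for the same input x.
h_a = rng.normal(size=n_hid)
h_b = rng.normal(size=n_hid)

# With the reset gate at 0, the history term vanishes.
r_zero = np.zeros(n_hid)
cand_a = candidate_state(x, h_a, r_zero, W, U)
cand_b = candidate_state(x, h_b, r_zero, W, U)

# Both candidates coincide and equal tanh(W x): only the input matters.
reset_ignores_history = np.allclose(cand_a, cand_b)
input_only = np.allclose(cand_a, np.tanh(W @ x))

# With the reset gate at 1, the previous hidden state does influence the result.
r_one = np.ones(n_hid)
history_matters = not np.allclose(
    candidate_state(x, h_a, r_one, W, U),
    candidate_state(x, h_b, r_one, W, U),
)
```

So with r = 0 the candidate is the same whatever the previous hidden state was, which matches the sentence quoted from the paper.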

The problem is z_j (\Gamma_u). In this paper, it is used as follows.

h_j^{<t>} = z_j h_j^{<t-1>} + (1 - z_j) \tilde{h}_j^{<t>}

And the paper says:

On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information.

So, interestingly, in this paper, z_j = 1 means “keep the memory, i.e., the previous hidden state”.

On the other hand, in Andrew’s talk, \Gamma_u = 1 means “update the state using the new candidate state, and ignore the previous hidden state”.

I think that’s the reason for the confusion.
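As a quick check that the two conventions describe the same cell with the gate relabelled, here is a small NumPy sketch. The function names and the numbers are made up for illustration; only the two update formulas come from the thread.

```python
import numpy as np

def update_cho(z, h_prev, h_tilde):
    """Cho et al. convention: z = 1 keeps the previous hidden state."""
    return z * h_prev + (1.0 - z) * h_tilde

def update_course(gamma_u, c_prev, c_tilde):
    """Convention from Andrew's talk: gamma_u = 1 takes the candidate state."""
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

# Hypothetical values, chosen only for the demo.
h_prev = np.array([1.0, -2.0, 0.5])
h_tilde = np.array([0.3, 0.9, -0.7])
z = np.array([0.0, 0.25, 1.0])

# Substituting gamma_u = 1 - z gives exactly the same new state.
same_state = np.allclose(
    update_cho(z, h_prev, h_tilde),
    update_course(1.0 - z, h_prev, h_tilde),
)

# The extreme case shows the opposite reading of "gate = 1".
ones = np.ones(3)
cho_keeps_memory = np.allclose(update_cho(ones, h_prev, h_tilde), h_prev)
course_overwrites = np.allclose(update_course(ones, h_prev, h_tilde), h_tilde)
```

In other words, a network trained under one convention can represent anything the other can; the learned gate simply ends up meaning "keep" in one and "overwrite" in the other.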

Looking at some recent articles, Andrew’s definition seems to be more popular, since it is more “intuitive”, i.e., update = 1 means update.

Hope this clarifies.

Thank you so much for your reply. In Cho’s paper, he listed the equations as follows:

h_(t-1) = C_(t-1) (Andrew’s notation); h_tilde_t = C_tilde_t (Andrew’s notation); z = Gamma_u (update gate)

Equation (8) defines h_tilde_t (C_tilde_t); see the screenshot taken from Cho’s article:

This is inconsistent with your illustration. I think the inconsistency is not due to different notation!

That is what is confusing to me!

Thanks again.

Right. Cho’s definition and Andrew’s definition are not the same. Andrew’s definition is the one commonly used recently. (I believe it was not originally defined by Andrew, though…)

But this kind of redefinition occurs frequently, since it is not a matter of math but of definition. Implementations also differ, since other factors come into play, e.g., computational speed, readability of code, interfaces among components, and so on.

The important thing is to understand the theory behind the equations.