I more or less understood GRU, but I don't understand why we need Γ_r at the end. Is updating by one sigmoid coefficient alone not enough? Why do we need one more coefficient?

Hi someone555777,

As explained in the video, the sigmoid function serves to output a value close to 0 or 1, thereby effectively turning the update and reset gates into switches with values close to 0 or 1.

As described in the original paper, if the reset gate (Γ_r) is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This way, old information that is found to be irrelevant is dropped. So the reset gate determines whether or not old information becomes part of the new information in c~<t>.

The update gate controls whether information from the previous hidden state will carry over to the current hidden state. Either c<t> is updated to the new c~<t> (which may or may not contain information from the past), or it simply retains the old information.
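To make the two gates concrete, here is a minimal NumPy sketch of one GRU step in the course's notation. The weight and bias names (Wu, Wr, Wc, bu, br, bc) are illustrative placeholders I am introducing here, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, Wu, Wr, Wc, bu, br, bc):
    """One GRU step (illustrative sketch, random placeholder weights)."""
    concat = np.concatenate([c_prev, x])
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate
    gamma_r = sigmoid(Wr @ concat + br)   # reset gate
    # Candidate state: gamma_r decides how much of c_prev feeds into it.
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)
    # gamma_u blends the candidate with the old state.
    c_new = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_new
```

Note the division of labour: Γ_r acts only inside the candidate c~<t>, while Γ_u decides how much of that candidate replaces c<t-1>.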

As the authors write (p. 3): “As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.”

I hope this clarifies.

So, is Γ_r for short-term dependencies and Γ_u for long-term? Honestly, I still don't fully understand how it works. The two formulas look nearly identical.

The formulas look identical, but they have different weights and serve different functions. I do think these kinds of mathematical implementations of conceptual ideas could benefit from more explicit explanation: how do you capture abstract ideas in a mathematical implementation? Unfortunately I am not able to elaborate on this; if there were a course on it, I would take it. For now, I accept that these mathematical implementations work the way they are described.

The similarity between the second and the third formulas in the screenshot of your first post is that they take the same set of inputs to make two different decisions. The two decisions have been well explained by @reinoudbosch. What I want to add is this: imagine that \Gamma_r is removed; then \tilde{c} has no choice but to use c^{t-1} as is. One way to think of it is whether we want the freedom to use only part of c^{t-1}. Whether that extra freedom helps is best decided by experiment, given sufficient training data to make good use of it. You see, additional freedom may cause overfitting if the data is not enough.

If it turns out that \Gamma_r is unnecessary for a training dataset, then we would expect one of two outcomes: either two versions of the algorithm, one with it and one without, perform similarly, or the weights related to \Gamma_r always push it to 1 or very close to 1.

Lastly, to think about their difference, I would not just look at how they are computed (the second and the third formulas), but at how they are used (the first and the fourth formulas) - that is certainly very different.
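To see that last point numerically: if \Gamma_r is pinned at 1, the reset gate vanishes from the candidate formula, and at 0 the candidate depends on the input alone. A small NumPy check with random placeholder weights (Wc, bc are illustrative names, purely for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_x = 4, 3
c_prev = rng.standard_normal(n_c)
x = rng.standard_normal(n_x)
Wc = rng.standard_normal((n_c, n_c + n_x))
bc = np.zeros(n_c)

def candidate(gamma_r):
    """Candidate state c~ for a given (fixed) reset-gate vector."""
    return np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)

# gamma_r == 1: the reset gate disappears, c_prev is used as is.
no_gate = np.tanh(Wc @ np.concatenate([c_prev, x]) + bc)
assert np.allclose(candidate(np.ones(n_c)), no_gate)

# gamma_r == 0: the candidate ignores c_prev entirely.
only_x = np.tanh(Wc @ np.concatenate([np.zeros(n_c), x]) + bc)
assert np.allclose(candidate(np.zeros(n_c)), only_x)
```

So a network that never needs the reset gate can simply learn to keep \Gamma_r near 1, which is why the two variants can end up performing similarly.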

Cheers,

Raymond