Course 5 Week 1 - GRU and sampling novel sequences questions

Hi guys, there are two concepts I am having trouble understanding:

  1. The Gated Recurrent Unit (GRU). I know this is basically for solving the vanishing gradient problem and capturing long-range effects; in a way, it uses c and c tilde to mark a word and decide when to update it. But I didn't get the full GRU version: why do we need to add a Gamma_r?

  2. Sampling novel sequences. Using y hat as the input x<2>, isn't that the same as the RNN model from the previous slide? In the RNN model, the y hats are also produced through a softmax prediction, just without random sampling.
So what are the differences between the RNN model and sampling a sequence from a trained RNN? And between training and sampling?

Hey @Lostfinger,
Apologies for the delayed response. As far as I can see in the lecture video, Andrew answers your first query. Gamma_r stands for the relevance gate, and it helps determine how relevant c<t-1> is for computing the candidate c~<t>; he also gives the formulation for computing Gamma_r. Andrew also mentions that these are just different versions of the GRU, and that over the years researchers have experimented with many variants to find the one most suitable for their application. The sketch below shows where the relevance gate fits in.
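
Just to make the formulation concrete, here is a minimal, self-contained NumPy sketch of one full-GRU step following the lecture's notation; the weight matrices, biases, and sizes are random placeholders I chose for illustration, not the course's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_x = 8, 4                            # memory-cell size, input size (illustrative)
Wc, Wu, Wr = (rng.normal(size=(n_c, n_c + n_x)) for _ in range(3))
bc, bu, br = (np.zeros(n_c) for _ in range(3))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t):
    concat = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)                # relevance gate: how relevant is c<t-1>?
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate c~<t>
    gamma_u = sigmoid(Wu @ concat + bu)                # update gate: how much of the candidate to keep
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev  # new memory cell c<t>

c_t = gru_step(np.zeros(n_c), rng.normal(size=n_x))
```

Without Gamma_r, the candidate would always see the raw c<t-1>; the relevance gate lets the unit learn to down-weight the old memory when computing the candidate.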

As for your second query, I am assuming that you want to ask “What is the difference between training an RNN model and sampling from a trained RNN model?”.

Consider that we have an untrained RNN model, and we are training it with the help of a simple example, “I like music EOS”. For this, we feed a zero vector as x_1 into the first cell and try to make the model predict “I”. Then, in the second cell, we feed in “I” as x_2. Note that it doesn't matter what the first cell predicted: during training we always feed in the correct words as the x_i(s), since we have the training data (i.e., the entire sentences). In other words, x_2 may or may not be equal to y_hat_1, as the sketch below shows.
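
Here is a minimal, self-contained sketch of that training-time input pattern; the tiny vocabulary, weights, and helper functions are illustrative placeholders, not the course's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "like": 1, "music": 2, "EOS": 3}
V, H = len(vocab), 8                       # vocab size, hidden size (illustrative)
Wax, Waa, Wya = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(V, H))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def rnn_step(a_prev, x_t):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t)
    logits = Wya @ a_t
    return a_t, np.exp(logits) / np.exp(logits).sum()  # softmax prediction y_hat

sentence = ["I", "like", "music", "EOS"]
a_t, x_t = np.zeros(H), np.zeros(V)        # a_0 and x_1 are both zero vectors
for word in sentence:
    a_t, y_hat = rnn_step(a_t, x_t)
    loss = -np.log(y_hat[vocab[word]])     # compare y_hat to the CORRECT word
    x_t = one_hot(vocab[word], V)          # next input is the correct word,
                                           # no matter what the model predicted
```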

But if we consider the sampling phase, in that case, x_2 will be equal to y_hat_1. So, if the first cell predicted “I”, i.e., the correct word, we feed “I” into the second cell; and if the first cell predicted some other word, we feed that other word into the second cell instead. This happens for 2 important reasons. First, we don't have the entire sentences (since this is the inference phase), and second, we don't know what to produce (i.e., we don't know what the correct sentence is that we are trying to produce). That is why it is called Sampling: we are trying to generate novel data from a trained RNN model.

Just to make this answer complete: in the explanation above, I said that during sampling, x_2 = y_hat_1. However, the softmax layer produces a probability distribution, so instead of always choosing the word with the maximum probability, we sample a word according to the distribution generated by the softmax layer, which is another reason to refer to this process as Sampling. This gives us stochastic behaviour and makes sure that our model doesn't always generate the same output. The sketch below shows this sampling loop. I hope this helps.
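
Here is a minimal, self-contained sketch of the sampling loop; the tiny RNN and vocabulary below are illustrative placeholders, not the course's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "like", "music", "EOS"]
V, H = len(vocab), 8                       # vocab size, hidden size (illustrative)
Wax, Waa, Wya = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(V, H))

def rnn_step(a_prev, x_t):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t)
    logits = Wya @ a_t
    return a_t, np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

a_t, x_t, generated = np.zeros(H), np.zeros(V), []
while True:
    a_t, y_hat = rnn_step(a_t, x_t)
    idx = rng.choice(V, p=y_hat)           # sample from the distribution,
                                           # not argmax: stochastic behaviour
    generated.append(vocab[idx])
    if vocab[idx] == "EOS" or len(generated) > 20:
        break
    x_t = np.zeros(V)
    x_t[idx] = 1.0                         # the sampled word becomes the next input

print(" ".join(generated))
```

Note how the sampled word, not a ground-truth word, is fed back in as the next input; that is the key difference from the training loop above.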

Regards,
Elemento