Sampling and RNNs

GRU forward propagation is hard to explain simply. Here is my attempt to explain forward propagation for a simple RNN. The GRU calculations follow the same pattern but are more complicated, since there are more weight matrices and more activations.
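For reference, a single step of a simple RNN could be sketched roughly like this. It is only a minimal NumPy sketch, not my actual implementation: the names (`rnn_step`, `W_xh`, `W_hh`, `W_hq`) are made up for illustration, the weights are random rather than trained, and the sizes (28 characters, 256 hidden units) are simply the ones used in this post.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b_h, W_hq, b_q):
    """One forward step of a plain RNN for a single one-hot input vector."""
    # New hidden state: mix the current input with the previous hidden state.
    h = np.tanh(x @ W_xh + h_prev @ W_hh + b_h)
    # Output layer: one unnormalised score (logit) per character.
    logits = h @ W_hq + b_q
    return h, logits

vocab_size, hidden_size = 28, 256      # sizes assumed from this post
rng = np.random.default_rng(0)         # random weights, NOT a trained model
W_xh = rng.normal(0, 0.01, (vocab_size, hidden_size))
W_hh = rng.normal(0, 0.01, (hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hq = rng.normal(0, 0.01, (hidden_size, vocab_size))
b_q = np.zeros(vocab_size)

x = np.zeros(vocab_size)
x[3] = 1.0                             # one-hot input; index 3 is arbitrary
h0 = np.zeros(hidden_size)             # initial hidden state
h1, logits = rnn_step(x, h0, W_xh, W_hh, b_h, W_hq, b_q)
```

To process a whole string you would call `rnn_step` once per character, feeding each returned hidden state into the next call.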

Your image is a bit confusing, so let me try to explain what happens in this GRU:

Note: The image is taken from this excellent online book, and the notation in my example (r, z, h, etc.) follows this picture.
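In code, one GRU step with this notation might look something like the sketch below. Treat it as an illustration under assumptions: the parameter names are hypothetical, the weights are random instead of trained, and I am assuming the convention where the new state is `z * h_prev + (1 - z) * h_tilde` (some libraries swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU forward step; p is a dict of (hypothetically named) parameters."""
    # Reset gate r: how much of the old state feeds the candidate state.
    r = sigmoid(x @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])
    # Update gate z: how much of the old state is carried over unchanged.
    z = sigmoid(x @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])
    # Candidate hidden state, computed from the input and the reset old state.
    h_tilde = np.tanh(x @ p["W_xh"] + (r * h_prev) @ p["W_hh"] + p["b_h"])
    # Assumed convention: z close to 1 keeps the old state.
    h = z * h_prev + (1.0 - z) * h_tilde
    # Output layer: one logit per character.
    logits = h @ p["W_hq"] + p["b_q"]
    return h, logits, r, z

vocab_size, hidden_size = 28, 256      # sizes assumed from this post
rng = np.random.default_rng(0)         # random weights, NOT a trained model
params = {}
for gate in ("r", "z", "h"):
    params[f"W_x{gate}"] = rng.normal(0, 0.01, (vocab_size, hidden_size))
    params[f"W_h{gate}"] = rng.normal(0, 0.01, (hidden_size, hidden_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)
params["W_hq"] = rng.normal(0, 0.01, (hidden_size, vocab_size))
params["b_q"] = np.zeros(vocab_size)

x = np.zeros(vocab_size)
x[3] = 1.0                             # one-hot input; index 3 is arbitrary
h0 = np.zeros(hidden_size)
h1, logits, r, z = gru_step(x, h0, params)
```

Compared with a simple RNN there are three sets of input and hidden weights instead of one, which is where the extra weight matrices and activations come from.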

While reading it, I implemented a simple character-level GRU without an embedding layer. The calculations are the same for a word-level GRU, which usually just has more than 28 targets.

So here are the calculations for the GRU's greedy predictions. In this example the model has processed 12 characters, “traveller yo”, and these are the calculations for predicting the 13th character:

The model confidently predicts the character “u” (index 14, with the highest logit value of 20.4). As you can see, the update gate values are not very high (shown for the first 30 of the 256 values).

So, now let’s predict the 14th character (which would go after “traveller you”):

Now the model predicts the space character " " with the highest logit value of 18.7 (again, confidently). This time some of the update gate values are higher (shown for the first 30 of the 256 values). You could loosely interpret this as the model understanding that the word has ended.

Next, let’s predict the 15th character (what would go after "traveller you "?):

Now the model predicts the character “c” with a logit value of 12.2. This time the model is less confident: the character “t” is the next best candidate, with a logit value of 10.2.

Coming back to your original question about “sampling”: if the model were not greedy, it could toss a coin and choose “c” or “t” or maybe “s”. (We could apply softmax to these logits to get probability-like values and sample accordingly.)
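A small sketch of that difference, using NumPy. The logits for “c” (12.2) and “t” (10.2) are the ones from the step above; the value for “s” is made up for illustration, and a real model would apply softmax over all 28 characters rather than just these three.

```python
import numpy as np

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

chars = ["c", "t", "s"]
# "c" and "t" logits come from the post; the "s" logit is a made-up value.
logits = np.array([12.2, 10.2, 9.0])

probs = softmax(logits)

# Greedy decoding: always take the most likely character.
greedy_char = chars[int(np.argmax(probs))]

# Sampling: draw a character with probability proportional to its softmax value.
rng = np.random.default_rng(0)
sampled_char = chars[rng.choice(len(chars), p=probs)]
```

Greedy decoding always picks “c” here, while sampling will pick “t” or “s” some of the time, which is what makes generated text vary between runs.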

This post was not that straightforward (pun intended), so if you have questions, feel free to ask.

Cheers