Can someone explain to me what sampling is in a language model, with an example?

And also,

Can someone explain to me forward propagation in an RNN, with an example of 3 words?

Also, please explain: if gamma is very, very close to zero, does that mean the activation values across different layers change very little?

I guess this would make learning difficult.

Please correct me on this.

Hi @Kamal_Nayan

It’s hard to explain GRU forward propagation simply. Here is my attempt at explaining forward propagation in a simple RNN. The calculations for a GRU proceed similarly, but they are more complicated, since there are more weight matrices and more activations.
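The forward pass of a simple RNN over a 3-word input can be sketched in a few lines of NumPy (a minimal sketch: the sizes, the random weights, and the 3 input vectors are all made up for illustration; in a real model the weights are learned and the inputs come from an embedding or one-hot layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the post):
# 3 words, 4-dim input vectors, 5-dim hidden state.
T, d_in, d_h = 3, 4, 5

# Stand-in vectors for a 3-word sentence, e.g. "I love cats"
x = rng.normal(size=(T, d_in))

# Randomly initialised weights (in practice these are learned)
Wxh = rng.normal(size=(d_in, d_h)) * 0.1
Whh = rng.normal(size=(d_h, d_h)) * 0.1
bh = np.zeros(d_h)

h = np.zeros(d_h)                # initial hidden state h_0
for t in range(T):               # one step per word
    # h_t = tanh(x_t @ Wxh + h_{t-1} @ Whh + b)
    h = np.tanh(x[t] @ Wxh + h @ Whh + bh)
    print(f"h_{t+1} =", np.round(h, 3))
```

The key point is that the same weights are reused at every step, and each step's hidden state depends on the previous one, which is how the network carries information forward across the 3 words.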

Your image is a bit confusing, so let me try to explain what happens in this GRU:

Note: The image is taken from this excellent online book and my further example notation (r, z, h etc.) is in line with this picture.

As I was reading it, I implemented a simple (without an embedding layer) **character-level** GRU (the calculations are the same for a *word-level* GRU, just usually with more than 28 targets… ).

So here are the calculations for the *greedy* predictions of the GRU. In this example, after 12 steps (“traveller yo”), the calculations for predicting the 13th character are:

This confidently predicts the character “u” (at index 14, with the highest logit value of 20.4). As you can see, the update gate values are not very high (for the first 30 out of 256 total).

So, now let’s predict the 14th character (which would go after “traveller you”):

Now the model predicts the space character " " with the highest logit value of 18.7 (again, confidently). But now you can see that some of the update gate values are higher (for the first 30 out of 256 total). (You could loosely interpret this as the model understanding that the word has ended.)

Further let’s predict the 15th character (what would go after "traveller you "?):

Now the model predicts the character “c” with the logit value of 12.2. This time the model is not that confident: the character “t” is the next best, with a logit value of 10.2.

Coming back to your original question of “*sampling*”: if the model were not *greedy*, it could toss a coin and choose “c” or “t” or maybe “s”. (We could apply softmax to these logits to get probability-like values and sample accordingly.)
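In code, the difference between greedy decoding and sampling from those logits looks roughly like this (the logits for “c” and “t” are from the example above; the logit for “s” is a made-up value for illustration):

```python
import numpy as np

chars = ["c", "t", "s"]
logits = np.array([12.2, 10.2, 7.5])    # logit for "s" is assumed for illustration

# Softmax turns logits into a probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy: always take the argmax
greedy = chars[int(np.argmax(probs))]   # always "c"

# Sampling: draw a character according to the probabilities
rng = np.random.default_rng(0)
sampled = chars[rng.choice(len(chars), p=probs)]
print(greedy, sampled, np.round(probs, 3))
```

Greedy always returns “c” here; the sampled character is usually “c” too, but sometimes “t” or (rarely) “s”.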

This post was not that straight forward (pun intended), so if you have questions, feel free to ask.

Cheers

Ok, it is a bit confusing.

Can we make it clear in steps?

Please check my understanding of this!

1) GRU units also learn from previous inputs, and to deal with the vanishing gradient problem we introduce a memory cell and an update gate. The update gate helps us remember the value.

2) Sampling basically means getting the probabilities of all the characters that could be at that position and then choosing the best one. And this is done by training our model on our dataset.

3) And about my question on the update gate, please check if the following is correct.

Basically, if our update value is low then c_t ≈ c_{t-1}, so if c is not changing much, then our model is trained with values that were almost the same as in the earlier layers. Am I correct in this understanding?

Well… I think you are correct if I interpret your sentences correctly:

Intuitively, the reset gate controls how much of the previous state we might still want to remember. Likewise, an update gate would allow us to control how much of the new state is just a copy of the old state.

Not quite. There are different sampling techniques (*during inference*). “Greedy” sampling is as you said: choosing the best one. Other sampling techniques have a parameter (usually called temperature) that helps you control how “greedy” you want to be. For example, if you had the probabilities [0.1, 0.2, 0.7], the “greedy” version would always choose the character at the third position (0.7), while the others, depending on the temperature (and other settings), might sample from the distribution [0.1, 0.2, 0.7] or [0.0001, 0.05, 0.9499] or another.
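Temperature can be sketched as a rescaling of the logits before the softmax (a minimal sketch; the starting probabilities are the ones from the example above, treated as logits via their logs):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature -> greedier; higher temperature -> closer to uniform."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Treat the log-probabilities of [0.1, 0.2, 0.7] as the logits
logits = np.log([0.1, 0.2, 0.7])

print(softmax_with_temperature(logits, 1.0))   # recovers [0.1, 0.2, 0.7]
print(softmax_with_temperature(logits, 0.5))   # sharper, roughly [0.02, 0.07, 0.91]
print(softmax_with_temperature(logits, 2.0))   # flatter, closer to uniform
```

In the limit, a temperature approaching 0 recovers greedy decoding (always the argmax), while very high temperatures approach a uniform random choice.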

And usually that (sampling) has nothing to do with training the model on some dataset.

If I understand you correctly - no.

First, the *update* gate value is not c (c in your picture is the **hidden state**, or H_t in my previous illustration); the update gate is z (\Gamma_u in your picture, Z_t in my illustration), which **influences** the *hidden* state (but is not the same thing).

Second, the update gate value is calculated at every step (from previous hidden state and current input).

Third, it depends on how you understand layers - I think you are confusing layers with steps (check this post).

So, if you follow what I wrote, the update value controls how much this **step**’s hidden state (c_t) is a copy of the *previous* hidden state (c_{t-1}).
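The three points above can be sketched in a few lines (a toy NumPy sketch with made-up sizes and random weights; note that conventions differ: in the book I referenced, H_t = Z_t * H_{t-1} + (1 - Z_t) * H~_t, so z close to 1 copies the old state, while the course notation weights the candidate with \Gamma_u instead):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

d_in, d_h = 3, 4                         # toy sizes, chosen for illustration
x_t = rng.normal(size=d_in)              # current input
h_prev = rng.normal(size=d_h)            # previous hidden state H_{t-1}

# The update gate is recomputed at EVERY step from x_t and H_{t-1}
Wxz = rng.normal(size=(d_in, d_h)) * 0.1
Whz = rng.normal(size=(d_h, d_h)) * 0.1
z = sigmoid(x_t @ Wxz + h_prev @ Whz)    # Z_t, each value in (0, 1)

h_tilde = np.tanh(rng.normal(size=d_h))  # candidate state (its own gated formula omitted)

# Blend: when z is close to 1, H_t is almost a copy of H_{t-1}
h_t = z * h_prev + (1 - z) * h_tilde
print(np.round(z, 3), np.round(h_t, 3))
```

So the gate is not the hidden state itself; it is a per-step, per-unit weighting between keeping the old state and adopting the new candidate.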

Cheers

Also, can you explain what the problem is with picking the highest-probability word

and passing it to the next time step?

It will overuse the most probable word.

And what is sampling actually doing? Can you explain with an example?

Sampling introduces randomness into the word selection process and helps us avoid repetitive sequences in the generated text.

For example, if you are trying to generate a sentence like “The sky is blue and the ____ is blue,” and always pick the highest probability word “sky,” you might get outputs like “The sky is blue and the sky is blue.” But with random sampling, you will get “The sky is blue and the ocean is blue” or “The sky is blue and the grass is blue.”

Each time we sample, we might choose a different word, leading to more varied and interesting results.
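A tiny sketch of the difference (the candidate words and their probabilities are made up for illustration):

```python
import random

# Hypothetical next-word distribution for "The sky is blue and the ____ is blue"
candidates = {"sky": 0.5, "ocean": 0.3, "grass": 0.2}

# Greedy: always the same completion
greedy = max(candidates, key=candidates.get)    # always "sky"

# Sampling: completions vary from run to run
words = list(candidates)
weights = list(candidates.values())
samples = [random.choices(words, weights=weights)[0] for _ in range(5)]
print(greedy, samples)
```

Greedy repeats “sky” forever, while the sampled list mixes “sky”, “ocean”, and “grass” in proportion to their probabilities.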

ok Thanks