Sampling process and loss function in sequence models

Hello,
I have just finished the 2nd assignment of the 1st week, “Dinosaurus_Island_Character_level_language_model”, and there are 2 things I am confused about:

  1. During the forward prop in training, we have the cross entropy loss, but the loss of what exactly? I understand that y_pred is a softmax output whose highest probability corresponds to the most likely next letter. But is it just measuring the difference between the “real” letter that should normally occur and the “predicted” letter? In other words, what exactly is this the loss of?

  2. When we add the sampling process during the training phase in order to get new words, this is where I get confused: we are not predicting the most likely next letter anymore, because we randomly select another letter instead. So what are we able to learn if we let the random process do the job every time? How can we get dinosaur names that look “authentic” here?

Thank you for helping me clear up my current confusion.

When training, cross entropy loss encourages the model to follow patterns in naming dinosaurs. So, if the character s comes after the letter u, we want the model to capture that pattern. There is no sampling involved in this stage.

The sampling process introduces some randomness at inference time so that the model can get creative in generating new names, while still following the learnt distribution over the next character at each step.


Hello,

Thank you for your previous response. I will add some further information to make my point clearer, since I did not develop it enough previously. The 2 parts are discussed independently here.

  1. LOSS FUNCTION
    I know that sampling is not involved in the cost function during training. Specifically, we want to capture/learn the patterns that exist among the letters, ok. How is the cost function able to tell how good our “pattern understanding” is? What kind of distance does the cross entropy measure in this case (if it is a distance at all, which I am not sure about either)?

  2. SAMPLING PROCESS
    If I am following the probability distribution, I would take the next letter that is the most likely to occur under that distribution. When I involve the random selection, what am I doing exactly such that the names at the end still look realistic? Is it that, instead of taking the most likely next letter, I am using the 2nd or 3rd most likely letter?

Thank you

Part 1
Cross-entropy loss was introduced in course 1, week 2, assignment 2 for the 2-class case. We can extend binary cross-entropy to support multiple classes like this, for each sample in the batch:
Loss = -\sum_i y_i \log(\hat{y}_i)
where
i indexes the classes, y_i is 1 for the actual class and 0 otherwise, and \hat{y}_i is the predicted probability for class i.

As you might observe from the above expression, when the model predicts a low probability for the actual class, log(low probability) will be a large negative value (since \hat{y} \lt 1). The leading negative sign makes the overall loss positive. The loss is minimized when the correct class is predicted with probability 1 (log(1) = 0).
With this background, we can see that at each timestep, using cross-entropy loss encourages the model to assign a high probability to the correct next character for the current input character. This helps capture the patterns in pairs of adjacent characters (and, since the RNN carries a hidden state, in the characters that came before them).
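
To make the formula concrete, here is a minimal sketch in plain Python. The 3-character vocabulary and both predicted distributions are made up for illustration; they are not taken from the assignment.

```python
import math

def cross_entropy(y_true, y_pred):
    """Loss = -sum_i y_i * log(y_hat_i) for one sample.

    y_true is a one-hot list, y_pred a list of predicted probabilities.
    Terms with y_i == 0 contribute nothing, so they are skipped.
    """
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Hypothetical vocabulary: index 0 = 'a', 1 = 's', 2 = 'u'.
y_true = [0.0, 1.0, 0.0]         # the real next character is 's'
confident = [0.05, 0.90, 0.05]   # model puts most probability on 's'
unsure = [0.40, 0.20, 0.40]      # model puts little probability on 's'

loss_confident = cross_entropy(y_true, confident)  # equals -log(0.90)
loss_unsure = cross_entropy(y_true, unsure)        # equals -log(0.20)

print(loss_confident, loss_unsure)
```

Because y is one-hot, the sum reduces to -log of the probability given to the actual next character, so the second prediction is penalized more than the first.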

Part 2
You are on the right track.

There are 2 ways of using this kind of a generative model:

  1. One way is to ask the model to predict probabilities for the next character given the input character and use them as is, i.e. always take the most likely character. If the model is trained well, we should get names that are in the dataset.
  2. When we want the model to generate new dinosaur names, randomness is introduced: the next character is drawn at random according to the predicted distribution, so likely characters are still picked most often and unlikely ones only rarely. This part is covered in the assignment.
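
The two options above can be sketched as follows. This is an illustration only: the vocabulary and the probabilities are invented, and the helper names `greedy_pick` and `sample_pick` are hypothetical, not functions from the assignment.

```python
import random

vocab = ["a", "s", "u"]
probs = [0.1, 0.7, 0.2]  # hypothetical softmax output for the next character

def greedy_pick(vocab, probs):
    """Option 1: always return the most likely character."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]

def sample_pick(vocab, probs, rng):
    """Option 2: draw one character at random, weighted by probs."""
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_pick(vocab, probs, rng) for _ in range(10_000)]
freq = {ch: draws.count(ch) / len(draws) for ch in vocab}

print(greedy_pick(vocab, probs))
print(freq)
```

Over many draws the observed frequencies stay close to the predicted probabilities, which is why sampled names still look realistic: the sampling happens only at inference time and does not alter what the model has learnt.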

Thank you for your previous reply.

Fine for part 1.

For part 2 there is still a question in my mind: the cross entropy pushes my model to predict the most likely next character given my current one. From my dataset, I will learn the global patterns that exist between adjacent characters. Ok, good for this.
Now, during inference, I am adding randomness by sampling from my current probability distribution for the next character. In this case, I am not choosing the most likely one anymore, but still, after many iterations I finally get realistic names. How is this possible? I am “blocking” the model from making the best choice for the next character again and again, so I am disturbing the entire training process. I do not understand how the model is actually able to converge to a good end result.

Adding a temperature factor (see this short course) to the predicted outcomes is done to get the model to generate novel outputs within acceptable parameters. The trained model will emit outcomes that aren’t completely random.
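
As a rough sketch of what a temperature factor does, assuming it is applied by dividing the log-probabilities by a temperature T and renormalizing (equivalent to dividing the logits by T before the softmax). The function name `apply_temperature` and the example probabilities are hypothetical.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it.

    Assumes every probability is strictly positive and temperature > 0.
    """
    scaled = [math.log(p) / temperature for p in probs]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.1, 0.7, 0.2]                   # hypothetical model output
sharp = apply_temperature(probs, 0.5)     # closer to always picking the top character
flat = apply_temperature(probs, 2.0)      # closer to uniform, more novel names
same = apply_temperature(probs, 1.0)      # unchanged distribution

print(sharp, flat, same)
```

Sampling from the rescaled distribution then trades off realism (low T) against novelty (high T), while the ranking of characters learnt during training is preserved.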

You might also want to check out Generative Adversarial Networks (GANs) to get a better idea of how one trains and uses models for generative tasks.