Language Model and Sequence Generator - Using y as input instead of y_hat

Hi,

In the “Language Model and Sequence Generator” video we generate y_hat_t based on a_(t-1) and y_(t-1), i.e., the actual ground-truth word and not the previously generated one.

However, in the next video, when we are sampling sequences, we generate y_hat_t based on a_(t-1) and y_hat_(t-1). If we were to use the model mostly to generate sequences, wouldn’t it be better to also train in such a scenario, where we rely on the previously generated words y_hat? For example, by generating y_hat_t using y_(t-1), but then generating y_hat_(t+1) using y_hat_t, y_hat_(t+2) using y_hat_(t+1), etc., for, say, n generated words. Then move forward to generate y_hat_(t+1) using y_t, and again regenerate the next n words based on the newly generated words. Would that make sense?

What you propose would be a mixture of actual data with predicted data to generate the next predicted data.

For training, you want to feed the model some actual inputs, have it generate a predicted output, and compare the prediction with the actual ground truth. The difference between the prediction and the ground truth is the ‘loss’ that we eventually need to reduce or eliminate.

If we were going to use the prediction as part of the input for the next cycle, then the new predicted output is not comparable with the ground truth of that new cycle, right?

Let me explain with an example:

Let’s say we have input 1 = [1, 2, 3, 4] and the ground truth for this input is y = 5.
We train the model with input 1 and the predicted output is y_hat = 8. We compare it with y = 5, the ground truth, and find that the loss is 8 - 5 = 3. We backpropagate, adjust the weights, etc., and get ready for the next cycle.

Now we are going to do a new cycle. In the original version we would go like:

input 2 = [2, 3, 4, 5] and y = 6

But with your proposed idea we would go like: input 2 = [2, 3, 4, 8] and y = 6.

You see where this is going?

Please let me know what you think.

Thanks,

Juan

Thank you for your reply! I’m not sure I understand the example; what do you mean by “cycle”?

Let me clarify what I meant with another example. Suppose we are building a model that generates “interesting” sequences of numbers.

We train on the input sequence: [2, 4, 8, 16]

With the method suggested in the video we have:
y_hat_1 = f(0, 0)
y_hat_2 = f(a1, 2)
y_hat_3 = f(a2, 4)
y_hat_4 = f(a3, 8)

Then we compute the loss based on:
L(y_hat_1, 2) + L(y_hat_2, 4) + L(y_hat_3, 8) + L(y_hat_4, 16),
and find the weights that minimize the loss.
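
To make this concrete, here is a minimal NumPy sketch of the teacher-forced pass as I understand it from the video (the weight names, the squared-error loss, and the scalar inputs are my own assumptions for illustration):

```python
import numpy as np

def rnn_step(a_prev, x, Wa, Wx, Wy, ba, by):
    # One RNN cell: new activation from the previous activation and the
    # current input, then a prediction from the new activation.
    a = np.tanh(Wa @ a_prev + Wx @ x + ba)
    y_hat = Wy @ a + by
    return a, y_hat

def loss_teacher_forced(seq, params):
    # seq = [2, 4, 8, 16]; at step t the *ground-truth* previous number
    # is fed in (teacher forcing), with 0 as the input at t = 1.
    Wa, Wx, Wy, ba, by = params
    a = np.zeros_like(ba)
    x = np.zeros(1)                            # y_0 = 0, as in y_hat_1 = f(0, 0)
    loss = 0.0
    for y_true in seq:
        a, y_hat = rnn_step(a, x, Wa, Wx, Wy, ba, by)
        loss += (y_hat.item() - y_true) ** 2   # one loss term per step
        x = np.array([float(y_true)])          # next input: the ground truth
    return loss
```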

What I was wondering is whether we could let the algorithm rely also on generated numbers. We could for example say that we generate the next number based on the ground truth, and the next-next number based on the generated next number (we could also generate the next-next-next and so on). For example like this (I’m using the notation y_hat_x_y, where x is the index in the sequence and y is the distance from the last ground-truth element used as input):
y_hat_1_0 = f(0, 0)
y_hat_2_1 = f(a1_0, y_hat_1_0)
y_hat_2_0 = f(a1_0, 2)
y_hat_3_1 = f(a2_0, y_hat_2_0)
y_hat_3_0 = f(a2_0, 4)
y_hat_4_1 = f(a3_0, y_hat_3_0)
y_hat_4_0 = f(a3_0, 8)

Then we compute the loss as:
L(y_hat_1_0, 2) + L(y_hat_2_1, 4) + L(y_hat_2_0, 4) + L(y_hat_3_1, 8) + L(y_hat_3_0, 8) + L(y_hat_4_1, 16) + L(y_hat_4_0, 16),
and find the weights that minimize this loss.
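
In code (reusing rnn_step from my sketch above, and again with my own assumed loss), the depth-1 version would add one prediction-fed term per step:

```python
def loss_mixed(seq, params):
    # Like loss_teacher_forced, but from step 2 onward the model also
    # predicts from its own previous teacher-forced prediction
    # (y_hat_t_1), in addition to predicting from the ground truth
    # (y_hat_t_0); both are compared to the same target.
    Wa, Wx, Wy, ba, by = params
    a = np.zeros_like(ba)
    x_true = np.zeros(1)        # ground-truth input, y_0 = 0
    y_hat_prev = None           # previous teacher-forced prediction
    loss = 0.0
    for y_true in seq:
        if y_hat_prev is not None:
            # prediction-fed branch: same a_(t-1), input y_hat_(t-1)_0
            _, y_hat_fed = rnn_step(a, y_hat_prev, Wa, Wx, Wy, ba, by)
            loss += (y_hat_fed.item() - y_true) ** 2
        # teacher-forced branch: input is the true previous number
        a, y_hat_prev = rnn_step(a, x_true, Wa, Wx, Wy, ba, by)
        loss += (y_hat_prev.item() - y_true) ** 2
        x_true = np.array([float(y_true)])
    return loss
```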

Not sure if you see what I mean?

Thank you for your reply.

When we are training a model, there is first a forward propagation followed by a backward propagation, and this is repeated for a number of epochs. If you define batches, which is pretty standard, then each epoch also goes through n batches.

So I call each forward + backward prop a “cycle”.
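
Just to pin down the terms, here is a toy runnable sketch with a made-up linear model (all numbers and names are hypothetical):

```python
import numpy as np

# One "cycle" = one forward prop + one backward prop on one sample.
X = np.array([[1., 2., 3., 4.], [2., 3., 4., 5.]])  # two training samples
y = np.array([5., 6.])                              # their ground truths
w = np.zeros(4)                                     # linear model weights
lr = 0.01                                           # learning rate

for epoch in range(3):                # repeated for a number of epochs
    for Xb, yb in zip(X, y):          # each epoch goes through the samples
        y_hat = Xb @ w                # forward prop: predict
        loss = (y_hat - yb) ** 2      # compare prediction with ground truth
        grad = 2 * (y_hat - yb) * Xb  # backward prop: gradient of the loss
        w -= lr * grad                # adjust the weights; one cycle done
```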

Let’s create yet another example: predicting the temperature in the next hour. To train this model, we have to gather actual data from the past and create a dataset to train our model. Let’s say I get the following data:

The temperature in Fahrenheit at each hour:

1:00AM: 65
2:00AM: 66
3:00AM: 67
4:00AM: 68
5:00AM: 69
6:00AM: 70
7:00AM: 71
8:00AM: 72
9:00AM: 73

This is going to be all the data needed to train my time-series model. Now I will organize this data to build a training set. I will define X_train to have size = 5, so that with 5 temperatures I can predict the 6th temperature. I will use a ‘sliding window’ to create the different samples.

This will be then my X_train and y_train:

X_train_1=[65, 66, 67, 68, 69] y_train_1=[70]
X_train_2=[66, 67, 68, 69, 70] y_train_2=[71]
X_train_3=[67, 68, 69, 70, 71] y_train_3=[72]
X_train_4=[68, 69, 70, 71, 72] y_train_4=[73]
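
The sliding window itself is just a couple of lines; here is a sketch in plain Python (the list names are mine):

```python
# Build the (X, y) training pairs with a sliding window of size 5.
temps = [65, 66, 67, 68, 69, 70, 71, 72, 73]   # the hourly temperatures above
window = 5

X_train = [temps[i:i + window] for i in range(len(temps) - window)]
y_train = [temps[i + window] for i in range(len(temps) - window)]
# X_train[0] == [65, 66, 67, 68, 69], y_train[0] == 70, and so on.
```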

During training, we want to give the model the ‘truth’, have the model predict, and compare the prediction with the ‘truth’.

We start training. Let me run a training with hypothetical predictions. Remember: each cycle means one forward prop and one backward prop, and in each cycle we use one of the samples above:

Cycle 1: Input = X_train_1. y_hat_1 (predicted by the model): 89. y_train_1 = 70. Loss: 19
Cycle 2: Input = X_train_2. y_hat_2 (predicted by the model): 82. y_train_2 = 71. Loss: 11
Cycle 3: Input = X_train_3. y_hat_3 (predicted by the model): 75. y_train_3 = 72. Loss: 3
Cycle 4: Input = X_train_4. y_hat_4 (predicted by the model): 74. y_train_4 = 73. Loss: 1

See how the loss has been getting smaller and smaller, cycle after cycle? The model is learning to predict the next temperature when given the past 5 temperatures. This is possible because we are calculating the loss of the predicted value against ‘actual’ data (the ground truth).

Now let’s pretend that instead of using the ‘actual’ temperatures we use the ‘predicted’ temperatures:

Cycle 1: We use the first sample, X_train_1 and y_train_1.
X_train_1 = [65, 66, 67, 68, 69], y_train_1 = [70]
Input = X_train_1. y_hat_1 (predicted by the model): 89. y_train_1 = 70. Loss: 19

Cycle 2: We take X_train_2 but replace the last entry with the previous prediction:
X_train_2 = [66, 67, 68, 69, 89], y_train_2 = [71]
Input = X_train_2. y_hat_2 (predicted by the model): 94. y_train_2 = 71. Loss: 23

And let’s stop this new simulation here, because the model is already lost. By using the temperature generated in Cycle 1 as input for Cycle 2, we immediately move away from the actual data, and the next prediction will be based on inaccurate data.
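
In code, the substitution looks like this (reusing X_train from the sliding-window sketch above; the 89 is the hypothetical prediction from Cycle 1):

```python
# Cycle 2 input with the last true temperature (70) replaced by the
# previous cycle's prediction (89): garbage in, garbage out.
x2 = list(X_train[1])   # [66, 67, 68, 69, 70]
x2[-1] = 89             # substitute y_hat_1 for the ground truth
# The model now predicts from [66, 67, 68, 69, 89].
```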

In my example I am using X_train with size 5, but the same would happen if the size were 1, as in your example above.

Please let me know if this is a bit more clear.

Thank you for your reply.

To clarify, I’m not suggesting to use the predicted temperatures in the loss function. I’m suggesting to use them as input to the RNN. So in the example I wrote above the loss is:
L(y_hat_1_0, 2) + L(y_hat_2_1, 4) + L(y_hat_2_0, 4) + L(y_hat_3_1, 8) + L(y_hat_3_0, 8) + L(y_hat_4_1, 16) + L(y_hat_4_0, 16),
The right-hand values here are always from the ground truth.

Where I am suggesting to use the predicted values is in the input of the RNN. So we predict the second number in the sequence once based on the previous activation and the previous actual number, y_hat_2_0 = f(a1_0, 2), and once based on the previous activation and the previous predicted number, y_hat_2_1 = f(a1_0, y_hat_1_0). The loss then will account for both of those, and do so by comparing each to the ground-truth number y_2.

Do you see what I mean?

Hi Matteo,

Yes, that is clear to me. You are suggesting to use the prediction as part of the next input.

My point here is, following the example of the temperature prediction: since we are training a model, and the model knows nothing to begin with, the first predicted temperature can be crazy. For instance, the first predicted temperature could be 100 F, while the actual ground truth was, say, 70. If, for the next prediction, you use the 100, what will happen with the 2nd prediction? It will be even crazier.

Yes, at training epoch 0, for sure. Just like the predictions will be crazy in the original formulation, given that we initialize W at random. But when we minimize the loss as defined above (where both predictions are compared to the ground truth), I think we will try to find the weights that make the first prediction less crazy and the second prediction also less crazy.

What got me wondering is the way we would use the model in my example if we wanted to generate “interesting” sequences of numbers. What we could do is get a seed input number from the user, say y_1, and feed it into the RNN to get y_hat_2. Then, to get y_hat_3, we would input y_hat_2 into the RNN, and so on until we reach the desired sequence length. So at “serving” time we would base the generation of the next number on the previously generated number. So I wonder if it would help to mirror this process at training time too?
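
Here is what that serving loop would look like as a sketch (reusing rnn_step from my first snippet; the seed handling is my own assumption):

```python
def generate(seed, length, params):
    # Autoregressive generation at "serving" time: after the seed,
    # each new number is produced from the model's own previous output.
    Wa, Wx, Wy, ba, by = params
    a = np.zeros_like(ba)
    x = np.array([float(seed)])    # user-provided seed, y_1
    out = [seed]
    for _ in range(length - 1):
        a, y_hat = rnn_step(a, x, Wa, Wx, Wy, ba, by)
        out.append(y_hat.item())
        x = y_hat                  # feed the prediction back as input
    return out
```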

In the case of a model to generate ‘interesting’ numbers, where there is not a previous ground truth, I guess your idea works.

But for the case of temperature, I don’t see how it can work. If I use the 1st prediction as part of the input for the next prediction, the loss will not be real. It will be a loss calculated from 2 predictions.

I think the idea proposed here will create a positive feedback loop, which is going to be wildly unstable during training (when the goal is minimizing the cost).

> In the case of a model to generate ‘interesting’ numbers, where there is not a previous ground truth, I guess your idea works.

Yes, I was mainly thinking about use cases where we want to generate entire sequences, for example text or music.

> But for the case of temperature, I don’t see how it can work. If I use the 1st prediction as part of the input for the next prediction, the loss will not be real. It will be a loss calculated from 2 predictions.

I agree it doesn’t make much sense for that example. The last sentence confuses me again, though: note that I’m never suggesting to compute a loss term based on predictions alone; one element will always be the ground truth.

> I think the idea proposed here will create a positive feedback loop

Thank you for your reply. What is a positive feedback loop in this case?

A positive feedback loop. Let me give you an example: pretend you are driving a car. Suddenly the car starts drifting to the left. When you steer the wheel to the right to correct the direction, you are applying negative feedback, meaning you are acting against the trend of the car.

With positive feedback, you would steer further to the left. You would be acting not against, but in the same direction (left) as the car.

Does it make sense?

In the topic’s scenario, if you use the prediction to predict, that can be seen as a positive feedback loop.

I see. Why do you think we would have a positive feedback loop here? I would think that in one epoch of training we would push the weights towards making both the predictions generated by inputting the ground truth and the predictions generated by inputting the other predictions closer to the ground truth.

Thanks!

The first predicted value would be used as part of the input for the second cycle, correct?

If this is correct then it is like saying: yesterday at 6pm the actual temperature was 65 degrees but I will tell the model in the input that it was 85 degrees.

The model takes this input and predicts a new output, for instance: y_hat = 110 degrees.

Even if you use the truth to validate the prediction, if you now use 110 degrees as input for the next cycle, the model will continue to be unreliable.