C5 W1 A2 UNQ_C4 shouldn't update a_prev

In the template where we train the model we have a prepared line

        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev =

So a_prev is updated on each iteration, which means that for the next word we pass in the a_prev left over from the previous dino name. Should we do that? I would expect a_prev to be zeros at the start of every word. So we should not update a_prev, but always pass a_prev = np.zeros((n_a, 1)).
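To make the two options concrete, here is a minimal sketch of what each one feeds in as the initial hidden state. It is not the course code: the helper names (`rnn_forward_last_state`, `initial_states`), the tiny sizes and the random weights are all made up for illustration, and it uses a plain tanh RNN cell with no training step at all.

```python
import numpy as np

def rnn_forward_last_state(x_indices, a0, params):
    """Run a vanilla tanh RNN over character indices and return the final hidden state."""
    Wax, Waa, b = params["Wax"], params["Waa"], params["b"]
    vocab_size = Wax.shape[1]
    a = a0
    for idx in x_indices:
        x = np.zeros((vocab_size, 1))  # one-hot encoding of the current character
        x[idx] = 1.0
        a = np.tanh(Wax @ x + Waa @ a + b)
    return a

def initial_states(examples, params, n_a, stateful):
    """Collect the hidden state that each example would start from."""
    a_prev = np.zeros((n_a, 1))
    starts = []
    for x_indices in examples:
        if not stateful:
            # Option proposed in this post: always start from zeros.
            a_prev = np.zeros((n_a, 1))
        starts.append(a_prev.copy())
        # With stateful=True this final state is carried into the next example.
        a_prev = rnn_forward_last_state(x_indices, a_prev, params)
    return starts

# Hypothetical toy setup (assumed sizes, random weights).
rng = np.random.default_rng(0)
n_a, vocab_size = 4, 27
params = {
    "Wax": rng.normal(scale=0.1, size=(n_a, vocab_size)),
    "Waa": rng.normal(scale=0.1, size=(n_a, n_a)),
    "b": np.zeros((n_a, 1)),
}
examples = [[1, 2, 3], [4, 5], [6, 7, 8]]  # three "names" as character indices

stateless_starts = initial_states(examples, params, n_a, stateful=False)
stateful_starts = initial_states(examples, params, n_a, stateful=True)
```

With `stateful=False` every name starts from a zero vector; with `stateful=True` only the first name does, and each later name starts from whatever state the previous name ended in.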

a_prev represents the hidden state of the LSTM.
If you don’t update it on every optimization step, it won’t learn very much.

But learning is not about improving a_prev, it is about evolving the matrices (Waa, Wax, Wya…) and the biases. Imagine you have trained the model and then hand it over to me. You will only pass the matrices and biases. You won't give me a special a_prev to start generation, so I'll pass all zeros as a_prev.

And in practice it is not true that learning doesn't work without evolving a_prev.
If I use the code from the course:
Iteration: 22000, Loss: 22.728886
But if I always pass zeros as the initial a_prev, I get:
Iteration: 22000, Loss: 21.885259
So it learns better.

I think it is a mistake in the course template code. It should be
curr_loss, gradients, a_which_would_not_be_used =
And the answer check should be updated accordingly (to Trocnmoraurus)

The difference in those loss values is not significant.

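Given that the starting point was
Iteration: 0, Loss: 23.087336
I can't say the difference between 22.7 and 21.9 isn't significant: the total improvement from the start is only about 0.36 with the course code versus about 1.20 with zeros.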

But still, the question is not about these particular figures. My question is about best practices. I found this in the lecture https://www.coursera.org/learn/nlp-sequence-models/lecture/ftkzt/recurrent-neural-network-model (transcript at 4:42):

we’ll also have some either made-up activation at time zero, this is usually the vector of zeros.
Some researchers will initialize a_zero randomly. You have other ways to initialize a_zero but really having a vector of zeros as the fake time zero activation is the most common choice

So zero activation is the most common choice. Random activation is also mentioned. Passing the final state from the previous sequence isn't mentioned. Is it a known practice that just didn't work well in this particular case? Or is it not good practice in general and simply a bug in the code?

Good catch @Mednikov_Leonid!

You have just had your first encounter with the so-called stateful LSTM :flying_saucer:. I don’t think Prof Andrew Ng talked about the LSTM cell during optimization (in the lectures). The hidden state resets in the stateless LSTM at each optimization step. As you have noticed already, a stateful RNN (LSTM, GRU, etc.) saves the last hidden state and uses it as the initial state for the next batch, which in this case is of size 1.

Stateful RNNs are especially useful in time-series forecasting, where we cut up the time series into slices and have dependencies between these slices. We want to capture this somehow, and we can do that by not resetting the hidden state. It may or may not work great for language modeling. How do we know? As always, we treat it as any other hyperparameter and compare the performance on the validation data set.

In this exercise, we have not calculated the accuracy. You should probably implement that as a bonus exercise in order to compare the two methodologies. When you are done, please, share your findings. Which method performed the best for the Dino naming task? Stateful or stateless LSTM? :cowboy_hat_face:

OK. I did a train/test split with 11 different random seeds and computed the average quality every 2000 steps, both with zeros as a_prev and with the carried-over (non-zero) a_prev. The results are below.
If we just take the quality at the last iteration for each random seed, we see that zeros a_prev is better in most cases.
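For anyone who wants to repeat this kind of comparison, the bookkeeping could look roughly like the sketch below. It is not the exact script used here: `train_test_split`, `count_wins`, the 20% test fraction and the toy `names` list are assumptions for illustration, and the training itself is left out.

```python
import numpy as np

def train_test_split(names, seed, test_fraction=0.2):
    """Shuffle the names with a seeded generator and split off a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(names))
    n_test = int(round(len(names) * test_fraction))
    test = [names[i] for i in order[:n_test]]
    train = [names[i] for i in order[n_test:]]
    return train, test

def count_wins(loss_zero_init, loss_carried):
    """Count the seeds on which zero initialization gave the lower final test loss."""
    loss_zero_init = np.asarray(loss_zero_init, dtype=float)
    loss_carried = np.asarray(loss_carried, dtype=float)
    return int(np.sum(loss_zero_init < loss_carried))

# Hypothetical stand-in for the dinosaur name list.
names = [f"name{i}" for i in range(10)]
train, test = train_test_split(names, seed=0)
```

The idea is to call `train_test_split` once per seed, train both variants on `train`, record the final loss of each on `test`, and then pass the two lists of per-seed losses to `count_wins`.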

As for the theory: your explanation describes cases where the state is logically useful. If we were generating a sequence of names, we would have to pass a_prev from the previous word to the next one, with no other option. But here the names are independent, and it is hard to see how the state from the previous name (which describes the end of a word) would be relevant for generating the beginning of a new name.