Dinosaurus_Island_Character_level_language_model exercise 4 (model)

Hi, I have a doubt about this exercise. I wonder why, during training, a_prev is initialized to a zero vector only for the first dinosaur name, while in every subsequent iteration of stochastic gradient descent the a_prev that is used is the last one computed. Yet Andrew explained in his lessons that a0 must be a zero vector.

I don't know if I managed to explain myself, so here is what I would do: perform one iteration of stochastic gradient descent on the first sample with a_prev as a zero vector and update the model parameters; for the second sample, reset a_prev to the zero vector and proceed as for the first; and so on. I don't know if this is the correct way to proceed, but it seems the most sensible to me, because when we apply the trained model in practice, a0 is a zero vector, not something else.

When I run forward propagation on a sample, I set a0 to the zero vector and then compute the loss. In the exercise code, by contrast, when working with the second sample the last a_prev computed in the previous iteration is passed as input, and forward propagation and the loss computation follow from there. How can that loss be correct if, instead of a0 = zero vector, it used a0 = some non-zero a_prev? Thanks

Just for fun I tried updating the code to see what difference it made: I added a line that reinitializes a_prev to a zero vector before each name. With the same number of iterations, the model reaches a lower loss than the original version. However, I don't know if this is the correct way to proceed. Thanks
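For concreteness, here is a minimal runnable sketch of the change I mean. The loop structure and names (optimize, a_prev, n_a) follow the notebook, but the body of optimize() below is just a placeholder stand-in so the snippet runs on its own:

```python
import numpy as np

n_a = 50                                      # hidden-state size (assumption)
examples = ["tyrannosaurus", "velociraptor"]  # toy stand-in for the dataset
chars = sorted(set("".join(examples) + "\n"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

def optimize(X, Y, a_prev, parameters):
    # Placeholder for the notebook's optimize(): one step of forward prop,
    # loss, backprop, and SGD on a single name. Here it only threads the
    # hidden state through so the sketch is self-contained.
    loss, gradients = 0.0, {}
    a_next = np.tanh(a_prev)
    return loss, gradients, a_next

parameters = {}                    # placeholder for Wax, Waa, Wya, b, by
a_prev = np.zeros((n_a, 1))        # a0 for the very first name

for j in range(10):
    X = [None] + [char_to_ix[ch] for ch in examples[j % len(examples)]]
    Y = X[1:] + [char_to_ix["\n"]]

    # The line I added: reset the hidden state before every name, since the
    # names are independent sequences. The original notebook omits this and
    # carries a_prev over from the previous name.
    a_prev = np.zeros((n_a, 1))

    loss, gradients, a_prev = optimize(X, Y, a_prev, parameters)
```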

In the beginning there is no a_prev because the computation has just started; then on the next iteration there is an a_prev, so it needs to be used because it's part of the inputs, as you can see from the optimize function!

Hmm, I don't think the previous a_prev should be carried over, since the samples are independent of each other; there is no temporal dependence between the first dinosaur name and the second. I would agree with you if the samples had a sequential, temporal relationship, but that is not the case here. Anyway, I'm new to this, so I could be wrong.

As @gent.spah suggested, in the context of this exercise the hidden state a_prev (initialized as a zero vector for the very first time step) is an important part of the model's memory during training with Recurrent Neural Networks (RNNs), since it carries information from previous time steps across the sequence. So it's important to pass the updated a_prev to each subsequent step, rather than resetting it to a zero vector.
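For reference, within a single name the hidden state evolves with the standard RNN recurrence from the course, so each step genuinely depends on the previous one:

$$a^{\langle t \rangle} = \tanh\!\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right)$$

The open question is only what $a^{\langle 0 \rangle}$ should be at the start of each new name, not whether a_prev matters within a name.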

However, your approach is correct! In this case, each sample (name) is separate. So, reinitializing a_prev for each new sequence of dinosaur names will likely result in better generalization because the RNN can focus on modeling each name without being biased by the previous one. The lower loss suggests that this change helped the model converge more effectively.

Keep an eye on other metrics, such as the quality of dinosaur names generated (in addition to loss), to see if the improvement in loss translates into better overall performance.


OK, thanks a lot for the clarification; that's what I thought. I don't know much about dinosaur names, but the fact that the output names end with the -osaurus suffix makes me think the model is working well. It would be useful to add this clarification to the exercise notes, since in my opinion it's useful information for the reader: if you want a model that outputs one dinosaur name at a time (as in the exercise), you get a better final result by resetting a_prev to the zero vector after each iteration of stochastic gradient descent; if instead you want a model that outputs, say, 10 dinosaur names all at once, given the zero vector x(0) as input, then the best solution is not to reset a_prev to zero, as the exercise does.
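To make the two regimes concrete, here is a small runnable sketch. sample_one_name is a hypothetical stand-in for the notebook's sample() function, with a placeholder body so the snippet is self-contained:

```python
import numpy as np

n_a = 50                       # hidden-state size (assumption)

def sample_one_name(a, parameters):
    # Hypothetical stand-in for the notebook's sample(): generate characters
    # until "\n" and return the name plus the final hidden state. The body
    # here is a placeholder so the sketch runs.
    return "mosasaurus", np.tanh(a)

parameters = {}                # placeholder for the trained weights

# (a) One independent name per call: reset a0 to zeros every time,
# matching a model trained with a per-name reset of a_prev.
for _ in range(3):
    name, _ = sample_one_name(np.zeros((n_a, 1)), parameters)

# (b) Several names as one continuous stream: carry the hidden state across
# names, matching the original training loop that never resets a_prev.
a = np.zeros((n_a, 1))
for _ in range(3):
    name, a = sample_one_name(a, parameters)
```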
