Week 1, Programming assignments (#2, #3)

I have a question about the Week 1 assignments for Sequence Models. I understand that for the generation tasks, we need to feed the output from the previous time step t as the input to the next one, so to create a new word or character we use the ones generated in the previous step. However, I am not sure how to implement that in the code, particularly in the model function that generates dinosaur names: I am not confident which slices of X we use for Y when optimizing the model. Do we use all of X created so far and treat that as the predictions? Please help me understand the code for this section.

Thank you,

I recommend you read the instructions in the notebook very carefully.

The instructions say: "Set the list of labels (integer representation of the characters): Y"

  • The goal is to train the RNN to predict the next letter in the name, so the labels are the list of characters that are one time-step ahead of the characters in the input X.
    • For example, Y[0] contains the same value as X[1]
  • The RNN should predict a newline at the last letter, so add ix_newline to the end of the labels.
    • Append the integer representation of the newline character to the end of Y.
    • Note that append is an in-place operation.
    • It might be easier for you to add two lists together.
So, in the function “model” in the Dinosaur Island assignment, the model iterates over the names in the dataset, converts each name into a list of character indices, and uses that single example as the training data for that optimization step. Basically, it learns from each name in turn and updates its parameters based on that name’s sequence of characters. X is the list of character indices and Y should be the same list one step ahead (see the sketch below). When I put it in words, I understand it, but the whole process still does not click for me. I still have some difficulty understanding the character-level optimization; maybe it is because it is stochastic gradient descent.
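
Roughly, the X/Y setup for one name looks something like this. This is only a minimal sketch, assuming `examples` is the notebook’s list of lowercase names, `idx` is the current example index, and `char_to_ix` maps each character (including `"\n"`) to an integer index:

```python
# Minimal sketch of building one training example (X, Y) for the character-level model.
name = examples[idx]

# X starts with None, which the forward pass treats as the zero input vector x<1>.
X = [None] + [char_to_ix[ch] for ch in name]

# Y is X shifted one step ahead: the label at time t is the character at time t+1,
# with the newline index appended so the model learns when to end the name.
ix_newline = char_to_ix["\n"]
Y = X[1:] + [ix_newline]

# Example: for "abc", X = [None, a, b, c] and Y = [a, b, c, "\n"] (all as indices).
```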

Yes, the instructions here are very detailed and it sounds like you have understood them correctly.

I’m not sure I see why the size of the batch for GD affects the intuition here. It’s just a question of whether you are averaging the gradients over multiple samples or not. The learning should end up being statistically the same, although the exact path you take along the solution surface to get there may vary a bit.
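
To make that concrete, here is a rough sketch of the two update styles. The names `compute_gradients` and `params` are hypothetical and only for illustration; they are not the notebook’s functions:

```python
import numpy as np

def sgd_step(params, example, compute_gradients, lr=0.01):
    """Stochastic GD: update from the gradient of a single example."""
    grads = compute_gradients(params, [example])
    return {k: params[k] - lr * grads[k] for k in params}

def minibatch_step(params, batch, compute_gradients, lr=0.01):
    """Mini-batch GD: average the per-example gradients, then update once."""
    grads_list = [compute_gradients(params, [ex]) for ex in batch]
    avg_grads = {k: np.mean([g[k] for g in grads_list], axis=0) for k in params}
    return {k: params[k] - lr * avg_grads[k] for k in params}
```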

Maybe the other key thing to keep in mind here is the difference between training and inference mode. In inference mode, they also introduce randomness at the level of the individual characters just to keep things more interesting, meaning that the model doesn’t always generate the same name every time.
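
For example, at each sampling step the next character index can be drawn from the softmax distribution rather than taken as the argmax. A small sketch of that idea (with `y` standing in for the softmax output at one time step and `vocab_size` for the number of characters):

```python
import numpy as np

vocab_size = 27                     # 26 letters + newline, as in the assignment
y = np.random.rand(vocab_size, 1)   # placeholder for the softmax output at one step
y = y / y.sum()                     # normalize so it is a valid probability distribution

# Sampling: pick the next character index at random according to y,
# so repeated runs can produce different names.
idx_sampled = np.random.choice(np.arange(vocab_size), p=y.ravel())

# Greedy alternative: always pick the most likely character,
# which would generate the same name every time.
idx_greedy = int(np.argmax(y))
```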

My apologies if I have just missed the full subtlety of what you are saying here. :nerd_face:

It is resolved. Thank you. :slight_smile:
