RNN lecture and programming exercise: activation 0

Hi, I'm not understanding a key concept with RNNs, specifically what a0 is, both in the lecture and in the programming exercise. How is it initialized, and why can't you just start with x at the first timestep? I understand a1, a2, …, aN, but a0 is confusing me since we are feeding x in separately.

Yes, this is one of the key ways that RNNs differ from DNNs or CNNs. The a^{&lt;t&gt;} values are completely different from the x^{&lt;t&gt;} values. The x^{&lt;t&gt;} values are the inputs at each timestep, while the a^{&lt;t&gt;} values represent the "hidden state" of the cell, which is updated at every timestep based on the inputs seen up to that point. The two don't even have the same dimensions, so you can't use the x values to initialize the hidden state. The size of the hidden state is a hyperparameter that you have to choose: it determines how much state the RNN can track, which depends on the complexity of the task you are trying to accomplish with the RNN (e.g. translating French into English).
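To make the shape difference concrete, here is a minimal numpy sketch of a single RNN cell step. The names Wax, Waa, ba and the sizes are just placeholders in the spirit of the usual notation, not necessarily what your notebook uses. The point is that a_prev has shape (n_a, m) while xt has shape (n_x, m), so one cannot stand in for the other:

```python
import numpy as np

np.random.seed(0)

n_x, n_a, m = 3, 5, 10            # input size, hidden-state size, batch size (example values)

xt = np.random.randn(n_x, m)      # input at one timestep, shape (n_x, m)
a_prev = np.zeros((n_a, m))       # hidden state a^{<0>}, shape (n_a, m)

# Weight matrices (hypothetical names, chosen to mirror the usual notation)
Wax = np.random.randn(n_a, n_x)   # input -> hidden
Waa = np.random.randn(n_a, n_a)   # hidden -> hidden
ba = np.zeros((n_a, 1))

# One forward step: a^{<t>} = tanh(Waa a^{<t-1>} + Wax x^{<t>} + ba)
a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)

print(a_next.shape)  # (5, 10) -- same shape as a_prev, not the same as xt
```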

I realize this is a pretty short and probably not very satisfying answer, but I think the best idea is to listen to the lectures again with what I said above in mind and hopefully it will make more sense this time around. Specifically listen for what Prof Ng says about the concept of the “hidden state” of the RNN cell.

Oh, sorry, perhaps I misinterpreted your question. You're not asking what the "hidden state" is, but simply how we initialize it. I did point out that you can't use the x values for that purpose, since they don't even have the same dimensions. You'll see in the notebook that they tell you to initialize the hidden state as a tensor of zeros with the appropriate shape. I think it's also possible to use random initialization, but they have us use zero initialization. It's interesting to "hold that thought" and see if we run into any cases later in the course that handle this differently. Perhaps we could even run some experiments with some of the later models that we train: see if there is a difference in performance or training time between the zero-initialization and random-initialization cases. Note that there are also weights that we need to initialize, and the usual "symmetry breaking" requirement applies to those (as with DNNs and CNNs).
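As a quick illustration of the two options (a sketch only; the notebook itself uses the zero version, and the 0.01 scale for the random version is just an assumption on my part):

```python
import numpy as np

n_a, m = 5, 10  # hidden-state size, batch size (example values)

# Zero initialization of a^{<0>}, as the notebook instructs
a0_zeros = np.zeros((n_a, m))

# A possible random alternative (small values; the 0.01 scale is my assumption)
a0_random = np.random.randn(n_a, m) * 0.01

# The weight matrices, by contrast, must be randomly initialized to break symmetry,
# just as in DNNs/CNNs, e.g. Waa = np.random.randn(n_a, n_a) * 0.01
```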


@paulinpaloalto, I'm very interested now in seeing how the performance compares for random vs. zero initialization. I suspect it will be better when randomized. I'm planning to go through the notebook and parts of the lecture again tonight. Anyhow, thank you for taking the time to answer my questions, which you've done for the third time now! It feels pretty good to know someone's there when I'm scratching my head.