There are two ways of generating sequences. Both involve predicting \hat{y}^{<t>} from the input at the current time step and the activation from the previous time step. For the first token, you can set both x^{<1>} and a^{<0>} to zero vectors, just as is done when training the model. This is like passing a dummy START_TOKEN to generate the output for the first time step.
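Below is a minimal sketch of that first step. The parameter names (Wax, Waa, Wya, ba, by) and the sizes are assumptions for illustration; in practice they would come from the trained model.

```python
import numpy as np

# Assumed shapes and randomly initialized parameters, purely for illustration.
vocab_size, hidden_size = 27, 100
Wax = np.random.randn(hidden_size, vocab_size) * 0.01
Waa = np.random.randn(hidden_size, hidden_size) * 0.01
Wya = np.random.randn(vocab_size, hidden_size) * 0.01
ba = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

# x^<1> and a^<0> are zero vectors -- the "dummy START_TOKEN".
x = np.zeros((vocab_size, 1))
a_prev = np.zeros((hidden_size, 1))

a = np.tanh(Wax @ x + Waa @ a_prev + ba)   # a^<1>
y_hat = softmax(Wya @ a + by)              # \hat{y}^<1>: a distribution over the vocabulary
```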
Here’s how both types of sequence generators differ:
- If the most likely token (the argmax of \hat{y}^{<t>}) is always chosen as the output at each time step, generation is deterministic: the first output will simply be the token that most frequently starts sequences in the training data, and the same sequence is produced every time.
- If a token is instead sampled at random from \hat{y}^{<t>}, i.e. \hat{y}^{<t>} is treated as a probability distribution over all tokens and one token is drawn from it, then we can generate novel sequences.
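Here is a sketch of both options, continuing from the y_hat computed above (greedy argmax versus drawing from the distribution with np.random.choice):

```python
probs = y_hat.ravel()   # \hat{y}^{<t>} flattened to a 1-D probability vector

# Option 1: always take the argmax -> deterministic, repeats the most likely token path.
greedy_idx = int(np.argmax(probs))

# Option 2: sample one token according to the distribution -> novel sequences.
sampled_idx = np.random.choice(vocab_size, p=probs)

# Either way, the chosen token is one-hot encoded and fed back as x^{<t+1>},
# and the loop repeats until an end-of-sequence token or a length limit is reached.
x_next = np.zeros((vocab_size, 1))
x_next[sampled_idx] = 1
```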
As far as dimensions go, \hat{y}^{<t>} has the vocabulary size as its dimension (one probability per token in the vocabulary). Please read the topic on the dimension of the hidden state for more details.