RNN for speech recognition

There are two ways of generating sequences. Both of them involve predicting \hat{y}^{<t>} from the input at the current time step and the activation from the previous time step. For the first time step, you can set both the input x^{<1>} and the previous activation a^{<0>} to zero vectors when training the model. This is like passing a dummy START_TOKEN to generate the output for the first time step.
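Below is a minimal sketch of that first time step, assuming a basic RNN cell with a tanh activation and a softmax output layer. The parameter names (Waa, Wax, Wya, ba, by) and the vocabulary size are illustrative, not taken from the text.

```python
import numpy as np

# Illustrative sizes: vocabulary of 10,000 tokens, hidden state of 100 units.
vocab_size, n_a = 10000, 100
Waa = np.random.randn(n_a, n_a) * 0.01
Wax = np.random.randn(n_a, vocab_size) * 0.01
Wya = np.random.randn(vocab_size, n_a) * 0.01
ba, by = np.zeros((n_a, 1)), np.zeros((vocab_size, 1))

# For t = 1, both the input and the previous activation are zero vectors,
# which plays the role of a dummy START_TOKEN.
x_t = np.zeros((vocab_size, 1))   # x^{<1>}
a_prev = np.zeros((n_a, 1))       # a^{<0>}

a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a^{<1>}
z = Wya @ a_t + by
y_hat = np.exp(z) / np.sum(np.exp(z))          # \hat{y}^{<1>}, a distribution over the vocabulary
```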

Here’s how both types of sequence generators differ:

  1. If the most likely token (the argmax of \hat{y}^{<t>}) is always chosen and fed in as the next input, the generator is deterministic: the first output will always be the token that most frequently occurs at the start of the training sequences, and the same sequence is produced every time.
  2. If instead one token is sampled at random from the vocabulary according to the distribution \hat{y}^{<t>}, the model can generate novel sequences (see the sketch after this list).

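Here is a small sketch contrasting the two choices. The y_hat below is a toy softmax output over a tiny illustrative vocabulary; in practice it would come from the RNN output layer as in the earlier sketch.

```python
import numpy as np

# Toy output distribution \hat{y}^{<t>} over a 5-token vocabulary (sums to 1).
vocab_size = 5
y_hat = np.array([0.05, 0.50, 0.20, 0.15, 0.10])

# 1. Greedy choice: always take the most likely token, so the generated
#    sequence comes out the same every time.
greedy_idx = int(np.argmax(y_hat))

# 2. Random sampling: draw one token according to \hat{y}^{<t>}, which is
#    what lets the model produce novel sequences on repeated runs.
sampled_idx = int(np.random.choice(vocab_size, p=y_hat))

# Either way, the chosen token is fed back as the next input x^{<t+1>},
# typically as a one-hot vector.
x_next = np.zeros((vocab_size, 1))
x_next[sampled_idx] = 1.0
```
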
As far as the dimensions go, \hat{y}^{<t>} has the vocabulary size as its dimension, since it is a probability distribution over all tokens in the vocabulary. Please read the topic on the dimension of the hidden state for how a^{<t>} is sized.