RNN for speech recognition

In the Week 1 video of C5, Language Model and Sequence Generation, as shown in the screenshot below, we input all-zero vectors, namely x<1> = 0 and a<0> = 0, but we still get probabilities for the words in the vocabulary. So my question is: how is it possible to get a result when the inputs are trivial, just zeros?
My second question is: what are the dimensions of the y's and a's? Andrew says we apply a softmax to get the probabilities. This would mean the y's and a's have the dimension of the vocabulary, which is 10k. Is this correct?

There are 2 ways of generating sequences. Both of them involve predicting \hat{y}^{<t>} based on the input at the current time step and the activation from the previous time step. For the 1st token, you can set both x^{<t>} and a^{<t-1>} to zeros when training the model. This is like passing a dummy START_TOKEN to generate the output for the 1st time step. Even with zero inputs, the learned bias terms still feed the softmax, so \hat{y}^{<1>} is a valid probability distribution over the vocabulary.
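To make the zero-input case concrete, here is a minimal NumPy sketch. It assumes the basic RNN cell from the course, a^{<t>} = tanh(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a) and \hat{y}^{<t>} = softmax(W_ya a^{<t>} + b_y). The tiny sizes and random weights are made up for illustration and stand in for trained parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a single column vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, n_a = 10, 4  # toy sizes, assumed for illustration

# Random stand-ins for trained parameters.
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, vocab_size))
Wya = rng.normal(size=(vocab_size, n_a))
ba = rng.normal(size=(n_a, 1))
by = rng.normal(size=(vocab_size, 1))

x1 = np.zeros((vocab_size, 1))  # x<1> = 0
a0 = np.zeros((n_a, 1))         # a<0> = 0

# With zero inputs the weight terms vanish, so a<1> reduces to tanh(ba).
a1 = np.tanh(Waa @ a0 + Wax @ x1 + ba)
y_hat1 = softmax(Wya @ a1 + by)

print(np.allclose(a1, np.tanh(ba)))  # the activation depends only on the bias
print(y_hat1.sum())                  # the probabilities still sum to 1
```

So the first prediction is driven entirely by the biases and W_ya, which after training would encode how likely each word is to start a sentence.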

Here’s how both types of sequence generators differ:

  1. If the output at the current time step is used directly as the prediction, then the output will correspond to the most frequently occurring token at the start of the input.
  2. If a random token is sampled based on the output of the current time step, i.e. \hat{y}^{<t>} is used to sample 1 token from all tokens, then we can generate novel sequences.
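One way to read the difference between the two is as picking the top entry of \hat{y}^{<t>} versus drawing from it. The sketch below uses a made-up 4-word vocabulary and a made-up probability vector; neither comes from the course.

```python
import numpy as np

vocab = ["the", "cats", "average", "<EOS>"]  # hypothetical toy vocabulary
y_hat = np.array([0.5, 0.3, 0.15, 0.05])     # hypothetical softmax output

# Using the output directly: always returns the single most probable token.
greedy_idx = int(np.argmax(y_hat))
print(vocab[greedy_idx])  # the

# Sampling: each token is drawn with probability given by y_hat,
# so repeated draws produce different tokens.
np.random.seed(0)
sampled = np.random.choice(len(vocab), size=1000, p=y_hat)
counts = np.bincount(sampled, minlength=len(vocab))
print(dict(zip(vocab, counts.tolist())))
```

The counts should come out roughly proportional to y_hat, which is what lets a sampled sequence differ from run to run.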

As far as the dimensions go, \hat{y}^{<t>} has the vocabulary size as its dimension. Please read this topic on the dimension of the hidden state.
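A quick shape check may help here. The vocabulary size of 10,000 is from the question; the hidden size n_a = 100 is an assumed value, since the hidden size is a hyperparameter you choose and is not tied to the vocabulary. Biases are omitted because they do not affect the shapes.

```python
import numpy as np

vocab_size = 10_000  # from the question above
n_a = 100            # hypothetical hidden size, a free hyperparameter

x_t = np.zeros((vocab_size, 1))    # one-hot input has vocabulary dimension
a_prev = np.zeros((n_a, 1))        # hidden state has dimension n_a

Wax = np.zeros((n_a, vocab_size))  # maps input to hidden state
Waa = np.zeros((n_a, n_a))         # maps hidden state to hidden state
Wya = np.zeros((vocab_size, n_a))  # maps hidden state to output scores

a_t = np.tanh(Waa @ a_prev + Wax @ x_t)
z_t = Wya @ a_t                    # scores that the softmax would normalise

print(a_t.shape, z_t.shape)  # (100, 1) (10000, 1)
```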

Well, I am more confused now. So in the video the training is done using your 1st type of sequence generation, which takes the correct words or tokens of a sentence as input and gives a probability vector, y hat, as output.
But why do we need the sampling process at all? What does np.random.choice do? Is the sampling process a continuation of training the model?
I mean, let's say we get the output y hat at time step t. This output already gives us the probabilities of all tokens in the vocabulary, and therefore we can obtain the most common or probable token from that output. In this case, why do we need sampling?
Sorry for possibly bothering you with my questions but I just want to get it right.
Thanks for your patience!

No worries.

As far as training is concerned, the 1st method is used.
The 2nd method is used only after training. Please see this lecture.
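As a rough sketch of what the post-training sampling loop looks like: the token sampled at one step is fed back as the one-hot input of the next step, and no weights are updated. The function name `sample_sequence`, the toy sizes, and the random parameters are all assumptions for illustration; a real run would use the trained parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a single column vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sequence(params, vocab_size, eos_idx, max_len, rng):
    """Generate token indices by sampling; the parameters are never changed."""
    Waa, Wax, Wya, ba, by = params
    x = np.zeros((vocab_size, 1))     # x<1> = 0
    a = np.zeros((Waa.shape[0], 1))   # a<0> = 0
    indices = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y_hat = softmax(Wya @ a + by)
        # Draw one token index according to the predicted distribution.
        idx = int(rng.choice(vocab_size, p=y_hat.ravel()))
        indices.append(idx)
        if idx == eos_idx:
            break
        # Feed the sampled token back in as the next one-hot input.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1.0
    return indices

rng = np.random.default_rng(0)
vocab_size, n_a, eos_idx = 6, 4, 5    # toy sizes, assumed
params = (
    rng.normal(size=(n_a, n_a)),
    rng.normal(size=(n_a, vocab_size)),
    rng.normal(size=(vocab_size, n_a)),
    rng.normal(size=(n_a, 1)),
    rng.normal(size=(vocab_size, 1)),
)
seq = sample_sequence(params, vocab_size, eos_idx, max_len=8, rng=rng)
print(seq)
```

If `rng.choice` were replaced by `np.argmax`, every call would return the same sequence; sampling is what produces different sequences from the same trained model.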

Yes, I did watch that video, but there is still no explanation of why we need sampling. What does sampling do at all?
In the sampling process we are getting only the most common words in the literature or training text, which have nothing to do with our input speech sentence that is supposed to be recognized by the machine.
OK, apparently I need to watch some other explainer videos.
Thanks for the answers.