Understanding the basic attention code

And a few additional questions.

  1. Do I understand correctly that in this lab we compute alignment and attention only for one predicted word of the translation (a single decoder_state)? Usually we need to compute this attention for every next word, so there will be multiple calls of attention(encoder_states, decoder_state), where decoder_state is different each time (a different output word), but encoder_states stay the same (see the second sketch after this list).
  2. We have hidden_size. If I understand correctly, it is the word-embedding size of a single word. Then we transform it through a linear layer and tanh, and we get activations with attention_size columns that contain some kind of classification. But what is that, really? How did we get a 5x10 matrix from 5 words with a 16x2 word-embedding size? And why didn't we multiply 5 by 2, by the way? (See the first sketch after this list.)
  3. As I understand it, we need tanh to determine which words of the input sentence are connected with the generated output word, and the alignment() function gives us approximate numbers for that. But after this we do a softmax… Why didn't we apply the softmax right after computing the activations inside alignment()? It looks like I don't fully understand the purpose of alignment() at all. (The first sketch below shows where the softmax sits relative to alignment().)
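
To make the shapes in questions 2 and 3 concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) attention. It is not the lab's actual code: the names W_enc, W_dec, v and the sizes are my assumptions, in particular that the encoder is bidirectional with 16 units per direction (so each of the 5 encoder states has 16x2 = 32 features, while the number of words stays 5), and that attention_size = 10, which would give the 5x10 activation matrix.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Assumed sizes: 5 input words, encoder state size 32 (16 per direction x 2),
# decoder state size 32, attention (intermediate) size 10.
n_inp, enc_size, dec_size, attn_size = 5, 32, 32, 10

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(enc_size, attn_size))  # projects each encoder state
W_dec = rng.normal(size=(dec_size, attn_size))  # projects the decoder state
v = rng.normal(size=(attn_size,))               # collapses attn_size -> 1 score

def alignment(encoder_states, decoder_state):
    # encoder_states: [n_inp, enc_size], decoder_state: [dec_size]
    # tanh is just the nonlinearity of a small hidden layer; the output
    # is one unnormalized real-valued score per input word.
    activations = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec)  # [n_inp, attn_size] = [5, 10]
    return activations @ v                                                 # [n_inp]

def attention(encoder_states, decoder_state):
    scores = alignment(encoder_states, decoder_state)  # one raw score per input word
    probs = softmax(scores)                            # softmax turns scores into weights summing to 1
    context = probs @ encoder_states                   # weighted sum of encoder states
    return context, probs
```

In this reading, alignment() only produces raw relevance scores; the softmax is applied afterwards, over all input words at once, so that the scores become a probability distribution used to average the encoder states.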
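And here is a schematic decoding loop for question 1, reusing the functions above: encoder_states are computed once and reused at every step, while decoder_state changes, so attention() is called once per generated word. The update of decoder_state is faked here just to show the call pattern.

```python
# encoder_states would come from running the encoder once over the input sentence
encoder_states = rng.normal(size=(n_inp, enc_size))
decoder_state = rng.normal(size=(dec_size,))

for step in range(3):  # pretend we generate 3 output words
    context, probs = attention(encoder_states, decoder_state)  # same encoder_states every time
    # a real decoder would combine `context` with the previous token embedding
    # and run one RNN step to get the next decoder_state; this is only a placeholder
    decoder_state = np.tanh(context)
```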