And a few additional questions.
- Do I understand correctly that in this lab we compute alignment and attention only for one of the predicted words in the translation (`decoder_state`)? But usually we need to compute this attention for each next word, so there would be multiple calls of `attention(encoder_states, decoder_state)`, where `decoder_state` is different each time (a different output word) but `encoder_states` stay the same. (A minimal sketch of the calling pattern I mean is in the first code block after this list.)
- We have `hidden_size`. If I understand correctly, it is the word-embedding size of one word. Then we pass it through a linear transformation and tanh and get `activations` with `attention_size` columns, which contain some kind of classification. But what is it, really? How did we get a 5x10 matrix from 5 words with a 16x2 word-embedding size? And why didn't we multiply 5 by 2, by the way? (The second sketch after the list spells out the shapes I mean.)
- As I understand it, we need tanh to clearly pick out which words in the input sentence are connected with the generated output word, and we get approximate numbers from the `alignment()` function. But after that we do a softmax... Why haven't we applied the softmax right after computing the activations inside `alignment()`? It looks like I don't fully understand the purpose of `alignment()` at all. (The third sketch after the list shows the two-step split as I currently read it.)
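
Here is a minimal sketch of the calling pattern from the first question. The dot-product `attention()` below is just a placeholder I wrote so the loop runs; I know the lab uses its own version, and the sizes (5 input words, hidden size 32, 7 output steps) are made up:

```python
import numpy as np

def attention(encoder_states, decoder_state):
    # placeholder scoring: dot-product + softmax, only to make the loop runnable
    scores = encoder_states @ decoder_state              # one score per input word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # attention probabilities
    return weights @ encoder_states                       # weighted sum = context vector

n_src_words, hid = 5, 32
encoder_states = np.random.randn(n_src_words, hid)        # computed once per sentence

decoder_state = np.zeros(hid)                              # initial decoder state
for t in range(7):                                         # one step per generated word
    context = attention(encoder_states, decoder_state)     # same encoder_states every call
    decoder_state = np.tanh(decoder_state + context)       # stand-in for the real decoder step
print(context.shape)                                       # (32,)
```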
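And this is the shape question from the second bullet spelled out in numpy. The numbers `n_words = 5`, `hidden_size = 16 * 2` and `attention_size = 10` are just what I think I see in the notebook, and the single dense layer plus tanh is my reading of the "linear transformation and tanh" step, so I may be wrong about both:

```python
import numpy as np

n_words, hidden_size, attention_size = 5, 16 * 2, 10       # my guesses from the notebook

encoder_states = np.random.randn(n_words, hidden_size)     # [5, 32]: one row per input word
W = np.random.randn(hidden_size, attention_size)           # learned projection [32, 10]
b = np.zeros(attention_size)

activations = np.tanh(encoder_states @ W + b)              # [5, 32] @ [32, 10] -> [5, 10]
print(activations.shape)                                   # (5, 10)
```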
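Finally, for the third bullet, this is how I currently read the split between `alignment()` and the softmax that comes after it. The signature of `alignment()` and the scoring vector `v` are my assumptions, not the lab's actual code:

```python
import numpy as np

def alignment(activations, v):
    # raw (unnormalized) relevance scores, one per input word -- the "approximate numbers"
    return activations @ v                                  # [5, 10] @ [10] -> [5]

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()                                       # probabilities that sum to 1

activations = np.tanh(np.random.randn(5, 10))                # pretend output of the previous step
v = np.random.randn(10)                                      # learned scoring vector (my assumption)

scores = alignment(activations, v)                           # raw scores
weights = softmax(scores)                                    # softmax applied outside alignment()
print(scores, weights, weights.sum())
```

If this reading is roughly right, then my question reduces to: why is the softmax kept outside `alignment()` instead of being applied inside it?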