NMT Week-1 Assignment - Training

Hi!

You are right about how the decoder is used at training and prediction time.

In the code, during training, the hidden states of the pre-attention decoder's LSTM are not passed to the attention mechanism. This is because we use the shifted-right target sequence (teacher forcing), so the model learns to attend to and predict the correct next token given the ground-truth previous tokens. This target sequence passes through the pre-attention decoder, and only the output of its LSTM, together with the encoded context, is fed to the attention mechanism as the query and the value, respectively.
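Here is a minimal sketch of this wiring in TF/Keras, not the assignment's actual code; the layer sizes and variable names are hypothetical, and I use a single-head `MultiHeadAttention` layer to stand in for whatever attention layer the assignment uses:

```python
import tensorflow as tf

# Hypothetical sizes, for illustration only.
vocab_size, units, batch, enc_len, dec_len = 1000, 64, 2, 12, 10

# Stand-in for the encoder's output (the encoded context).
encoder_output = tf.random.normal((batch, enc_len, units))

# Shifted-right target sequence used for teacher forcing.
shifted_targets = tf.random.uniform(
    (batch, dec_len), maxval=vocab_size, dtype=tf.int32)

# Pre-attention decoder: embedding + LSTM over the shifted-right targets.
embed = tf.keras.layers.Embedding(vocab_size, units)
pre_attention_lstm = tf.keras.layers.LSTM(units, return_sequences=True)
decoder_output = pre_attention_lstm(embed(shifted_targets))

# Cross-attention: query = pre-attention decoder LSTM output,
# value (and key) = encoder output. Note the LSTM's hidden/cell
# states are not passed here, only its per-step outputs.
attention = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=units)
context = attention(query=decoder_output,
                    value=encoder_output,
                    key=encoder_output)

print(context.shape)  # (batch, dec_len, units) -> (2, 10, 64)
```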

I am referring to a previous discussion on this. FYI, the implementation described in the paper is discussed there, and it is slightly different from the TF code.