In the assignment, it appears that we use multiple LSTM layers in the encoder, only one LSTM layer in the pre-attention decoder, and then two LSTM layers again in the post-attention decoder. What is the reasoning here?
I would have assumed that the encoder and the decoder always need to have the same level of “complexity”.
The depiction of the model after UNQ_C4 shows two LSTM layers for the encoder and two for the post-attention decoder. The pre-attention decoder uses only a single LSTM layer, but it serves a different function (producing the queries used to compute attention), so it does not break with the standard LSTM encoder-decoder model. For an example of the latter see, e.g., this publication.
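To make the layout concrete, here is a minimal sketch of how the three pieces fit together. This is written in plain PyTorch rather than the assignment's Trax code, and the class name, dimensions, and single-head attention are illustrative assumptions, not the assignment's actual implementation:

```python
import torch.nn as nn


class NMTWithAttention(nn.Module):
    """Sketch: encoder / pre-attention decoder / post-attention decoder layout."""

    def __init__(self, src_vocab, tgt_vocab, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        # Encoder: two stacked LSTM layers over the source sentence.
        self.encoder = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        # Pre-attention decoder: a single LSTM layer whose only job is to
        # turn the (shifted) target tokens into queries for attention.
        self.pre_attn_decoder = nn.LSTM(d_model, d_model, num_layers=1, batch_first=True)
        # Attention: queries from the pre-attention decoder,
        # keys/values from the encoder activations.
        self.attention = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        # Post-attention decoder: two stacked LSTM layers over the attention
        # output, followed by a projection to the target vocabulary.
        self.post_attn_decoder = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        keys_values, _ = self.encoder(self.src_emb(src_tokens))
        queries, _ = self.pre_attn_decoder(self.tgt_emb(tgt_tokens))
        context, _ = self.attention(queries, keys_values, keys_values)
        decoded, _ = self.post_attn_decoder(context)
        return self.out(decoded)  # logits over the target vocabulary
```

Viewed this way, the single pre-attention LSTM is just the mechanism for forming queries; the depth of the translation model itself sits in the two-layer encoder and two-layer post-attention decoder, which is why the apparent asymmetry is not a mismatch in "complexity".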