How does the Transformer model work? Unclear after the lab

After last week’s lab on the Transformer model (the graded one), three questions remain unclear:
1. It is unclear how the Dense layers are applied after attention. I guess the same weights must be applied at every position in the encoder/decoder vectors? But the lab uses a plain Dense layer, while for the similar case in the previous lab on LSTMs a TimeDistributed wrapper around the Dense layer is used. Could someone clarify this? It would be ideal to see this clarified in the lab itself.
2. The final Transformer model expects three different attention masks: one look-ahead mask and two different padding masks. Why are the padding mask in the encoder and the padding mask in the decoder’s attention over the encoder outputs different?
3. If I got it right, the decoder generates output words one by one, by applying the Decoder again to the same encoder output plus the already generated part of the final output sequence. But the final Transformer model that we created in the lab outputs all time steps at once. Does this mean it can potentially output different words even for the time steps that were provided as input to produce the next word? And does it also do unnecessary computation to recompute outputs that were already produced? Are those previously produced time steps simply discarded?


@TMosh Could someone please clarify these questions about the last lab on the Transformer model?

Sorry, I have not looked at the labs yet.

Hello @FFelix,

Interesting questions!

Pointwise feed-forward networks

In TensorFlow 2, you don’t need the TimeDistributed wrapper. I will report this upstream to have the other lab changed.

The Dense layer uses the same weights for every time step. In the Transformer case, this means every position in the sequence (each of which is a vector of d_model numbers). I have created a notebook where you can see this.

As a bonus, I have added 1D convolutions, which (with a kernel size of 1) also act as pointwise feed-forward networks when given inputs of shape (batch, time, features).
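To make the weight sharing concrete, here is a minimal NumPy sketch (not the notebook mentioned above, and not the lab code). The names kernel and bias and all the sizes are made up for illustration. It assumes a Dense layer on a 3D input is just a matrix multiplication over the last axis, and compares that with applying the same weights to each position separately, which is what a TimeDistributed wrapper would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these come from the lab.
batch, seq_len, d_model, d_ff = 2, 5, 8, 16

# Hypothetical Dense parameters: one kernel and one bias, with no time dimension.
kernel = rng.normal(size=(d_model, d_ff))
bias = rng.normal(size=(d_ff,))

x = rng.normal(size=(batch, seq_len, d_model))

# Dense on a 3D input: the matmul acts on the last axis only,
# so every position is transformed by the same kernel and bias.
dense_out = x @ kernel + bias  # shape (batch, seq_len, d_ff)

# The "TimeDistributed" view: loop over positions and reuse the same weights.
per_step_out = np.stack(
    [x[:, t, :] @ kernel + bias for t in range(seq_len)], axis=1
)

print(dense_out.shape)
print(np.allclose(dense_out, per_step_out))
```

Both computations give the same tensor, which is why the wrapper adds nothing for a plain Dense layer in this setting.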

Decoder padding mask

Hello @FFelix,

Where have you seen that they are different? Since Key and Value come from the Encoder, we need the padding mask from the Encoder to use in the Decoder. They should be the same.
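Here is a small NumPy sketch of why the decoder's attention over the encoder outputs needs a mask built from the input sentence. It is an illustration under assumptions, not the lab's implementation: create_padding_mask and masked_attention_weights are hypothetical helpers, padding is assumed to be token id 0, and the mask uses 1.0 for "ignore this position" (the lab may use the opposite convention).

```python
import numpy as np

def create_padding_mask(token_ids, pad_id=0):
    """Return 1.0 where the token is padding; shape (batch, 1, seq_len)."""
    return (token_ids == pad_id).astype(np.float32)[:, np.newaxis, :]

def masked_attention_weights(q, k, mask):
    """Scaled dot-product attention weights with masked key positions suppressed."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores + mask * -1e9                      # push masked keys to -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model = 4

source_ids = np.array([[5, 7, 9, 0, 0]])  # input sentence with 2 padded positions
target_ids = np.array([[3, 4, 6]])        # output sentence so far

# Random stand-ins for the encoder output (keys) and decoder states (queries).
enc_output = rng.normal(size=(1, source_ids.shape[1], d_model))
dec_hidden = rng.normal(size=(1, target_ids.shape[1], d_model))

# The keys are encoder positions, so the mask is built from the SOURCE ids.
padding_mask = create_padding_mask(source_ids)

weights = masked_attention_weights(dec_hidden, enc_output, padding_mask)
print(weights.shape)  # (1, 3, 5): one row per target position, one column per source position
print(weights[0, :, 3:])  # attention paid to the padded source positions
```

The attention matrix has one column per encoder position, so a mask built from the target sentence would not even have the right length here.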


It is not unnecessary. Remember that self-attention is also used in the decoder, so all previous words can be used to generate the current word. The outputs at the earlier positions are not guaranteed to match the words that were fed in. During training, teacher forcing is usually used. When generating, we feed in the sentence generated so far, keep only the prediction for the next word, and throw away the predictions for words we have already generated in earlier iterations.
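The generation loop can be sketched as follows. This is a toy under stated assumptions, not the lab's model: toy_decoder is a hypothetical stand-in with random weights and no encoder input, and it mimics the look-ahead mask by letting position t see only tokens 0..t. The START and END ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 4
START, END = 1, 2  # hypothetical special token ids

# Hypothetical stand-in for a trained decoder: random embeddings and projection.
embedding = rng.normal(size=(vocab_size, d_model))
projection = rng.normal(size=(d_model, vocab_size))

def toy_decoder(token_ids):
    """Return logits of shape (len(token_ids), vocab_size).

    Position t uses a running mean over tokens 0..t, which mimics the
    effect of the look-ahead mask in decoder self-attention.
    """
    embedded = embedding[np.array(token_ids)]
    steps = np.arange(1, len(token_ids) + 1)[:, np.newaxis]
    causal_context = np.cumsum(embedded, axis=0) / steps
    return causal_context @ projection

def greedy_decode(max_len=8):
    output = [START]
    for _ in range(max_len):
        logits = toy_decoder(output)             # predictions for ALL positions
        next_token = int(np.argmax(logits[-1]))  # keep only the last position
        output.append(next_token)
        if next_token == END:
            break
    return output

decoded = greedy_decode()
print(decoded)

# Re-running on a longer prefix reproduces the logits of the earlier positions,
# so those recomputed rows carry no new information and are simply discarded.
short = toy_decoder(decoded[:2])
longer = toy_decoder(decoded[:3])
print(np.allclose(short, longer[:2]))
```

In this deterministic, strictly causal toy the earlier rows come out identical on every iteration. With teacher forcing during training, the predictions at earlier positions still need not equal the ground-truth tokens that were fed in.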

Thanks for clarifying this! Am I right that for video processing (with video timestamps and 2D spatial structure), Conv2D can be used without any TimeDistributed wrapper as well?

In the following code from the lab we have three mask parameters. I am confused by the separate dec_padding_mask parameter:

class Transformer:
    def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        ...

Thanks for clarifying it.
BTW, is something like beam search, as described for RNN-based decoders, applied to Transformer decoders as well? It seems quite natural to apply it there.

Yes, from my understanding, you never need TimeDistributed for Keras layers. It might help if you have a custom third-party layer that doesn’t play well with time dimensions.

I think it is only there for clarity. In this case a single padding_mask parameter would suffice, since enc_padding_mask and dec_padding_mask are the same.

Yes, beam search applies to Transformer decoders as well, and it usually improves the results. As an extra exercise, please add it to your implementation :smiley:
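As a starting point for that exercise, here is a minimal beam search sketch in plain Python. Everything in it is hypothetical: next_token_log_probs is a hand-written table standing in for the decoder's softmax output, and there is no length normalization, which real implementations often add.

```python
import math

START, END = "<s>", "</s>"

def next_token_log_probs(prefix):
    """Hypothetical next-token distribution, conditioned on the last token only."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.4, "y": 0.35, "</s>": 0.25},
        "b": {"</s>": 0.9, "x": 0.1},
        "x": {"</s>": 1.0},
        "y": {"</s>": 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}

def beam_search(beam_width=2, max_len=5):
    beams = [([START], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, logp in next_token_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # Keep the best beam_width candidates; set aside those that ended.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == END:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])

best_tokens, best_score = beam_search(beam_width=2)
print(best_tokens, math.exp(best_score))

greedy_tokens, greedy_score = beam_search(beam_width=1)  # width 1 is greedy decoding
print(greedy_tokens, math.exp(greedy_score))
```

In this toy table the greedy path commits to "a" because it is the most likely first token, while a beam of width 2 keeps "b" alive and ends up with a more probable full sequence.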