How does the Transformer model work? Unclear after the lab

FFelix · August 16, 2021, 10:06am

Three questions from last week's graded lab on the Transformer model remain unclear:

1. It is unclear how the Dense layers are applied after attention. I assume the same weights must be applied at every position of the encoder/decoder sequence. But the lab uses a plain Dense layer, while the previous lab on LSTMs wrapped the Dense layer in TimeDistributed for a similar case. Could someone clarify this? Ideally the lab itself would explain it.
2. The final Transformer model expects 3 different attention masks: 1 look-ahead mask and 2 different padding masks. Why are the padding mask in the encoder and the one for the decoder's attention over the encoder outputs different?
3. If I understand correctly, the decoder generates output words one by one, by applying the Decoder again to the same encoder output plus the already generated part of the output sequence. But the final Transformer model we created in the lab outputs all time steps at once. Does that mean it can potentially output different words even for the time steps that were provided as input to produce the next word? Does it also do unnecessary computation to recompute outputs that were already produced? Are those previously produced time steps simply discarded?

FFelix · August 19, 2021, 6:37am

@TMosh Could someone please clarify these questions about the last lab on the Transformer model?

TMosh · August 19, 2021, 7:22am

Sorry, I have not looked at the labs yet.

jonaslalin · August 19, 2021, 7:12pm

Hello @FFelix,

Interesting questions!

Pointwise feed-forward networks

In TensorFlow 2, you don’t need the TimeDistributed wrapper. I will report this upstream to have the other lab changed.

A Dense layer acts on the last axis of its input, so it uses the same weights for every time step. In the Transformer case, that means every item in the sequence, each of which is a vector of d_model numbers. I have created a notebook where you can see this.

As a bonus, I have added 1D convolutions, which (with kernel_size=1) also create pointwise feed-forward networks when given inputs of shape (batch, time, features).
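To make the weight sharing concrete, here is a minimal sketch in plain Python rather than TensorFlow. The helper names `dense` and `pointwise_dense` and all the numbers are made up for illustration and do not come from the lab. It applies one weight matrix to every time step of a (batch, time, features) input, which is the same computation a 1D convolution with kernel size 1 performs.

```python
# Position-wise Dense sketch in plain Python (no TensorFlow needed).
# All names and numbers are illustrative, not taken from the lab.

def dense(vector, weights, bias):
    """Apply y = x @ W + b to a single feature vector of length d_in."""
    d_out = len(bias)
    return [
        sum(x * weights[i][j] for i, x in enumerate(vector)) + bias[j]
        for j in range(d_out)
    ]

def pointwise_dense(batch, weights, bias):
    """Apply the same dense map to every position of every sequence.

    `batch` has shape (batch, time, d_in) and the result has shape
    (batch, time, d_out). The weights do not depend on the time index.
    """
    return [[dense(step, weights, bias) for step in sequence] for sequence in batch]

# One sequence with 3 time steps and d_in = 2 features per step.
batch = [[[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]]]
weights = [[1.0, 0.0, -1.0],   # shape (d_in, d_out) = (2, 3)
           [0.5, 2.0, 1.0]]
bias = [0.0, 1.0, 0.0]

output = pointwise_dense(batch, weights, bias)
print(output[0])
```

Time steps 0 and 2 hold the same input vector, so they produce the same output vector: nothing in the layer depends on the position.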

jonaslalin · August 19, 2021, 7:36pm

Decoder padding mask

Hello @FFelix,

Where have you seen that they are different? Since the keys and values in the decoder's attention over the encoder output come from the Encoder, the Decoder needs the Encoder's padding mask there. They should be the same.
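A small sketch of why the encoder's mask is the one that matters, written in plain Python. It assumes that token id 0 marks padding and that a mask value of 1.0 means "ignore this position"; the lab may use the opposite convention. The function names are hypothetical.

```python
import math

PAD_ID = 0  # assumption: token id 0 marks padding

def padding_mask(token_ids):
    """Return 1.0 where a position is padding and 0.0 elsewhere."""
    return [1.0 if t == PAD_ID else 0.0 for t in token_ids]

def masked_softmax(scores, mask):
    """Softmax over key positions after pushing masked scores to about -1e9."""
    shifted = [s + m * -1e9 for s, m in zip(scores, mask)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Encoder input with 3 real tokens followed by 2 padding tokens.
encoder_tokens = [7, 3, 9, 0, 0]
enc_mask = padding_mask(encoder_tokens)

# Raw scores of one decoder query against the 5 encoder positions.
# The keys are encoder outputs, so the mask has one entry per encoder
# position and is built from the encoder input, not the decoder input.
cross_scores = [0.2, 1.0, -0.5, 3.0, 3.0]
cross_weights = masked_softmax(cross_scores, enc_mask)
print(cross_weights)
```

The two padded encoder positions get zero attention weight even though their raw scores were the largest, and the same `enc_mask` would be used inside the encoder's own self-attention.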

jonaslalin · August 19, 2021, 7:57pm

Inference

It is not unnecessary. Remember that self-attention is also used in the decoder, so all previous words are used to generate the current word. The model does output predictions for the earlier positions again, and they are not guaranteed to match the words that were fed in. Teacher forcing (feeding the ground-truth sentence) is what is used during training; at inference we feed the sentence generated so far, keep only the prediction at the last position as the next word, and throw away the predictions for words already generated in earlier iterations.
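The inference loop can be sketched as follows. `toy_transformer` is a made-up stand-in for the trained model, and the START and END token ids are assumptions; only the shape of the loop is the point.

```python
START, END = 1, 2  # hypothetical special token ids

def toy_transformer(encoder_input, decoder_input):
    """Stand-in for the trained model (not the lab's code).

    Returns one predicted token id per decoder position, like a Transformer
    that outputs all time steps at once. This toy rule only uses the
    position index; a real model would attend to the decoder tokens too.
    """
    predictions = []
    for t in range(len(decoder_input)):
        if t < len(encoder_input):
            predictions.append(encoder_input[t] + 10)  # toy "translation"
        else:
            predictions.append(END)
    return predictions

def greedy_decode(encoder_input, max_len=10):
    """Generate tokens one by one, re-running the model on the growing prefix."""
    output = [START]
    for _ in range(max_len):
        predictions = toy_transformer(encoder_input, output)
        next_token = predictions[-1]  # only the last position is new
        # predictions[:-1] cover positions we already have; they are discarded.
        output.append(next_token)
        if next_token == END:
            break
    return output

decoded = greedy_decode([5, 6, 7])
print(decoded)
```

Each iteration recomputes predictions for every position in the prefix, but only `predictions[-1]` is appended to the output.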

FFelix · August 20, 2021, 5:29am

Thanks for clarifying this! Am I right that for video processing (with video time steps and 2D spatial structure), Conv2D can also be used without any TimeDistributed wrapper?

FFelix · August 20, 2021, 5:32am

In the following code from the lab we have 3 mask parameters, 2 of them for padding. I am confused by the separate dec_padding_mask parameter:

class Transformer

 def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):

FFelix · August 20, 2021, 5:35am

Thanks for clarifying it.
BTW, is something like beam search, as described for RNN-based decoders, applied to Transformer decoders as well? It seems quite natural to apply it there.

jonaslalin · August 20, 2021, 7:14am

Yes, from my understanding, you never need TimeDistributed for Keras layers. Maybe it can help you if you have a custom third party layer that doesn’t play well with time dimensions.

jonaslalin · August 20, 2021, 7:15am

FFelix:

In the following code from the lab we have 3 mask parameters, 2 of them for padding. I am confused by the separate dec_padding_mask parameter:
class Transformer

 def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):

I think it is only there for clarity. In this case a single padding_mask parameter would suffice, since enc_padding_mask and dec_padding_mask are the same.

jonaslalin · August 20, 2021, 7:18am

Yes, beam search applies to Transformer decoders as well, and it usually improves the results over greedy decoding. As an extra exercise, try adding it to your implementation.
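As a starting point for that exercise, here is a minimal beam search sketch in plain Python. `toy_next_log_probs` is a hypothetical stand-in for one decoder step, with a made-up probability table; in a real implementation it would run the Transformer on the prefix and read the distribution at the last position. Length normalization is left out to keep it short.

```python
import math

END = 0  # hypothetical end-of-sequence token id

def toy_next_log_probs(prefix):
    """Stand-in for a decoder step: log-probabilities of the next token.

    The table below is invented for the demo, not produced by a model.
    """
    table = {
        (): {1: 0.6, 2: 0.4},
        (1,): {END: 0.55, 2: 0.45},
        (2,): {END: 0.9, 1: 0.1},
    }
    probs = table.get(tuple(prefix), {END: 1.0})
    return {token: math.log(p) for token, p in probs.items()}

def beam_search(beam_width=2, max_len=5):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in toy_next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == END:
                finished.append((tokens, score))  # complete hypothesis
            else:
                beams.append((tokens, score))     # still being extended
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_tokens, best_score = beam_search()
print(best_tokens, math.exp(best_score))
```

With this table, greedy decoding would pick token 1 first (probability 0.6) and finish with total probability 0.33, while a beam of width 2 keeps token 2 alive and finds the sequence with probability 0.36.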
