4 Questions on Transformers

Assume a transformer is trained on sentences with a maximum length of 512 tokens:

  1. Can we fine-tune it on sentences with a maximum length of 256?
  2. If we can, how is that possible when the input shapes are different? At a high level, what changes from the weight matrices through to the final layer?
  3. Does a transformer decoder process one token at a time, or all tokens at once? If all at once, how does a single transformer handle variable input lengths? After the attention mechanism there is a fully connected neural network (FNN), so how is a variable length compatible with this FNN?
  4. I suspect the output process differs between sentiment analysis and text generation in the transformer decoder architecture: in text generation the decoder processes one token at a time, while sentiment analysis can look at all tokens at once, right? How does the same decoder architecture handle both cases? Do all decoders process one token at a time or all tokens at once, and if neither, how does one architecture cover both examples?

Hello Arjun,
Thanks for asking these questions. I will do the best I can to help you out.
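1- Yes, you can fine-tune a transformer on 256 max-length sentences even if it was originally trained on 512 max-length sentences. This works because none of the transformer's weight matrices depend on the sequence length: the embedding, attention, and feed-forward weights are sized by the model dimension and are applied to however many tokens arrive. The only length-dependent piece is the positional information, and a model trained with positions up to 512 already covers every position in a 256-token input. Fine-tuning simply continues training from the pretrained weights (sometimes called warm-starting) on the shorter sentences.
2- When fine-tuning, the input shapes are different because the new sentences are shorter than the original ones, but this is not a problem. Attention computes a score between every pair of tokens, and the feed-forward layers act on each token separately, so the same weights work for any number of tokens up to the trained maximum. Padding is only needed to batch sentences of different lengths together: shorter sentences are extended with a special padding token (often, but not always, ID 0) so that every sentence in the batch has the same length, and an attention mask tells the model to ignore the padded positions. The padded length can be that of the longest sentence in the batch; it does not have to be the original 512.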

Here is a more detailed explanation of how padding works:

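  1. The tokenizer converts each sentence into a sequence of token IDs.
  2. The length of each sequence is compared with the target length (the longest sequence in the batch, or a fixed maximum).
  3. If a sequence is shorter than the target length, padding tokens are appended to its end until it reaches that length, and an attention mask records which positions hold real tokens.
  4. The padded batch then passes through the embedding, attention, and feed-forward (FNN) layers; the mask keeps attention from using the padded positions.
  5. The model outputs a prediction, and any outputs at padded positions are ignored.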
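
The padding steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the helper name `pad_batch`, the padding ID of 0, and the example token IDs are all assumptions made for the example.

```python
# Hypothetical helper: pad a batch of token-ID lists and build an attention mask.
PAD_ID = 0  # assumed padding token ID; real tokenizers define their own


def pad_batch(sequences, max_len=None, pad_id=PAD_ID):
    """Pad each sequence to a common length and return (ids, mask)."""
    if max_len is None:
        # Padding to the longest sequence in the batch is enough.
        max_len = max(len(seq) for seq in sequences)
    ids, mask = [], []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError("sequence longer than max_len; truncate it first")
        n_pad = max_len - len(seq)
        ids.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return ids, mask


batch = [[101, 7, 8, 102], [101, 9, 102]]  # made-up token IDs
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask is what lets the model ignore the padded positions, so the padding value itself does not affect the result.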

3- It depends on whether the decoder is generating new text or reading a sequence that is already known. When generating, the decoder is auto-regressive: it predicts the next token from the tokens produced so far, appends it, and repeats, so the output is produced one token at a time. When the whole sequence is already available (during training, or when classifying an existing sentence), all tokens are processed in parallel in a single pass, and a causal mask ensures that each position attends only to itself and to earlier positions. Variable lengths are not a problem for the fully connected network that follows attention, because it is applied to each token's vector separately with the same weights; it never sees the whole sequence as one fixed-size input.
4- The output process does differ between sentiment analysis and text generation, but the decoder stack itself stays the same; what changes is the layer placed on top of it and how the model is run. For sentiment analysis, the whole sentence is passed through in a single pass and a classification head maps one hidden state (commonly the last token's) to a sentiment label. For text generation, a language-model head predicts the next token, and the model is run repeatedly, appending one token at each step. Beam search is one strategy for choosing tokens during generation: rather than considering every possible sequence, it keeps a fixed number of the most probable partial sequences at each step. It is not used for sentiment analysis.
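
On the question of how the fully connected network copes with variable lengths, a small sketch may help. It assumes the usual position-wise design, where the same two-layer network is applied to each token vector separately; the toy sizes, random weights, and function names are made up for illustration.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Compute weights @ v + bias for a single vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def feed_forward(tokens, w1, b1, w2, b2):
    """Apply the same two-layer network to each token vector independently."""
    return [linear(relu(linear(tok, w1, b1)), w2, b2) for tok in tokens]


random.seed(0)
d_model, d_hidden = 4, 8  # toy sizes chosen for illustration
w1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [0.0] * d_model

# Two "sentences" of different lengths that share their first 3 token vectors.
short = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
long = short + [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(5)]

out_short = feed_forward(short, w1, b1, w2, b2)
out_long = feed_forward(long, w1, b1, w2, b2)
print(len(out_short), len(out_long))  # 3 8
```

The weight shapes depend only on `d_model` and `d_hidden`, never on the number of tokens, which is why the same weights serve both the 3-token and the 8-token input.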
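
One common way for a single decoder stack to serve both tasks is to change only the output head placed on top of it. The sketch below assumes that setup; the hidden states and head weights are hand-picked toy numbers, not values from a real model.

```python
def linear(v, weights, bias):
    """Compute weights @ v + bias for a single vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])


# Pretend these are the decoder's final hidden states: 3 tokens, d_model = 2.
hidden = [[0.5, -0.2], [1.0, 0.0], [0.2, 0.9]]

# Classification head: a single prediction from the last token's hidden state.
cls_w, cls_b = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]  # 2 sentiment classes
sentiment = argmax(linear(hidden[-1], cls_w, cls_b))

# Language-model head: one next-token prediction per position.
lm_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy vocabulary of 3 tokens
lm_b = [0.0, 0.0, 0.0]
next_tokens = [argmax(linear(h, lm_w, lm_b)) for h in hidden]

print(sentiment)    # 1
print(next_tokens)  # [0, 0, 1]
```

During generation only the prediction at the last position is used at each step; the new token is appended and the model is run again.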

I hope this answers your questions. Let me know if you have any other questions.
Can Koz

Hey @Arjun_Reddy,
Let me put in my 2 cents on this.

As Can pointed out, padding is used to make sure that all the input sequences in a batch have a uniform length when fed to the model. This technique is used by many NLP architectures, not just by transformers. With it, you can fine-tune your transformer on sentences of any length, as long as they are no longer than the sequences on which the transformer was trained.

If, say, your transformer was trained on 512-length sequences and your inputs contain sequences with a maximum length of 1024, it becomes even more interesting. If only a minority of sequences are longer than 512, you can simply truncate them to 512 tokens and pad the sequences that have fewer than 512 tokens. If a majority of sequences are longer than 512, truncating might not be a good option, since most of your input sequences could lose their meaning in the process. In that case, a better option would be to find a transformer that has been trained on 1024-length sequences.
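
As a rough sketch of that decision, the snippet below measures how many sequences exceed the limit and then truncates and pads each one. The helper names, the padding ID of 0, and the toy limit of 4 (standing in for 512) are assumptions for illustration.

```python
def fraction_too_long(sequences, max_len):
    """Share of sequences that would lose tokens if truncated to max_len."""
    return sum(len(seq) > max_len for seq in sequences) / len(sequences)


def truncate_and_pad(seq, max_len, pad_id=0):
    """Cut a sequence to max_len tokens, then pad it up to max_len if shorter."""
    clipped = list(seq[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


MAX_LEN = 4  # toy stand-in for the model's 512-token limit
data = [[5, 6, 7], [1, 2, 3, 4, 5, 6], [8, 9], [3, 3, 3, 3]]

print(fraction_too_long(data, MAX_LEN))  # 0.25
print([truncate_and_pad(seq, MAX_LEN) for seq in data])
# [[5, 6, 7, 0], [1, 2, 3, 4], [8, 9, 0, 0], [3, 3, 3, 3]]
```

If the reported fraction is small, truncation costs little; if it is large, a model trained on longer sequences is likely the safer choice.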

Can has already answered part of this question, i.e., padding is used to deal with sequences of different lengths. Another thing to note here, which is more interesting in my opinion, is that the process differs between training and inference. During training, we use something known as Teacher Forcing, which you can read more about here.

Please make sure you understand teacher forcing before you read ahead!

During inference, the decoder generates only one token at a time (i.e., it looks at the previous tokens and produces only “1” token as the output). But during training, when we use “teacher forcing”, the correct previous tokens are already known, so the decoder can compute the predictions for all positions in parallel. Note carefully that each position still predicts only one token (the next one); for a target of 10 tokens, the 10 predictions are computed simultaneously, typically in a single forward pass with a causal mask, instead of one after another.
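
A minimal sketch of the training-time setup, assuming the usual shift-by-one arrangement of teacher forcing together with a causal mask; the function names and the example sentence are made up for illustration.

```python
def teacher_forcing_pairs(tokens):
    """Split a target sequence into decoder inputs and next-token labels."""
    return tokens[:-1], tokens[1:]


def causal_mask(n):
    """mask[i][j] is 1 if position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


target = ["<bos>", "I", "like", "tea", "<eos>"]
inputs, labels = teacher_forcing_pairs(target)
mask = causal_mask(len(inputs))

print(inputs)  # ['<bos>', 'I', 'like', 'tea']
print(labels)  # ['I', 'like', 'tea', '<eos>']
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row of the mask corresponds to one prediction: row `i` sees only the inputs up to position `i` and is trained to output `labels[i]`, so all four predictions can be computed in one pass without any position seeing its own answer.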

I guess your previous question was also inspired by your thinking about “how to use transformers for sentiment analysis”. Please note that for different applications we can use different components of the transformer, i.e., we can modify the architecture to suit the application. For text generation, we usually use the Encoder + Decoder or a Decoder-only model (as in GPT-style models); for sentiment analysis, we can get good performance using just the Encoder.

Let us know if this helps.

