Hey @20020069_Le_Thai_S_n,
Welcome, and we are glad that you could become a part of our community!
First, I would like to thank you for creating this thread. I also learnt something new while putting together an answer to it.
This is because the model outputs the probabilities corresponding to each of the positions, i.e., if we have `padded_length = 100`, it will output the probabilities corresponding to each of the 100 positions (or tokens). The model is structured this way deliberately: in order to compute the loss for each of the tokens during training, we need the probabilities corresponding to each of the tokens. I hope this makes sense now.
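Just to make the shapes concrete, here is a minimal sketch of this idea. It is written in PyTorch purely for illustration (the course materials may use a different framework), and every number and name here, including `pad_token_id`, is made up: the model produces one probability vector per position, and the per-token training loss simply ignores the padding positions.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: padded_length = 100, vocabulary of 8000 tokens.
batch_size, padded_length, vocab_size = 32, 100, 8000
pad_token_id = 0  # assumed id of the padding token

# What a decoder typically returns: one logit vector per position.
logits = torch.randn(batch_size, padded_length, vocab_size)  # (32, 100, 8000)
probs = F.softmax(logits, dim=-1)                            # probabilities for each of the 100 positions

# Target token ids, padded to the same length.
targets = torch.randint(0, vocab_size, (batch_size, padded_length))

# The training loss is computed per token; padding positions are ignored.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),  # (32 * 100, 8000)
    targets.reshape(-1),             # (32 * 100,)
    ignore_index=pad_token_id,
)
print(probs.shape, loss.item())
```

At inference time you would typically keep only the probabilities at the positions you actually care about and discard the ones produced for the padding positions.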
I don’t think it should be an issue. It’s quite analogous to CNNs running inference on multiple examples simultaneously. We just need to make sure that the `padded_length` for each of the examples in a single batch is the same, and that should be it (see the sketch below).
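To make the batching point concrete, here is a minimal, framework-free sketch of padding every example in a batch to a common `padded_length`; the pad id and the token-id lists are invented for illustration.

```python
pad_token_id = 0  # assumed id of the padding token

def pad_batch(sequences, padded_length):
    """Right-pad each token-id list to padded_length and build a matching mask."""
    padded, mask = [], []
    for seq in sequences:
        seq = seq[:padded_length]                      # truncate if too long
        pad_len = padded_length - len(seq)
        padded.append(seq + [pad_token_id] * pad_len)  # pad with the pad id
        mask.append([1] * len(seq) + [0] * pad_len)    # 1 = real token, 0 = padding
    return padded, mask

batch = [[5, 17, 42], [8, 3, 99, 12, 7]]
ids, attention_mask = pad_batch(batch, padded_length=8)
# ids -> [[5, 17, 42, 0, 0, 0, 0, 0], [8, 3, 99, 12, 7, 0, 0, 0]]
```

Once every example shares the same length, the batch can be stacked into a single tensor, just as images of equal size are stacked for a CNN.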
As for speed comparisons between transformers and CNNs, I am not sure this is a question worth dwelling on. CNNs are designed to exploit spatial structure, which is not the defining attribute of natural language. Sequence models and transformers, on the other hand, are designed to exploit temporal (sequential) structure, which is a key attribute of every natural language application.
If you still think this is a concern, feel free to train a CNN-based architecture and a transformer-based architecture on any natural language task, and decide for yourself whether the speed of the CNN-based model justifies the large drop in performance. Please do share your results with the community.
Honestly speaking, I never thought about this at all prior to your question. I checked this article out, and it does list Text Summarization as an application of the Encoder-Decoder architecture. So, what are we missing here? It turns out @arvyzukai has already posted an answer to this question, which you can find here.
Let us know if this helps.
Cheers,
Elemento