Transformers architecture - Week 1

In the lecture “Transformer Architecture” in the course “Intro to LLMs and Gen AI Project Lifecycle”, Mike says, “The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence.”

My understanding is that the ability to “learn the relevance and context of all words” should be credited to the attention mechanism. For example, an RNN-based encoder-decoder with attention can also learn from the context of all the words in a sentence.

I also believe the power of Transformers comes from their ability to be trained in parallel (the full sentence processed as a whole), compared to RNNs, where training has to proceed sequentially, one token at a time.

Do I have this right?

Yup, the attention mechanism enables the model to learn the relevance and context of the words in a sentence.

You can use an attention mechanism with RNNs as well (and indeed the original attention mechanism was developed for RNN encoder-decoders). However, the attention mechanism used in Transformers has proven to be more effective, broadly speaking, than when used with other models like RNNs.
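For concreteness, here is a minimal NumPy sketch of the scaled dot-product (self-)attention used in Transformers. The RNN-style attention (Bahdanau et al.) uses a different scoring function, but the core idea of weighting every position by its relevance is the same. The shapes and toy data below are just illustrative, not from the course:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (as in "Attention Is All You Need").

    Q, K: (seq_len, d_k), V: (seq_len, d_v).
    Every query position attends to every key position, which is how the
    model weighs the relevance of all words in the sentence at once.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                                        # context-weighted mix of values

# Toy example: 4 "words", embedding size 8 (random placeholder embeddings)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)                   # self-attention: Q = K = V = x
print(out.shape)                                              # (4, 8)
```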

Yup, Transformers have more parallelizable components than RNNs like LSTMs. This is one of the advantages of Transformers.
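To illustrate the parallelism point, a rough sketch (toy NumPy code, with made-up weights and embeddings) contrasting the sequential dependency in an RNN with the whole-sequence matrix computation in self-attention:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))          # toy token embeddings
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# RNN-style recurrence: each hidden state depends on the previous one,
# so the time steps must be computed one after another.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)      # step t cannot start before step t-1 finishes

# Transformer-style self-attention: one matrix expression over the whole
# sequence, so all positions can be computed at once (and parallelized on a GPU).
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ x                          # all seq_len outputs in one shot
```

During training the Transformer sees the full sentence at once, so the attention computation above is a handful of large matrix multiplications, which is exactly the kind of work GPUs handle well; the RNN loop cannot be parallelized across time steps in the same way.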