In the lecture titled "Transformer Architecture" in the course "Intro to LLMs and Gen AI Project Lifecycle", Mike says, "The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence."
My understanding is that the ability to "learn the relevance and context of all words" should be credited to the attention mechanism itself, not to the transformer architecture specifically. For example, an RNN-based encoder-decoder with attention also learns from the context of all words in a sentence.
I also believe the real power of Transformers comes from their ability to be trained in parallel (processing the full sentence at once), whereas RNNs must process tokens sequentially, one step at a time.
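To illustrate the parallelism point, here is a minimal sketch of single-head self-attention in NumPy (toy random weights, not a trained model): the attention scores for every token against every other token come from one matrix multiply, with no sequential loop over positions as in an RNN.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for a 4-token sentence, model dimension 8.
# The projection matrices are random here; real models learn them.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv

# One matmul scores all tokens against all tokens at once,
# so the whole sentence is processed in parallel.
scores = softmax(Q @ K.T / np.sqrt(8))  # shape (4, 4)
out = scores @ V                        # shape (4, 8)
```

Each row of `scores` is a distribution over all four tokens, which is the "context of all words" part; the fact that it is computed in a single batched operation is the parallel-training part.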
Do I have this right?