I’ve completed week two, watched the videos multiple times, and read the “Attention Is All You Need” paper. I feel like I have a reasonable mechanical understanding of what’s going on. I also understand many of the design choices, like the attention scaling factor and why dot-product attention is used instead of a dense layer.
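For reference, this is roughly the mechanical picture I have in my head of scaled dot-product attention (just a minimal NumPy sketch in my own notation, not code from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices from learned linear projections.
    d_k = Q.shape[-1]
    # Dot-product similarity between queries and keys, scaled by sqrt(d_k)
    # so the softmax doesn't saturate when d_k is large.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of the value vectors

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```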
But I do not actually understand why any of this works. In contrast, I could somewhat understand what CNNs are doing, because we’re able to inspect what’s being detected at various layers: I could almost come up with a loose description of what the learned function at different points might be. For the Transformer I can’t really make sense of how these pieces fit together to predict text. What is a loose description of the function being learned in the feed-forward part of the encoder? Is it different for each of the six stacked encoder layers, the way different CNN layers learn different features, or is it just distilling the same function over and over?
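To be concrete about which piece I mean, here is my understanding of the position-wise feed-forward sub-layer as a rough NumPy sketch (names are my own; the dimensions are the base model’s d_model=512 and d_ff=2048):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model). The same two-layer MLP is applied to every
    # position independently; it never mixes information across tokens.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between the two layers

# Base-model dimensions from the paper: d_model=512, d_ff=2048.
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```

Mechanically that all makes sense to me; what I can’t picture is what kind of function those weights end up representing.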
Could someone help me understand whether I’m just not familiar enough with the concepts yet to get more than a mechanical understanding of what’s going on, or whether models like the Transformer are a little magical? I’m worried I’m in a cooking class, feverishly studying a bread recipe and stressing out that, although I can make a tasty loaf, I don’t understand the chemistry behind it. Studying the recipe will probably never make me a chemist, so if that’s my situation I should stop stressing out.