Hello,
I just wanted to share a wonderful video I found on YouTube that beautifully explains Self-Attention and Multi-Head Attention (based on the paper "Attention Is All You Need"). It is 39 minutes long, but it makes both the structure of the model and the intuition behind each component crystal clear. Here is the link, should you wish to view it:
Self-Attention and Multi-Head Attention Video
Please note that the author mentions that he doesn’t cover positional encoding or masking in this video, but he does cover them in another one.
Hope you find this helpful. I know I did!
Melanie