In the week 2 discussion of LoRA, we are shown how a 512×64 matrix of transformer weights is decomposed into two smaller, more easily tuned matrices: 512×8 and 8×64.
This all makes sense, but it made me realize it was never explained where the 512×64 dimensions came from in the first place! I understand that 512 is the length of each token's embedding vector. Does that mean 64 is the number of tokens, or the size of the context window?
Here's where those dimensions come from:
The 512 comes from the model’s embedding size.
The 64 comes from splitting 512 across 8 attention heads.
It represents the per-head feature dimension, not the number of tokens or the context window.
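A quick sketch may make the dimensions and the parameter savings concrete. This is a minimal illustration, assuming the rank-8 decomposition described above (the variable names `d_model`, `n_heads`, and `r` are mine, not from the course):

```python
import numpy as np

d_model = 512                # embedding size
n_heads = 8                  # attention heads
d_head = d_model // n_heads  # 64: per-head feature dimension
r = 8                        # LoRA rank

# One per-head projection slice has shape d_model x d_head = 512 x 64
W = np.zeros((d_model, d_head))

# LoRA: instead of tuning W directly, tune A (512x8) and B (8x64)
A = np.zeros((d_model, r))
B = np.zeros((r, d_head))

print(W.size)           # 32768 trainable values with full fine-tuning
print(A.size + B.size)  # 4608 with LoRA, roughly 7x fewer
print((A @ B).shape)    # (512, 64): the update matches W's shape
```

The product `A @ B` has exactly the shape of the original weight matrix, which is why the low-rank update can be added to the frozen weights at inference time.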
Thanks. So if, theoretically, we were to use just one attention head, the transformer's projection matrix would always be a square matrix of the same size as the embedding vectors?
If a transformer used a single attention head, the projection matrices would typically be 512×512: square matrices matching the embedding size.
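A small sketch of the single-head versus multi-head case, assuming the common convention that the per-head projections together keep `d_model` total columns (names like `d_model` and `d_head` are mine):

```python
d_model = 512

# One head: the per-head dimension equals the full embedding size,
# so each Q/K/V projection is square, 512 x 512
n_heads = 1
d_head = d_model // n_heads
print((d_model, d_head))  # (512, 512)

# Eight heads: the same 512 columns are split into 8 slices of 64,
# so each per-head projection is 512 x 64
n_heads = 8
d_head = d_model // n_heads
print((d_model, d_head))  # (512, 64)
```

In both cases the total number of projection parameters per matrix is the same (512 × 512); multiple heads just partition those columns into independent subspaces.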