I recently tried to implement Multi-Head Latent Attention (MLA), and after a while I realized it was a bit ambitious for me right now. I tried writing it from scratch, but it did not work; I also tried using AI assistance, and it still did not work.
The main issue is that the loss drops almost exponentially, which seems unusual:
Training Metrics

| Epoch | Train Loss | Dev Loss | Perplexity | Data Size | ROUGE Score | LR       | Time (s) |
|-------|------------|----------|------------|-----------|-------------|----------|----------|
| 1     | 7.9534     | 7.3119   | 1498.09    | 141446    | 0.4762      | 0.000025 | 330.66   |
| 2     | 6.8160     | 6.4285   | 619.23     | 141446    | 0.7249      | 0.000050 | 309.49   |
| 3     | 6.1660     | 5.4457   | 231.75     | 141446    | 0.5580      | 0.000075 | 472.20   |
| 4     | 4.6556     | 3.2029   | 24.60      | 141446    | 0.3651      | 0.000100 | 498.01   |
| 5     | 3.2691     | 2.1602   | 8.67       | 141446    | 0.4607      | 0.000100 | 497.30   |
| 6     | 2.3786     | 1.7038   | 5.49       | 141446    | 0.3211      | 0.000100 | 497.08   |
| 7     | 1.9257     | 1.5462   | 4.69       | 141446    | 0.7205      | 0.000100 | 498.87   |
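(For reference, the perplexity column appears to be exp(dev loss); e.g. exp(1.5462) ≈ 4.69 for epoch 7.)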
A generated output example:
User Prompt:
i am disappointed
base_3.5M (Genre: ai):
, (ed) back) back).).)ful)))).)).)).)))))))):))))))))less)))))))))))))ed))in)).ā))))))))))))))))).:)))).))))) you characters)):)?)))ā)
Generation settings (see the sampling sketch after this list):
temp = 1.0
top_k = 50
top_p = 0.9
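For context, here is a minimal sketch of how I understand these settings being applied at each decoding step (temperature scaling, then top-k, then top-p). The function name and the exact ordering are my assumptions for illustration, not the exact code in my repo:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    # Temperature scaling of the raw logits for the next token
    logits = logits / temperature
    # Top-k: mask out everything below the k-th largest logit
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p, drop the rest
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    cutoff = (cum - sorted_probs) > top_p        # tokens entirely past the nucleus
    sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    next_in_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, next_in_sorted)
```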
For comparison, I trained a normal MHA model with the same hyperparameters and the same training data (storytelling tasks), and the result is quite different:
User Prompt:
i am happy,
Basic_4.1M (Genre: ai):
as well as her relationship with other kids, although in an attempt to prevent him from becoming involved. The novel opens with Pandas family arriving at a local diner and they meet up with Dolly, but they are soon approached by Tanga and his boss.
*By the way, this is a pre-trained model, not a fine-tuned one, so the content itself is nonsense.
After running additional experiments, I suspect the issue lies in my MLA implementation rather than in hyperparameter tuning. I'm not yet an expert in Python, so I would greatly appreciate it if you could take a look at my code when you have time.
The MLA paper: https://arxiv.org/pdf/2412.19437 (pages 7-8)
GitHub link: https://github.com/Lee-AI001/attempt_to_implement_multi-head-lantern-attention
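For anyone willing to dig in, below is a minimal, self-contained sketch of how I read the MLA formulation from the paper (low-rank query and KV compression plus a decoupled RoPE part). The class name, dimension sizes, and the `apply_rope` helper are illustrative choices of mine, not the code in my repo:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, dim) with even dim; standard rotary embedding
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device).float() / half)
    pos = torch.arange(t, device=x.device).float()
    angles = torch.einsum("t,f->tf", pos, freqs)          # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_head=64,
                 d_kv_latent=128, d_q_latent=192, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_head, d_rope
        # Query path: down-project to a latent, then up-project per head,
        # plus a per-head decoupled rotary part.
        self.w_dq = nn.Linear(d_model, d_q_latent, bias=False)
        self.w_uq = nn.Linear(d_q_latent, n_heads * d_head, bias=False)
        self.w_qr = nn.Linear(d_q_latent, n_heads * d_rope, bias=False)
        # KV path: one shared latent c_kv, up-projected to keys and values,
        # plus a single rotary key shared across heads.
        self.w_dkv = nn.Linear(d_model, d_kv_latent, bias=False)
        self.w_uk = nn.Linear(d_kv_latent, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_kv_latent, n_heads * d_head, bias=False)
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        split = lambda z, d: z.view(b, t, self.n_heads, d).transpose(1, 2)
        # Queries
        c_q = self.w_dq(x)
        q_c = split(self.w_uq(c_q), self.d_head)                  # (b, h, t, d_head)
        q_r = apply_rope(split(self.w_qr(c_q), self.d_rope))      # (b, h, t, d_rope)
        # Keys / values from the shared latent (at inference, the latent and
        # the rotary key are what would be cached)
        c_kv = self.w_dkv(x)
        k_c = split(self.w_uk(c_kv), self.d_head)
        v = split(self.w_uv(c_kv), self.d_head)
        k_r = apply_rope(self.w_kr(x).view(b, 1, t, self.d_rope)) # shared across heads
        k_r = k_r.expand(b, self.n_heads, t, self.d_rope)
        # Concatenate compressed and rotary parts, then ordinary causal attention
        q = torch.cat([q_c, q_r], dim=-1)
        k = torch.cat([k_c, k_r], dim=-1)
        scale = 1.0 / math.sqrt(self.d_head + self.d_rope)
        scores = (q @ k.transpose(-2, -1)) * scale
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))           # causal mask
        out = F.softmax(scores, dim=-1) @ v                        # (b, h, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)
```

The property I am trying to get is that, at inference time, only the shared latent `c_kv` and the single rotary key `k_r` would need to be cached instead of full per-head keys and values.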