Attempted to implement Multi-Head Latent Attention (MLA) but failed, seeking help

I recently tried to implement Multi-Head Latent Attention (MLA), and after a while I realized it was a bit ambitious for me right now. I tried to build it from scratch, but it did not work. I also tried using AI, but that did not work either.

The main issue is that the loss decreases almost exponentially, which seems unusual:

Training Metrics

| Epoch | Train Loss | Dev Loss | Perplexity | Data Size | ROUGE Score | LR | Time (s) |
|---|---|---|---|---|---|---|---|
| 1 | 7.9534 | 7.3119 | 1498.09 | 141446 | 0.4762 | 0.000025 | 330.66 |
| 2 | 6.8160 | 6.4285 | 619.23 | 141446 | 0.7249 | 0.000050 | 309.49 |
| 3 | 6.1660 | 5.4457 | 231.75 | 141446 | 0.5580 | 0.000075 | 472.20 |
| 4 | 4.6556 | 3.2029 | 24.60 | 141446 | 0.3651 | 0.000100 | 498.01 |
| 5 | 3.2691 | 2.1602 | 8.67 | 141446 | 0.4607 | 0.000100 | 497.30 |
| 6 | 2.3786 | 1.7038 | 5.49 | 141446 | 0.3211 | 0.000100 | 497.08 |
| 7 | 1.9257 | 1.5462 | 4.69 | 141446 | 0.7205 | 0.000100 | 498.87 |

A generated output example:
User Prompt:
i am disappointed
base_3.5M (Genre: ai):
, (ed) back) back).).)ful)))).)).)).)))))))):))))))))less)))))))))))))ed))in)).ā€œ))))))))))))))))).:)))).))))) you characters)):)?)))ā€)

Generation settings:
temp = 1.0
top_k = 50
top_p = 0.9
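
For context, these settings are applied in the usual temperature → top-k → top-p order during decoding. The snippet below is only a generic sketch of that filtering (the function name and shapes are illustrative, not my exact sampling code):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    # logits: (vocab,) unnormalized scores for the next token
    logits = logits / temperature                                 # temperature scaling
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep only the top-k logits
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = probs.cumsum(dim=-1)
        cutoff = (cum - probs) > top_p                            # tokens outside the nucleus
        sorted_logits = sorted_logits.masked_fill(cutoff, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(
            -1, sorted_idx, sorted_logits)                        # scatter back to vocab order
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)                # sampled token id
```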

I ran a comparison with standard MHA, using the same hyperparameters and the same training data (storytelling tasks), and the result is quite different:

User Prompt:
i am happy,
Basic_4.1M (Genre: ai):
as well as her relationship with other kids, although in an attempt to prevent him from becoming involved. The novel opens with Pandas family arriving at a local diner and they meet up with Dolly, but they are soon approached by Tanga and his boss.
*btw, it is a pre-trained model, not fine-tuned, so it is talking nonsense :wink:

After running additional experiments, I suspect the issue lies in my MLA code implementation rather than in hyperparameter tuning. I’m not yet an expert in Python, so I would greatly appreciate it if you could take a look at my code when you have time.

The paper for MLA: https://arxiv.org/pdf/2412.19437 (pages 7-8)
GitHub link: https://github.com/Lee-AI001/attempt_to_implement_multi-head-lantern-attention
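
To make comparison easier, here is a minimal, self-contained sketch of what I believe the MLA forward pass should look like, based on my reading of pages 7-8 of the paper. All module names and sizes (d_model, d_c, d_rope, ...) are illustrative placeholders, not my actual hyperparameters, and there is no KV cache here, just the training-time forward pass:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, positions):
    # x: (batch, seq, heads, rope_dim), rope_dim must be even
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device).float() / half))
    angles = positions[:, None].float() * freqs[None, :]         # (seq, half)
    cos = angles.cos()[None, :, None, :]                         # broadcast over batch/heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_c=128, d_cq=192,
                 d_head=64, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_head, d_rope
        # query path: down-projection, then up-projection + decoupled RoPE part
        self.w_dq = nn.Linear(d_model, d_cq, bias=False)
        self.w_uq = nn.Linear(d_cq, n_heads * d_head, bias=False)
        self.w_qr = nn.Linear(d_cq, n_heads * d_rope, bias=False)
        # key/value path: one compressed latent plus a decoupled RoPE key
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)        # shared across heads
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        pos = torch.arange(t, device=x.device)

        cq = self.w_dq(x)
        q_c = self.w_uq(cq).view(b, t, self.n_heads, self.d_head)
        q_r = apply_rope(self.w_qr(cq).view(b, t, self.n_heads, self.d_rope), pos)

        c_kv = self.w_dkv(x)                                      # the latent you would cache
        k_c = self.w_uk(c_kv).view(b, t, self.n_heads, self.d_head)
        v = self.w_uv(c_kv).view(b, t, self.n_heads, self.d_head)
        k_r = apply_rope(self.w_kr(x).view(b, t, 1, self.d_rope), pos)
        k_r = k_r.expand(b, t, self.n_heads, self.d_rope)         # same RoPE key for all heads

        q = torch.cat([q_c, q_r], dim=-1).transpose(1, 2)         # (b, h, t, d_head + d_rope)
        k = torch.cat([k_c, k_r], dim=-1).transpose(1, 2)
        v = v.transpose(1, 2)

        scale = 1.0 / math.sqrt(self.d_head + self.d_rope)        # scale by the full query dim
        scores = (q @ k.transpose(-2, -1)) * scale
        mask = torch.triu(torch.ones(t, t, device=x.device, dtype=torch.bool), 1)
        scores = scores.masked_fill(mask, float("-inf"))          # causal mask
        out = F.softmax(scores, dim=-1) @ v                       # (b, h, t, d_head)
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))

# quick shape check: MLA()(torch.randn(2, 16, 512)) -> (2, 16, 512)
```

If anyone can point out where my repo diverges from this, especially around the decoupled RoPE key being shared across heads and the scaling by sqrt(d_head + d_rope), that would already help a lot.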