Welcome to the community, @myacoob!
BLEU and ROUGE are n-gram metrics which are definitely powerful, but, as you pointed out, they are not a silver bullet for everything.
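To make the n-gram idea concrete, here is a minimal sketch of scoring a candidate sentence with BLEU via NLTK (the sentences are just toy examples):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example: one reference translation and one candidate
reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of reference sentences
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids a zero score when some higher-order n-gram has no match
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```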
Here you can find a nice paper on the evaluation of word embeddings. Feel free to take a look at it:
Additionally, it might be a good strategy to get familiar with standard data science evaluation metrics such as the Kullback–Leibler divergence, sometimes also called relative entropy. It is often used in NLP to quantify how much one probability distribution differs from another, see also this thread: Measurement of diversity - #4 by Christian_Simonis. The good thing is: you will explore that in week 3 of the LLM course.
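If you want to play with it right away, here is a minimal sketch of computing the KL divergence between two discrete distributions with SciPy (the distributions p and q are just made-up example values):

```python
import numpy as np
from scipy.stats import entropy

# Two example discrete probability distributions (made-up values)
p = np.array([0.5, 0.3, 0.2])  # "true" distribution P
q = np.array([0.4, 0.4, 0.2])  # approximating distribution Q

# scipy.stats.entropy(p, q) returns D_KL(P || Q) in nats
kl_pq = entropy(p, q)
print(f"D_KL(P || Q) = {kl_pq:.4f}")

# Note: KL divergence is not symmetric, so D_KL(P || Q) != D_KL(Q || P)
kl_qp = entropy(q, p)
print(f"D_KL(Q || P) = {kl_qp:.4f}")
```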
Hope that helps!
Best regards
Christian