- The course covers fine-tuning to align a model with a specific task or domain, then moves on to multitask fine-tuning, which still targets specific tasks.
- The LLM evaluation part covers evaluation with ROUGE/BLEU scores and with benchmarks.
I am trying to link 1 and 2, and I have come to these practical conclusions:
a. If I need to fine-tune a model, I have to use ROUGE/BLEU scores, or some metric based on word embeddings, to measure my progress.
b. If I am building my own LLM, I have to use a benchmark.
c. I may be able to use an existing benchmark if my domain is similar to one that the benchmark covers.
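
To make (a) concrete, this is roughly what I have in mind: a minimal sketch of a ROUGE-1 style score (unigram overlap F1) in plain Python. The function name `rouge1_f1` and the example sentences are made up for illustration, and the whitespace tokenization is a simplifying assumption; a real setup would more likely rely on an established ROUGE/BLEU implementation.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a model output and a reference text.

    Hypothetical helper for illustration: lowercases and splits on
    whitespace, with no stemming or punctuation handling.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: one reference and two model outputs, standing in
# for the model before and after fine-tuning.
reference = "the cat sat on the mat"
output_before = "a dog is here"
output_after = "the cat is on the mat"

score_before = rouge1_f1(output_before, reference)
score_after = rouge1_f1(output_after, reference)
print(f"before: {score_before:.3f}, after: {score_after:.3f}")
```

The idea would be to average such a score over a held-out set of task examples and compare the value before and after tuning.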
How does this sound to you?