Week 2 - Fine-tuning and LLM Evaluation in practice

  1. The course covers fine-tuning to align a model with a specific task or domain, then switches to multitask fine-tuning, which still targets specific tasks.
  2. The LLM evaluation part covers metrics such as ROUGE/BLEU and benchmarks.

I am trying to link 1 and 2 and come up with these practical conclusions:

a. If I fine-tune a model, I should use ROUGE/BLEU scores or some metric based on word embeddings to measure my progress.
b. If I am building my own LLM, I should evaluate it on benchmarks.
c. If my domain is similar to one covered by an existing benchmark, I may be able to use that benchmark as well.
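To make point (b) concrete, benchmark evaluation usually reduces to accuracy-style scoring against gold labels on a held-out set. Below is a minimal sketch; the `items`, `answer_fn`, and the trivial first-choice policy are all hypothetical stand-ins, not part of any real benchmark harness:

```python
# Hypothetical benchmark-style evaluation: score a model's answers against
# gold labels for a small set of multiple-choice items.
items = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "gold": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "gold": "Paris"},
]

def answer_fn(question: str, choices: list[str]) -> str:
    # Stand-in for a real model call; here we just pick the first choice.
    return choices[0]

def benchmark_accuracy(items, answer_fn) -> float:
    # Fraction of items where the model's answer matches the gold label.
    correct = sum(answer_fn(it["question"], it["choices"]) == it["gold"] for it in items)
    return correct / len(items)

print(benchmark_accuracy(items, answer_fn))  # 0.5 (second item answered correctly)
```

Real benchmarks (e.g. the ones discussed in the course) differ mainly in scale and in how answers are extracted from free-form model output, but the scoring loop has this shape.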

How does that sound to you?

The ROUGE and BLEU metrics are there to help measure your model's performance when reference outputs are available. It's up to you whether to use them, but they can be helpful.

Fine-tuning is useful, but it may worsen the model's performance on other tasks; this is discussed in the course too.