Week 2 Lab: Training configuration of the PEFT model

Hi,

I am currently taking the LLM course and have just finished week 2.

To make sure that I’ve assimilated the concepts covered in the second lab, particularly the LoRA method, I’ve redone the work in a Colab session.

At some point in the lab work, we load a model that has already been fine-tuned, so as to avoid a long training run (section 3.2).
Since I hadn’t downloaded this model, I decided to train my fine-tuned model myself (on the full training set), but I get much worse performance than with the model provided in the lab.

Here’s my training configuration:
“”"
lora_config = LoraConfig(
r=8, # Rank
lora_alpha=32,
target_modules=[“q”, “v”],
lora_dropout=0.05,
bias=“none”,
task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

peft_model = get_peft_model(model, lora_config).to(‘cuda’)

os.environ[“WANDB_DISABLED”] = “true”

output_dir = ‘/content’
training_args = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
learning_rate=1e-3,
num_train_epochs=100,
logging_steps=10,
evaluation_strategy=“steps”,
eval_steps=10,
save_strategy=“steps”,
save_steps=10,
save_total_limit=2,
max_steps=100,
load_best_model_at_end=True
)

peft_trainer = Trainer(
model=peft_model,
args=training_args,
train_dataset=tokenized_dataset[“train”],
eval_dataset=tokenized_dataset[“validation”],
tokenizer=tokenizer,
)

peft_trainer.train()

final_model = peft_model.merge_and_unload()
“”"

Then I run inference with the final model, but performance is often worse than with the base model, so I wanted to know whether this comes from my training configuration, which may be quite different from the one used by the instructors, or whether I simply missed something.
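
For reference, this is roughly how I run those inferences (a minimal sketch; the prompt shown is just a hypothetical placeholder for one of the lab's inputs):

# Minimal inference sketch; `prompt` is a hypothetical placeholder input.
prompt = "Summarize the following conversation.\n\n..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output_ids = final_model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))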

Thank you in advance and have a nice day.

Hi Clément, welcome to the community!

Here are a few observations and suggestions you can try to improve your results:

  1. Adjust your learning rate, e.g. 5e-5 or 1e-4, and your dropout values.
  2. Use only one of num_train_epochs or max_steps; when both are set, max_steps takes precedence, so your run stopped after 100 steps.
  3. Experiment with extended target_modules (target_modules=["q", "k", "v", "o"] or other combinations) and higher rank values in the LoRA configuration. Similarly, experimenting with lora_alpha values around 16-64 may give better results; see the sketch after this list.
  4. Monitor the training logs to track validation loss and performance metrics such as ROUGE or BLEU to detect early signs of underfitting or overfitting.
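
Putting these suggestions together, a revised configuration might look something like the sketch below. This is only a starting point, not the instructors' actual settings; the rank, learning rate, and step values are placeholders to tune against your validation set.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import TrainingArguments

# Hypothetical starting point -- tune these against your validation set.
lora_config = LoraConfig(
    r=16,                                 # try a higher rank than 8
    lora_alpha=32,
    target_modules=["q", "k", "v", "o"],  # all attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
peft_model = get_peft_model(model, lora_config).to('cuda')

training_args = TrainingArguments(
    output_dir='/content',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-4,           # lower than the original 1e-3
    num_train_epochs=3,           # epochs only; max_steps removed
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    load_best_model_at_end=True
)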

Keep experimenting and good luck!

Hi,

Thanks a lot for your quick and helpful answer. I experimented with the parameters as you advised, but I still need to try other combinations.
While training my PEFT model, I tried to set ROUGE as the evaluation metric to better monitor performance.

I’ve set my training as follows:

import evaluate
import numpy as np

rouge = evaluate.load('rouge')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Trainer hands compute_metrics numpy arrays, and with a plain Trainer
    # the predictions are logits, so take the argmax over the vocabulary
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    predictions = np.argmax(predictions, axis=-1)
    # Replace -100 with the PAD token to avoid decoding mistakes
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # evaluate's rouge returns aggregated F1 floats directly
    rouge_scores = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
    }

import os
os.environ['WANDB_DISABLED'] = 'true'
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=10,
    evaluation_strategy="steps", 
    eval_steps=10,  
    save_strategy="steps",
    save_steps=10,
    save_total_limit=2,
    metric_for_best_model="rouge2",  
    greater_is_better=True  
)

peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, 
)

But with this configuration I run into an out-of-memory error (probably caused by the compute_metrics function, since the error disappears when I comment that line out), even on an A100 GPU with 40 GiB of memory, and even if I move the data to the CPU.

I apologize for pasting raw code like this, but I wanted to ask whether I can change something to keep tracking the ROUGE metrics during training while avoiding the memory overflow.

Thank you once again, and I apologize if my questions are not in the right section of the forum.

Have an excellent day.

Glad to hear you’re making progress with your experiments! Memory issues during evaluation, especially when computing metrics like ROUGE, are a common challenge when fine-tuning large models. Here are some suggestions:

  • Increase eval_steps to reduce the evaluation frequency. For example, eval_steps=50.
  • Use batch decoding in compute_metrics to avoid processing the whole data set at once.
  • Use gradient accumulation to simulate a larger batch size. For example, gradient_accumulation_steps=4.
  • Evaluate on a subset of the validation data or use mixed-precision training (fp16=True) to reduce memory requirements further, as shown in the sketch after this list.
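
Concretely, a memory-friendlier setup might look like the sketch below. It assumes the rouge, tokenizer, peft_model, and tokenized_dataset objects from your snippets; the subset size of 200 and the decoding chunk size of 32 are arbitrary values to adjust.

import numpy as np
from transformers import Trainer, TrainingArguments

# Evaluate on a subset of the validation set (200 is an arbitrary choice).
small_eval = tokenized_dataset["validation"].select(range(200))

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    predictions = np.argmax(predictions, axis=-1)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode in chunks instead of the whole evaluation set at once.
    decoded_preds, decoded_labels = [], []
    for i in range(0, len(predictions), 32):
        decoded_preds += tokenizer.batch_decode(predictions[i:i + 32], skip_special_tokens=True)
        decoded_labels += tokenizer.batch_decode(labels[i:i + 32], skip_special_tokens=True)

    scores = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {k: scores[k] for k in ("rouge1", "rouge2", "rougeL")}

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,   # effective train batch size of 8
    learning_rate=5e-5,
    num_train_epochs=3,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,                   # evaluate less often than every 10 steps
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    fp16=True,                       # mixed precision to cut memory use
    metric_for_best_model="rouge2",
    greater_is_better=True,
)

peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=small_eval,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)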

Hope this helps. Have a good day and happy experimenting!