Hi all,
In the week 2 lab [2.2 - Fine-Tune the Model with the Preprocessed Dataset], it is written that fully fine-tuning the model would take a few hours on a GPU (“Training a fully fine-tuned version of the model would take a few hours on a GPU. To save time, download a checkpoint of the fully fine-tuned model to use in the rest of this notebook. This fully fine-tuned model will also be referred to as the instruct model in this lab.”)
Now, I’m still trying to do it on my computer with an RTX 3060 mobile GPU, which forces me to use per_device_train_batch_size=4, so my code looks like this:
import time
from transformers import TrainingArguments, Trainer

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    # num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    save_steps=1000,
    max_steps=10000,
    per_device_train_batch_size=4,  # limited by the laptop GPU's memory
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)
The training does not seem to converge: the loss does not go below 28 after 4000 steps. Could someone please share a set of parameters that gets the training to a satisfactory level?
Thank you very much!
I don't think those parameters are shared publicly in the course…
Thank you.
Even though these are not shared within the course, I was wondering if anyone could guide me in finding them. What would be the typical process for arriving at a satisfactory set of hyperparameters for this kind of fine-tuning?
1. Understand the Problem & Baseline
- Start with defaults: Use the hyperparameters from the pre-trained model or standard values (e.g., learning rate = 1e-5 to 1e-3, batch size = 16–64).
- Baseline performance: Measure the model’s performance on your validation set without any tuning. This sets a reference point (a sketch follows this list).
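For the baseline measurement, a minimal sketch, reusing the trainer and tokenized_datasets objects from the code above, could look like this:

# Evaluate the un-tuned model once, so every later run can be compared to this reference loss.
baseline_metrics = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])
print(baseline_metrics['eval_loss'])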
2. Identify Key Hyperparameters
Focus on parameters that significantly impact fine-tuning (a starter configuration is sketched after this list). Common ones include:
- Learning rate: Often the most critical. Start low (e.g., 1e-5) to avoid catastrophic forgetting.
- Batch size: Smaller batches (e.g., 8–32) can improve generalization; larger batches speed up training.
- Epochs: Use early stopping to prevent overfitting. Start with 3–10 epochs.
- Optimizer: AdamW (common for transformers) vs. SGD with momentum.
- Weight decay: Regularization strength (e.g., 0.01).
- Warmup steps: Gradually increase the learning rate at the start of training.
- Layer freezing: Freeze some layers initially, then unfreeze progressively (e.g., unfreeze the top 1–2 layers first).
- Dropout: Adjust if overfitting (e.g., 0.1–0.3 in transformer models).
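As a concrete starting point, here is a sketch of a TrainingArguments configuration built from the ranges above; the specific values are illustrative assumptions, not the course's official settings:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./dialogue-summary-tuning',
    learning_rate=1e-5,                # start low to avoid catastrophic forgetting
    per_device_train_batch_size=8,     # smaller batches; raise this if GPU memory allows
    num_train_epochs=3,                # upper bound, combined with early stopping (step 4)
    weight_decay=0.01,                 # regularization strength
    warmup_steps=500,                  # ramp the learning rate up at the start of training
    logging_steps=50,                  # logging every step floods the logs over thousands of steps
    evaluation_strategy='epoch',       # evaluate once per epoch to track validation loss
    save_strategy='epoch',
)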
3. Choose a Search Strategy
- Manual Search: Start with coarse adjustments (e.g., try learning rates 1e-5, 1e-4, 1e-3).
- Grid Search: Exhaustive search over predefined ranges (e.g., learning rate ∈ [1e-5, 5e-5, 1e-4]). Computationally expensive.
- Random Search: Randomly sample hyperparameters (better than grid search for high-dimensional spaces).
- Bayesian Optimization: Efficiently explores promising regions (tools: Optuna, Hyperopt).
- Automated Tools: Use frameworks like transformers.Trainer (Hugging Face) with built-in hyperparameter search (a sketch follows this list).
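For the automated route, transformers.Trainer exposes hyperparameter_search, which can drive Optuna under the hood. A rough sketch, assuming the model_name and tokenized_datasets variables from the lab notebook and a seq2seq model such as FLAN-T5:

from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

def model_init():
    # A fresh model is created for every trial so runs do not contaminate each other.
    return AutoModelForSeq2SeqLM.from_pretrained(model_name)

def hp_space(trial):
    # Optuna trial object: sample only the hyperparameters being searched.
    return {
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-4, log=True),
        'per_device_train_batch_size': trial.suggest_categorical('per_device_train_batch_size', [4, 8, 16]),
        'weight_decay': trial.suggest_float('weight_decay', 0.0, 0.1),
    }

search_trainer = Trainer(
    model_init=model_init,             # note: model_init instead of model
    args=TrainingArguments(output_dir='./hp-search', evaluation_strategy='epoch', num_train_epochs=1),
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

best_run = search_trainer.hyperparameter_search(
    hp_space=hp_space,
    backend='optuna',
    n_trials=10,
    direction='minimize',              # minimize validation loss
)
print(best_run.hyperparameters)

Running each trial on a small slice of the training set (the datasets library's .select() or .shard() methods) keeps the search affordable on a single laptop GPU.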
4. Validate & Iterate
- Validation set: Always use a hold-out validation set (or cross-validation) to evaluate performance.
- Early stopping: Monitor validation loss and stop training if it plateaus (patience = 3–5 epochs); a callback sketch follows this list.
- Track experiments: Use tools like Weights & Biases or TensorBoard to log results and compare runs.
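Early stopping is available out of the box through EarlyStoppingCallback; a sketch, again reusing original_model and tokenized_datasets from the code above:

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir='./dialogue-summary-early-stop',
    learning_rate=1e-5,
    num_train_epochs=10,               # upper bound; early stopping usually ends training sooner
    evaluation_strategy='epoch',       # early stopping needs periodic evaluation
    save_strategy='epoch',
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model='eval_loss',
    greater_is_better=False,           # lower validation loss is better
)

trainer = Trainer(
    model=original_model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evaluations with no improvement
)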
5. Fine-Tuning Best Practices
- Gradual unfreezing: Start by fine-tuning only the top layers, then unfreeze deeper layers incrementally.
- Discriminative learning rates: Use lower learning rates for earlier layers (closer to the input) and higher rates for later layers (both techniques are sketched after this list).
- Avoid overfitting: If validation performance degrades, reduce model complexity (e.g., freeze more layers) or increase regularization.
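A sketch of both ideas for a T5/FLAN-T5-style encoder-decoder; the attribute names encoder, decoder.block, and lm_head assume that architecture, and the two techniques are shown independently:

from torch.optim import AdamW

# Gradual unfreezing: freeze everything, then re-enable the top decoder blocks and the output head.
for param in original_model.parameters():
    param.requires_grad = False

for block in original_model.decoder.block[-2:]:      # top 2 decoder blocks
    for param in block.parameters():
        param.requires_grad = True

for param in original_model.lm_head.parameters():
    param.requires_grad = True

# Discriminative learning rates: lower near the input, higher near the output.
optimizer = AdamW([
    {'params': original_model.encoder.parameters(), 'lr': 5e-6},
    {'params': original_model.decoder.parameters(), 'lr': 1e-5},
    {'params': original_model.lm_head.parameters(), 'lr': 3e-5},
])
# The custom optimizer can be handed to Trainer via optimizers=(optimizer, None).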
6. Example Workflow
- Start with default parameters and a small number of epochs.
- Perform a coarse random search over learning rate, batch size, and weight decay (a minimal sketch follows this list).
- Refine the search around the best-performing values.
- Use early stopping and monitor validation metrics.
- Gradually unfreeze layers and repeat the process.
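A minimal, manual version of that coarse random search, assuming the model_name and tokenized_datasets variables from the notebook and deliberately short trial runs:

import random
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

search_space = {
    'learning_rate': [1e-5, 3e-5, 5e-5, 1e-4],
    'per_device_train_batch_size': [4, 8, 16],
    'weight_decay': [0.0, 0.01, 0.1],
}

def run_trial(config):
    # Short training run with the sampled configuration; returns the validation loss.
    args = TrainingArguments(output_dir='./coarse-search', max_steps=500, **config)
    trainer = Trainer(
        model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
        args=args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['validation'],
    )
    trainer.train()
    return trainer.evaluate()['eval_loss']

results = []
for _ in range(8):                                   # a handful of coarse trials
    config = {k: random.choice(v) for k, v in search_space.items()}
    results.append((run_trial(config), config))

best_loss, best_config = min(results, key=lambda r: r[0])
print(best_loss, best_config)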
7. Domain-Specific Considerations
- NLP: For transformer models (e.g., BERT), smaller learning rates (1e-5 to 3e-5) and batch sizes (16–32) often work well.
- Computer Vision: Unfreeze convolutional layers cautiously and use data augmentation to combat overfitting.
Tools to Simplify the Process
- Hugging Face Trainer: Built-in hyperparameter search with optuna or ray.
- PyTorch Lightning: Integrates with Optuna/Hyperopt for automated tuning.
- Keras Tuner: For Keras-based workflows.
Thank you very much, I will try that next!