Fine-tuning LLMs on TPU

Hi all,

I am trying to finetune Llama2-7b on a Google TPU Pod. I am using the Huggingface Trainer for this, but the way the work is distributed looks weird to me and I am not even sure the process is distributed correctly. Has anyone ever tried doing the same? I am seeking advice from someone more experienced with this - I haven’t trained on TPUs before, so this is all new to me.

Thanks!

Finetuning a large model like Llama2-7b on a Google TPU Pod using Huggingface’s Trainer can be complex. Here are a few steps and tips to ensure the process is correctly distributed and to troubleshoot any issues:

  1. Environment Setup:

    • Ensure you have the latest versions of the transformers, datasets, and accelerate libraries, as well as torch_xla (the PyTorch/XLA package that provides TPU support).
    • Set up the TPU environment properly. Follow the instructions for setting up Google Cloud TPUs, including installing the TPU tools and configuring the environment. A quick sanity check for this is sketched right after this list.
  2. Configuration:

    • Use the standard Trainer from Huggingface’s transformers library; it supports TPUs through torch_xla, and there is no separate TPU-specific trainer class. Launch the script with the xla_spawn.py launcher from the transformers examples (or with accelerate launch) so one process is started per TPU core.

    • Ensure your training_args are set up to use TPUs. Here’s an example configuration:

      from transformers import Trainer, TrainingArguments
      
      training_args = TrainingArguments(
          output_dir="./results",
          per_device_train_batch_size=8,
          per_device_eval_batch_size=8,
          tpu_num_cores=8,  # Number of TPU cores
          save_steps=10_000,
          save_total_limit=2,
          learning_rate=5e-5,
          num_train_epochs=3,
          logging_steps=500,
          evaluation_strategy="steps",
          eval_steps=500,
      )
      
  3. Model and Data Preparation:

    • Load the Llama2-7b model and tokenizer from Huggingface:

      from transformers import AutoModelForCausalLM, AutoTokenizer
      
      # Hub id of the base model (access to meta-llama checkpoints must be requested on the Hub)
      model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
      tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
      tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
      
    • Prepare your dataset using the datasets library and tokenize it properly:

      from datasets import load_dataset
      
      dataset = load_dataset("your_dataset_name")
      
      def tokenize_function(examples):
          return tokenizer(examples["text"], padding="max_length", truncation=True)
      
      tokenized_datasets = dataset.map(tokenize_function, batched=True)
      
  4. Trainer Initialization:

    • Initialize the Trainer with your model, training arguments, and dataset:

      from transformers import Trainer, DataCollatorForLanguageModeling
      
      # Causal-LM finetuning needs labels in each batch; with mlm=False this
      # collator simply copies input_ids into labels.
      data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
      
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=tokenized_datasets["train"],
          eval_dataset=tokenized_datasets["test"],
          data_collator=data_collator,
      )
      
  5. Training:

    • Start the training process:

      trainer.train()
      
  6. Monitoring and Debugging:

    • Monitor the TPU utilization using Google Cloud’s monitoring tools to ensure the workload is distributed across all TPU cores.
    • Check the logs for any errors or warnings that might indicate issues with data loading, TPU utilization, or the training process. A snippet for dumping PyTorch/XLA’s own metrics report is sketched after this list.
    • Use the logging_steps parameter to log progress at regular intervals and ensure the training is proceeding as expected.
  7. Validation:

    • Run validation steps during training to ensure the model is learning correctly and the training process is correctly utilizing the TPUs.
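
For step 1, a quick sanity check (my own sketch, not an official recipe) is to confirm that torch_xla can see the TPU and, under a multi-process launch, how many workers are participating:

import torch_xla.core.xla_model as xm

# Should print an XLA device such as xla:0; under a multi-core launch the
# world size is the number of participating processes and the ordinal is
# this process's index.
print("device:", xm.xla_device())
print("world size:", xm.xrt_world_size())
print("ordinal:", xm.get_ordinal())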

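For step 6, PyTorch/XLA also ships a metrics report that helps spot problems such as repeated graph recompilations or operators falling back to the CPU, which are common causes of poor TPU utilization; a minimal sketch:

import torch_xla.debug.metrics as met

# Dumps XLA compile/execute counters; many CompileTime entries or aten::*
# counters (CPU fallbacks) usually explain slow or under-utilized TPUs.
print(met.metrics_report())
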
If you still encounter issues or the distribution doesn’t look right, you may need to delve deeper into TPU-specific optimizations or consider using the accelerate library from Huggingface, which provides more fine-grained control over distributed training.

Here’s an example snippet for using accelerate:

from accelerate import Accelerator

accelerator = Accelerator()

# Assumes `model`, `optimizer`, `train_dataloader`, `eval_dataloader`, and
# `num_epochs` are already defined (see the sketch below). prepare() moves
# everything to the XLA device and wraps the dataloaders for distributed loading.
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # use this instead of loss.backward() so accelerate handles device specifics
        optimizer.step()
        optimizer.zero_grad()

Ensure you adapt the script to match your specific model and dataset.
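
For completeness, here is a rough sketch (my assumptions, not part of the original snippet) of the objects the loop above expects - dataloaders, optimizer, and epoch count - reusing the tokenized_datasets and tokenizer from the earlier steps; the split names, batch size, and learning rate are placeholders. On TPUs you would typically start such a script on each host with accelerate launch after running accelerate config.

from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import DataCollatorForLanguageModeling

# Causal-LM collator: copies input_ids into labels so the model returns a loss.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Placeholder split names and hyperparameters - adapt to your dataset.
train_dataloader = DataLoader(
    tokenized_datasets["train"].remove_columns(["text"]),
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    tokenized_datasets["test"].remove_columns(["text"]),
    batch_size=8,
    collate_fn=data_collator,
)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3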

2 Likes

Thanks for the insightful reply! I am trying to migrate training code from GPU to TPU and I use the HuggingFace Trainer as you mentioned here. All looks good, except that I see 8 logged runs in W&B (1 run per worker/host) and I am not sure whether I am running 1 model training across all 8 workers at a time, or the script is being run separately on each worker (training 8 different models). This is usually not the case for GPU, so I am not sure whether it is normal or not.

When training on a TPU Pod, seeing 8 logged runs in Weights & Biases (W&B) is expected because of the way the job is distributed: each worker (host) runs the training script and starts its own W&B run. However, all of these runs should belong to one training job for a single model, not eight separate models. Here’s how to verify that everything is set up correctly:

  1. HuggingFace Trainer Configuration:
    • Ensure the TrainingArguments are configured correctly for TPU usage.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    tpu_num_cores=8,                 # Number of TPU cores to use
    report_to="wandb",               # Reporting to W&B
    run_name="my_tpu_training_run"   # Run name in W&B
)

trainer = Trainer(
    model=model,                         # The model to train
    args=training_args,                  # Training arguments
    train_dataset=train_dataset,         # Training dataset
    eval_dataset=eval_dataset            # Evaluation dataset
)
  2. Initialize TPU correctly:
    • Ensure that you are using torch_xla to initialize the TPU.
import torch_xla.core.xla_model as xm

def _mp_fn(rank, flags):
    # Training code here

    trainer.train()

if __name__ == '__main__':
    FLAGS = {}
    xm.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')
  3. Weights & Biases Configuration:
    • Make sure W&B is configured to handle distributed training correctly. In each worker, initialize W&B with the same group (plus a run name of your choice) so the per-worker runs are grouped together in the dashboard; see the sketch after this list.
import wandb

wandb.init(project="my_project", entity="my_entity", group="my_tpu_training_run", name="my_tpu_training_run", sync_tensorboard=True)
  4. Model Synchronization:
    • Ensure that the model weights are synchronized across all TPU cores. torch_xla typically handles this, but you can log model weights from each worker to verify.
import torch_xla.core.xla_model as xm

def log_model_state(model):
    # 'some_layer.weight' is a placeholder key - pick any real parameter name;
    # matching values across ordinals indicate the weights are in sync.
    print(xm.get_ordinal(), model.state_dict()['some_layer.weight'][:5])

log_model_state(model)
  5. Monitoring:
    • Check your W&B dashboard to confirm the per-worker runs are grouped together as one job rather than appearing as eight unrelated runs.
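
As a concrete example of the grouping in step 3 (my own sketch, with placeholder project and entity names), assuming you call wandb.init yourself before creating the Trainer:

import torch_xla.core.xla_model as xm
import wandb

# Use the same `group` on every worker so the 8 runs show up as one job in
# the W&B UI; the per-worker name just makes individual runs easy to tell apart.
wandb.init(
    project="my_project",
    entity="my_entity",
    group="my_tpu_training_run",
    name=f"worker-{xm.get_ordinal()}",
)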

By following these steps, you should be able to confirm that you are indeed training a single model across all TPU workers, and the multiple logged runs in W&B are just a result of distributed logging.

1 Like

Thank you very much, very good tips :+1:

1 Like

Hey, thanks for your reply—it helped so much!

I found a small error in the second step, Initialize TPU correctly.

The correct code is:

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(rank, flags):
    # Training code here

    trainer.train()

if __name__ == '__main__':
    FLAGS = {}
    xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')
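
One extra check that helped me (optional, my own addition): printing the world size and ordinal inside _mp_fn before trainer.train() makes it easy to confirm that all cores joined the same job:

import torch_xla.core.xla_model as xm

# Inside _mp_fn, before trainer.train(): each process should report the same
# world size and a unique ordinal (0 through world size - 1).
print("world size:", xm.xrt_world_size(), "ordinal:", xm.get_ordinal())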

Thanks again!