Fine-tuning LLMs on TPU

Hi all,

I am trying to finetune Llama2-7b on a Google TPU Pod. I am using the Huggingface Trainer for this, but the way the work is distributed looks weird to me and I am not even sure the process is distributed correctly. Has anyone ever tried doing the same? I am seeking advice from someone more experienced with this - I haven’t trained on TPUs before, so this is all new to me.

Thanks!

Finetuning a large model like Llama2-7b on a Google TPU Pod using Huggingface’s Trainer can be complex. Here are a few steps and tips to ensure the process is correctly distributed and to troubleshoot any issues:

  1. Environment Setup:

    • Ensure you have the latest versions of the transformers, datasets, and accelerate libraries, as well as torch_xla (the PyTorch/XLA package that provides TPU support).
    • Set up the TPU environment properly. Follow the instructions for setting up Google Cloud TPUs, including installing the TPU tools and configuring the environment. A quick sanity check for this is sketched right after this list.
  2. Configuration:

    • Use the standard Trainer from Huggingface’s transformers library; it supports TPUs through torch_xla, and there is no separate TPU-specific trainer class. Launch the script with the xla_spawn.py launcher from the transformers examples (or with accelerate launch) so one process is started per TPU core.

    • Ensure your training_args are set up to use TPUs. Here’s an example configuration:

      from transformers import Trainer, TrainingArguments
      
      training_args = TrainingArguments(
          output_dir="./results",
          per_device_train_batch_size=8,
          per_device_eval_batch_size=8,
          tpu_num_cores=8,  # Number of TPU cores
          save_steps=10_000,
          save_total_limit=2,
          learning_rate=5e-5,
          num_train_epochs=3,
          logging_steps=500,
          evaluation_strategy="steps",
          eval_steps=500,
      )
      
  3. Model and Data Preparation:

    • Load the Llama2-7b model and tokenizer from Huggingface:

      from transformers import AutoModelForCausalLM, AutoTokenizer
      
      # Hub id of the base model (access to meta-llama checkpoints must be requested on the Hub)
      model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
      tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
      tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
      
    • Prepare your dataset using the datasets library and tokenize it properly:

      from datasets import load_dataset
      
      dataset = load_dataset("your_dataset_name")
      
      def tokenize_function(examples):
          return tokenizer(examples["text"], padding="max_length", truncation=True)
      
      tokenized_datasets = dataset.map(tokenize_function, batched=True)
      
  4. Trainer Initialization:

    • Initialize the Trainer with your model, training arguments, and dataset:

      from transformers import Trainer, DataCollatorForLanguageModeling
      
      # Causal-LM finetuning needs labels in each batch; with mlm=False this
      # collator simply copies input_ids into labels.
      data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
      
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=tokenized_datasets["train"],
          eval_dataset=tokenized_datasets["test"],
          data_collator=data_collator,
      )
      
  5. Training:

    • Start the training process:

      trainer.train()
      
  6. Monitoring and Debugging:

    • Monitor the TPU utilization using Google Cloud’s monitoring tools to ensure the workload is distributed across all TPU cores.
    • Check the logs for any errors or warnings that might indicate issues with data loading, TPU utilization, or the training process. A snippet for dumping PyTorch/XLA’s own metrics report is sketched after this list.
    • Use the logging_steps parameter to log progress at regular intervals and ensure the training is proceeding as expected.
  7. Validation:

    • Run validation steps during training to ensure the model is learning correctly and the training process is correctly utilizing the TPUs.
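
For step 1, a quick sanity check (my own sketch, not an official recipe) is to confirm that torch_xla can see the TPU and, under a multi-process launch, how many workers are participating:

import torch_xla.core.xla_model as xm

# Should print an XLA device such as xla:0; under a multi-core launch the
# world size is the number of participating processes and the ordinal is
# this process's index.
print("device:", xm.xla_device())
print("world size:", xm.xrt_world_size())
print("ordinal:", xm.get_ordinal())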

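For step 6, PyTorch/XLA also ships a metrics report that helps spot problems such as repeated graph recompilations or operators falling back to the CPU, which are common causes of poor TPU utilization; a minimal sketch:

import torch_xla.debug.metrics as met

# Dumps XLA compile/execute counters; many CompileTime entries or aten::*
# counters (CPU fallbacks) usually explain slow or under-utilized TPUs.
print(met.metrics_report())
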
If you still encounter issues or the distribution doesn’t look right, you may need to delve deeper into TPU-specific optimizations or consider using the accelerate library from Huggingface, which provides more fine-grained control over distributed training.

Here’s an example snippet for using accelerate:

from accelerate import Accelerator

accelerator = Accelerator()

# Assumes `model`, `optimizer`, `train_dataloader`, `eval_dataloader`, and
# `num_epochs` are already defined (see the sketch below). prepare() moves
# everything to the XLA device and wraps the dataloaders for distributed loading.
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # use this instead of loss.backward() so accelerate handles device specifics
        optimizer.step()
        optimizer.zero_grad()

Ensure you adapt the script to match your specific model and dataset.
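
For completeness, here is a rough sketch (my assumptions, not part of the original snippet) of the objects the loop above expects - dataloaders, optimizer, and epoch count - reusing the tokenized_datasets and tokenizer from the earlier steps; the split names, batch size, and learning rate are placeholders. On TPUs you would typically start such a script on each host with accelerate launch after running accelerate config.

from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import DataCollatorForLanguageModeling

# Causal-LM collator: copies input_ids into labels so the model returns a loss.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Placeholder split names and hyperparameters - adapt to your dataset.
train_dataloader = DataLoader(
    tokenized_datasets["train"].remove_columns(["text"]),
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    tokenized_datasets["test"].remove_columns(["text"]),
    batch_size=8,
    collate_fn=data_collator,
)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3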

2 Likes

Thanks for the insightful reply! I am trying to migrate training code from GPU to TPU and I use the HuggingFace Trainer as you mentioned here. All looks good, except that I see 8 logged runs in W&B (1 run per worker/host) and I am not sure whether I am running 1 model training across all 8 workers at a time, or the script is being run separately on each worker (training 8 different models). This is usually not the case for GPU, so I am not sure whether it is normal or not.

When training on a TPU Pod, seeing 8 logged runs in Weights & Biases (W&B) is expected because of the way the job is distributed: each worker (host) runs the training script and starts its own W&B run. However, all of these runs should belong to one training job for a single model, not eight separate models. Here’s how to verify that everything is set up correctly:

  1. HuggingFace Trainer Configuration:
    • Ensure the TrainingArguments are configured correctly for TPU usage.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    tpu_num_cores=8,                 # Number of TPU cores to use
    report_to="wandb",               # Reporting to W&B
    run_name="my_tpu_training_run"   # Run name in W&B
)

trainer = Trainer(
    model=model,                         # The model to train
    args=training_args,                  # Training arguments
    train_dataset=train_dataset,         # Training dataset
    eval_dataset=eval_dataset            # Evaluation dataset
)
  2. Initialize TPU correctly:
    • Ensure that you are using torch_xla to initialize the TPU.
import torch_xla.core.xla_model as xm

def _mp_fn(rank, flags):
    # Training code here

    trainer.train()

if __name__ == '__main__':
    FLAGS = {}
    xm.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')
  3. Weights & Biases Configuration:
    • Make sure W&B is configured to handle distributed training correctly. In each worker, initialize W&B with the same group (plus a run name of your choice) so the per-worker runs are grouped together in the dashboard; see the sketch after this list.
import wandb

wandb.init(project="my_project", entity="my_entity", group="my_tpu_training_run", name="my_tpu_training_run", sync_tensorboard=True)
  4. Model Synchronization:
    • Ensure that the model weights are synchronized across all TPU cores. torch_xla typically handles this, but you can log model weights from each worker to verify.
import torch_xla.core.xla_model as xm

def log_model_state(model):
    # 'some_layer.weight' is a placeholder key - pick any real parameter name;
    # matching values across ordinals indicate the weights are in sync.
    print(xm.get_ordinal(), model.state_dict()['some_layer.weight'][:5])

log_model_state(model)
  5. Monitoring:
    • Check your W&B dashboard to confirm the per-worker runs are grouped together as one job rather than appearing as eight unrelated runs.
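
As a concrete example of the grouping in step 3 (my own sketch, with placeholder project and entity names), assuming you call wandb.init yourself before creating the Trainer:

import torch_xla.core.xla_model as xm
import wandb

# Use the same `group` on every worker so the 8 runs show up as one job in
# the W&B UI; the per-worker name just makes individual runs easy to tell apart.
wandb.init(
    project="my_project",
    entity="my_entity",
    group="my_tpu_training_run",
    name=f"worker-{xm.get_ordinal()}",
)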

By following these steps, you should be able to confirm that you are indeed training a single model across all TPU workers, and the multiple logged runs in W&B are just a result of distributed logging.

1 Like

Thank you very much, very good tips :+1:

1 Like

Hey, thanks for your reply—it helped so much!

I found a small error in the second step, Initialize TPU correctly.

The correct code is:

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(rank, flags):
    # Training code here

    trainer.train()

if __name__ == '__main__':
    FLAGS = {}
    xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')
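
One extra check that helped me (optional, my own addition): printing the world size and ordinal inside _mp_fn before trainer.train() makes it easy to confirm that all cores joined the same job:

import torch_xla.core.xla_model as xm

# Inside _mp_fn, before trainer.train(): each process should report the same
# world size and a unique ordinal (0 through world size - 1).
print("world size:", xm.xrt_world_size(), "ordinal:", xm.get_ordinal())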

Thanks again!