CUDA out of memory error during PEFT fine-tuning

I’m trying to fine-tune the weights of a FLAN-T5 model downloaded from Hugging Face, using PEFT and specifically LoRA, with the Python 3 code below. I’m running this on Ubuntu Server 18.04 LTS with an NVIDIA GPU that has 8 GB of RAM. I’m getting a “CUDA out of memory” error; the full error message is below. I’ve tried adding:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

but I’m still getting the same error. Can anyone see what the issue might be and suggest how to solve it?
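One thing I can’t explain: if I’m reading the error correctly, the figures it reports leave most of the card’s memory unaccounted for, since only about 1 GiB is reserved by PyTorch on an 8 GB card. A quick sanity check of the arithmetic (the numbers are copied from the error message below; the interpretation is my guess):

```python
# Numbers as reported in the OOM message (see full error below):
total_gib = 7.79       # "GPU 0; 7.79 GiB total capacity"
reserved_gib = 1.11    # "1.11 GiB reserved in total by PyTorch"
free_mib = 17.31       # "17.31 MiB free"

# Memory that is neither reserved by this process nor free --
# presumably held by other processes or the display server:
unaccounted_gib = total_gib - reserved_gib - free_mib / 1024
print(f"{unaccounted_gib:.2f} GiB unaccounted for")  # -> 6.66 GiB unaccounted for
```

which makes me wonder whether the problem is fragmentation at all, or something else occupying the GPU.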


from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

# added to deal with memory allocation error
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# ### Load Dataset and LLM

huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)


# Load the pre-trained FLAN-T5 model and its tokenizer directly from Hugging Face, using the small version of FLAN-T5 (`google/flan-t5-small`). Setting `torch_dtype=torch.bfloat16` specifies the data type to be used by this model.

# In[17]:


model_name = "google/flan-t5-small"

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

# Baseline summary from the original (not yet fine-tuned) model
inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(inputs["input_ids"], max_new_tokens=200)[0],
    skip_special_tokens=True
)

dash_line = '-' * 100

# updated 11/1/23 to ensure using gpu
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example
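For what it’s worth, `padding="max_length"` pads every example up to the tokenizer’s model maximum (512 tokens for T5), so every batch is as wide as it can possibly be. A rough back-of-the-envelope sketch of what one padded batch costs in activation memory (the `d_model=512` and bf16 sizes are assumptions for flan-t5-small):

```python
# Rough per-batch cost of a single hidden-state tensor,
# assuming flan-t5-small (d_model = 512) running in bf16.
batch_size = 4
seq_len = 512        # padding="max_length" pads everything to the T5 max
d_model = 512
bytes_per_elem = 2   # bfloat16

one_tensor_mib = batch_size * seq_len * d_model * bytes_per_elem / 2**20
print(f"{one_tensor_mib:.0f} MiB per hidden-state tensor")  # -> 2 MiB per hidden-state tensor
```

Each transformer layer keeps several such tensors for the backward pass, plus attention matrices of shape batch × heads × seq × seq, so passing a smaller explicit `max_length` to the tokenizer would shrink activations roughly linearly (and attention quadratically).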

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])

# To save some time subsample the dataset:

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

# Check the shapes of all three parts of the dataset:

# In[7]:

# print(f"Shapes of the datasets:")
# print(f"Training: {tokenized_datasets['train'].shape}")
# print(f"Validation: {tokenized_datasets['validation'].shape}")
# print(f"Test: {tokenized_datasets['test'].shape}")
# print(tokenized_datasets)

# The output dataset is ready for fine-tuning.

# ### Perform Parameter Efficient Fine-Tuning (PEFT)
# - use LoRA

# ### Setup the PEFT/LoRA model for Fine-Tuning
# - set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter
# - freezing the underlying LLM and only training the adapter
# - LoRA configuration below
# - Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained
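To get a feel for what `r` controls: each adapted weight matrix W (d_out × d_in) gains two trainable factors B (d_out × r) and A (r × d_in), i.e. r·(d_in + d_out) extra parameters per matrix. A rough count for flan-t5-small (the dimensions are my assumptions from its published config: d_model = 512, 6 heads × d_kv = 64 → inner dim 384, 8 encoder self-attention blocks, 8 decoder self-attention plus 8 cross-attention blocks):

```python
# LoRA adds B (d_out x r) and A (r x d_in) beside each frozen weight
# matrix, i.e. r * (d_in + d_out) trainable parameters per matrix.
r = 32
d_model = 512            # flan-t5-small hidden size (assumed)
inner = 6 * 64           # num_heads * d_kv = 384 (assumed)
attn_blocks = 8 + 8 + 8  # enc self-attn + dec self-attn + cross-attn (assumed)

params_per_matrix = r * (d_model + inner)
total = params_per_matrix * 2 * attn_blocks  # "q" and "v" in every block
print(total)  # -> 1376256
```

So with r=32 the adapter is on the order of 1.4 M trainable parameters, a small fraction of the frozen base model.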

# In[8]:

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
#     r=4, # Rank
#     lora_alpha=4,
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

# Add LoRA adapter layers/parameters to the original LLM to be trained.

# In[9]:

peft_model = get_peft_model(original_model, lora_config)
# peft_model.print_trainable_parameters()

# Enable gradient checkpointing to trade compute for memory
# (transformers exposes this as gradient_checkpointing_enable()).
# peft_model.gradient_checkpointing_enable()

# ### Train PEFT Adapter
# Define training arguments and create `Trainer` instance.

# In[10]:

output_dir = f'/home/username/stuff/username_storage/LLM/PEFT/train_args/no_log_max_depth_{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
#     auto_find_batch_size=True,
    per_device_train_batch_size=4,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
#     max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

# In[11]:

peft_trainer.train()

And the error:
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 1.10 GiB already allocated; 17.31 MiB free; 1.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/32 [00:00<?, ?it/s]