Inquiry Regarding an Out-of-Memory Issue During LoRA Fine-Tuning

I am a student currently training the LLAMA-4-Scout-17B-16E-Instruct model with LoRA on an H100 GPU with 80 GB of VRAM (on Lambda Labs). However, I keep hitting an out-of-memory error during training. I understand this may fall slightly outside the scope of the course, but despite extensive research and reading through various community discussions, I have not been able to resolve the issue.

Here is a brief outline of my setup:

Hardware: H100 (80GB VRAM)

Model: LLAMA-4-Scout-17B-16E-Instruct (downloaded from unsloth on Hugging Face)

Training Method: LoRA

Error: CUDA out of memory

Code snippet:
import torch
from transformers import AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from accelerate import dispatch_model
from accelerate import Accelerator
from accelerate.utils import get_balanced_memory, infer_auto_device_map
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

model_path = "/home/ubuntu/llama4"
dataset_path = "llama_nc_instruction_train.jsonl"
output_dir = "./merged_llama4_nccode"

print(":brain: loading tokenizer…")
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(":package: loading model… (using safetensors)")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

print(":wrench: applying LoRA settings…")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,  # some people use 8
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)

print(":page_facing_up: loading data…")
dataset = load_dataset("json", data_files=dataset_path, split="train")

def tokenize(example):
    tokenized_inputs = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=4196
    )
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

print(":bullseye: setting up Trainer…")
training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # some people use 64
    gradient_accumulation_steps=512,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

print(":rocket: training…")
trainer.train()

print(":floppy_disk: merging LoRA weights…")
model = model.merge_and_unload()

print(":package: saving model to:", output_dir)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(":white_check_mark: finished!")

and this is the error:

:brain: loading tokenizer…
:package: loading model… (using safetensors)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 50/50 [00:00<00:00, 457.56it/s]
:wrench: applying LoRA settings…
:page_facing_up: loading data…
:bullseye: setting up Trainer…
/home/ubuntu/CNC代碼定義訓練黨TEST.py:68: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(
Traceback (most recent call last):
  File "/home/ubuntu/CNC代碼定義訓練黨TEST.py", line 68, in <module>
    trainer = Trainer(
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 614, in __init__
    self._move_model_to_device(model, args.device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/transformers/trainer.py", line 901, in _move_model_to_device
    model = model.to(device)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/llama_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 79.19 GiB of which 359.06 MiB is free. Including non-PyTorch memory, this process has 78.83 GiB memory in use. Of the allocated memory 78.38 GiB is allocated by PyTorch, and 8.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.7 documentation)

Would anyone kindly offer any suggestions or best practices to address this issue? Are there specific parameters I should consider adjusting (e.g., batch size, gradient checkpointing, LoRA rank, etc.) to make it fit within the memory constraints?

Welcome to the community @bensonbbn, and thanks for the detailed breakdown of your setup — that really helps.

You're fine-tuning LLAMA-4-Scout-17B-16E-Instruct with LoRA on a single H100 (80 GB VRAM), and the traceback shows the CUDA out-of-memory error at the point where the Trainer moves the model onto the GPU. Keep in mind that Scout is a mixture-of-experts model: 17B parameters are active per token, but the total across its 16 experts is roughly 109B, so the full bf16 weights alone are well beyond 80 GB. Hitting OOM here is expected unless you quantize, shard, or offload the model.

Here are some practical suggestions and code-level adjustments to help you move forward:


Key Fixes to Try

1. Enable Gradient Checkpointing

This reduces memory usage by recomputing activations during the backward pass:

model.gradient_checkpointing_enable()

Place this after applying get_peft_model(...) and before passing the model to the Trainer.
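A minimal sketch of where this goes, using the variable names from your script; with a PEFT-wrapped model it usually also helps to disable the KV cache and mark the inputs as requiring grads so checkpointing plays nicely with the frozen base weights:

model = get_peft_model(model, lora_config)
model.config.use_cache = False           # the generation cache is useless during training and wastes memory
model.enable_input_require_grads()       # lets checkpointing backprop through the frozen base layers
model.gradient_checkpointing_enable()    # recompute activations in the backward pass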

2. Use 4-bit Quantization with bitsandbytes

Loading the frozen base weights in 4-bit (QLoRA-style) cuts their footprint to roughly a quarter of bf16, which is usually the single biggest saving.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto"
)

Make sure bitsandbytes is installed: pip install bitsandbytes.
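If you go the 4-bit route, it is also worth running the model through PEFT's k-bit preparation helper before attaching the adapters; a sketch, assuming the lora_config from your script:

from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)   # casts norms/embeddings to fp32, enables input grads
model = get_peft_model(model, lora_config)       # attach the LoRA adapters on top of the 4-bit base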


3. Reduce max_length in Tokenizer

You're using max_length=4196 (possibly a typo for 4096), which is long for a first run; shorter sequences sharply reduce activation memory:

max_length = 2048  # or even 1024, depending on how long your samples actually are
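Applied to your tokenize function it would look roughly like this; dropping padding="max_length" and letting the data collator pad each batch dynamically also avoids training on thousands of pad tokens:

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=2048,   # shorter sequences -> much smaller activation memory
    )                      # no padding here: DataCollatorForLanguageModeling pads per batch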

4. Rethink gradient_accumulation_steps

You have:

gradient_accumulation_steps=512

Accumulation itself adds little memory (gradients are summed in place), but 512 steps at a per-device batch size of 1 gives an effective batch of 512, which is far larger than you need and makes each optimizer step very slow. Something like:

gradient_accumulation_steps=64  # or 128

is usually plenty. Per-step memory is driven mainly by per_device_train_batch_size and the sequence length, so keep the batch size at 1 and shorten the sequences first.
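Putting the training-side changes together, here is a hedged sketch of the TrainingArguments (the paged 8-bit optimizer needs bitsandbytes installed, and the exact numbers are starting points rather than tuned values):

training_args = TrainingArguments(
    output_dir="./lora_tmp",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,   # effective batch of 64 instead of 512
    gradient_checkpointing=True,      # same effect as calling it on the model
    bf16=True,
    optim="paged_adamw_8bit",         # keeps optimizer states in 8-bit, paged out under memory pressure
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
)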


5. Use accelerate to Dispatch Model Efficiently

Instead of letting the Trainer move the whole model onto one GPU with .to(device), you can let accelerate split it and offload part of it:

from accelerate import infer_auto_device_map, dispatch_model, get_balanced_memory

max_memory = get_balanced_memory(model, no_split_module_classes=["LlamaDecoderLayer"])
device_map = infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
model = dispatch_model(model, device_map=device_map)
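(Double-check the no_split_module_classes value: "LlamaDecoderLayer" is the Llama 2/3 class, and Llama 4 uses a different decoder-layer class name.) In practice it is often simpler to let from_pretrained do the placement itself with a max_memory cap, spilling whatever does not fit onto CPU RAM; a sketch, with placeholder memory figures you would adapt to your machine:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",                            # let accelerate place the layers
    max_memory={0: "75GiB", "cpu": "200GiB"},     # cap GPU usage, offload the rest to CPU RAM
    trust_remote_code=True,
)

When the model carries a device map like this, the Trainer should skip the model.to(device) call that is failing in your traceback.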

6. Adjust LoRA Configuration

Consider using a lower rank or fewer modules to save memory:

lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
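After wrapping the model you can sanity-check that only a tiny fraction of parameters is trainable; the frozen base weights, not the LoRA adapters, are what dominate memory here:

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # prints trainable vs. total parameter counts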

7. Set CUDA Alloc Environment Variable

This helps with fragmentation; note that it only takes effect if it is set before the first CUDA allocation in the process:

export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

Or in Python:

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.6,max_split_size_mb:128"

Finally

Use nvidia-smi to monitor VRAM usage before and after model loading. It’s a great way to catch bottlenecks early.
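From inside the script you can also snapshot the allocator around the expensive steps; a small sketch using standard torch.cuda calls:

def log_vram(tag):
    # allocated = live tensors; reserved = what the caching allocator holds from the driver
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

log_vram("after model load")
# ... build the Trainer, train ...
log_vram("after training")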

Let us know how it goes — happy to help further if needed.


Dear Sir,

Thank you so much for your clear instructions and for taking the time to correct my code!

I followed your suggestions, and I’m happy to report that it worked perfectly. Your guidance truly made a difference, and I sincerely appreciate your support and kindness.

Thank you once again for your generous help.

Best regards,

You're very welcome. I'm glad you were able to make it work.