I have a question regarding LoRA. I understand that the original weights are frozen, and that new weights will be trained and added to the frozen ones.
But instead of training W = [w_ij], an n×p matrix, one trains B and A, of dimensions n×r and r×p respectively.
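To fix notation, here is a small sketch of what I mean (the sizes n, p, r below are just illustrative, and the zero init of B is only one common choice):

```python
import numpy as np

n, p, r = 1024, 768, 8        # made-up sizes, r << min(n, p)

W = np.random.randn(n, p)     # frozen pre-trained weight, never updated
B = np.zeros((n, r))          # trainable (commonly zero-initialized)
A = np.random.randn(r, p)     # trainable

x = np.random.randn(p)
h = W @ x + B @ (A @ x)       # adapted forward pass: (W + B A) x

print(W.size)                 # n*p = 786432 parameters if we trained W itself
print(B.size + A.size)        # n*r + r*p = 14336 trainable parameters with LoRA
```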
I feel there is an implicit reasoning like "Training(W) = Training(B·A) ~ Training(B) * Training(A)".
I’m writing “~” as I understand it’s not strictly equal (hence the importance of not choosing r too low), but I still don’t understand this reasoning (which can be true, but only under very strong hypotheses).
Where am I mistaken?
Thanks for your help!
PS: I have the same feeling about “Training(frozen + W) ~ Training(frozen) + Training(W)”, although I’m more inclined to accept that one since it’s a sum. Although training a model doesn’t seem very linear to me either.
If I understand the question correctly: LoRA assumes that the adaptation to the pre-trained weights (ΔW) can be represented by a low-rank matrix. By the Eckart–Young theorem, the best rank-r approximation of a matrix (in Frobenius norm) is its truncated SVD. LoRA implicitly learns such an approximation through gradient descent on B and A.
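As a quick illustration of that Eckart–Young point (the ΔW below is purely synthetic, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 256, 128

# Made-up "adaptation" matrix: close to rank 16, plus a little noise
delta_W = rng.standard_normal((n, 16)) @ rng.standard_normal((16, p)) \
          + 0.01 * rng.standard_normal((n, p))

U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)

for r in (2, 8, 16, 32):
    approx = (U[:, :r] * s[:r]) @ Vt[:r, :]   # best rank-r approximation (Eckart-Young)
    rel_err = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
    print(r, round(rel_err, 4))               # error drops sharply once r reaches the true rank
```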
Why does this work? Training is not linear, but LoRA doesn’t assume linearity. Instead, it restricts the search space for ΔW to low-rank matrices, reducing parameters while preserving expressive power via the product BA. The core assumption is that ΔW is approximately low-rank. This is a trade-off: a small r saves memory/compute but might underfit, while a larger r improves the approximation at the cost of efficiency. LoRA works because low-rank updates are often sufficient for adaptation, and the product BA efficiently parameterizes these updates. The approximation quality depends on r, but the non-linear training process (gradient descent on B and A jointly) compensates for the reduced rank. You’re not training W; you’re training a compressed representation of its adaptation.
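To make that last point concrete, here is a rough sketch of a LoRA-style linear layer (simplified: no dropout, ad-hoc init, the class name and the alpha/r scaling are just typical choices, not the actual peft implementation). Note that gradient descent updates B and A jointly through the product BA; nothing like Training(B) * Training(A) is ever computed:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, n, p, r, alpha=16):
        super().__init__()
        self.base = nn.Linear(p, n, bias=False)          # pre-trained W, frozen
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, p) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(n, r))         # trainable, zero init => BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # (W + scaling * B A) x, with W never updated
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(n=32, p=64, r=4)
loss = layer(torch.randn(8, 64)).pow(2).mean()
loss.backward()
print(layer.base.weight.grad)   # None: W is frozen
print(layer.B.grad.shape)       # torch.Size([32, 4]): gradients reach B (and A) through the product BA
```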
LoRA ≠ training W directly: it trains a low-rank surrogate for ΔW, not W itself.