Hi Everyone!
I was reading the LoRA paper and the only thing I don’t understand is Section 4.1, where the weight update is scaled by alpha/r, with alpha a constant in r. The authors say alpha is set to the first r tried, and, if I understand correctly, that this makes it unnecessary to tune alpha the way one would tune a learning rate. I would really appreciate it if someone could explain this concept to me. To start with, I don’t understand why the weight update needs to be scaled by a constant at all, since all the weights of the update are already optimized in the fine-tuning process.
I also wanted to understand why A is initialized randomly and B to zero. Would it make a difference if it were the other way around (A zero, B random)? Also, what would go wrong if both were set to zero?
Best!
Jan
In Section 4.1 of the LoRA paper, the weight update ∆Wx is scaled by alpha/r, where alpha is a constant in r. When optimizing with Adam, tuning alpha has roughly the same effect as tuning the learning rate (assuming the initialization is scaled appropriately), so instead of treating alpha as another hyperparameter, the authors simply fix it to the first r they try. The point of the alpha/r scaling is that it keeps the magnitude of the update roughly comparable as r changes, so the learning rate does not have to be retuned for every new rank.
Regarding the initialization of A and B: B is initialized to zero so that the update BA is zero at the start of training, which means the adapted model begins exactly at the pretrained weights. A is initialized with random Gaussian values so that gradients can flow: the gradient with respect to B is proportional to Ax, which is nonzero when A is random. Swapping the roles (A zero, B random) would also give BA = 0 at initialization and should work essentially the same way. However, if both A and B were initialized to zero, the adapter could not learn anything: the gradient with respect to A is proportional to B and the gradient with respect to B is proportional to Ax, so both gradients start at zero and stay there.
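To make this concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer (the class name, sizes, and the 0.01 init scale are my own illustrative choices, not from the paper). Because B starts at zero, the adapter contributes nothing at initialization and the layer behaves exactly like the frozen pretrained one:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: y = Wx + (alpha/r) * BAx."""
    def __init__(self, in_features, out_features, r=8, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # A: random Gaussian, B: zeros -> BA = 0, so training starts at the
        # pretrained solution (the 0.01 scale is an arbitrary choice here)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r  # the alpha/r factor from Section 4.1

    def forward(self, x):
        # x @ A^T @ B^T is the batched form of (BA)x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 16)
x = torch.randn(2, 16)
# At initialization the adapter adds nothing: output == frozen base output.
assert torch.allclose(layer(x), layer.base(x))
```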
Thank you @Atharva_Divekar!
However, I still need more clarity
Usually, the learning rate has to be tuned during training. Here, however, they say they just set alpha to the first r tried, and this is the part I don’t understand. What is the process there, exactly?
The authors write:
“We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set α to the first r we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary r.”
I see at least a couple of ways of interpreting this:
-) They tune alpha with Adam once, when they use r=1, and for bigger r’s they use alpha/r
-) They set alpha to 1 and then for bigger r’s they use alpha/r
Is either of these correct?
Based on the text you provided, the authors scale the weight update by α/r (not α·r), where α is a constant in r, and they simply set α to the first r they try and do not tune it afterwards. This scaling reduces the need to retune hyperparameters when they vary r. So your second interpretation is essentially right, with one caveat: α equals 1 only if the first rank they happen to try is r=1. In general, α is pinned to whatever the first r was, and for every other rank the update is scaled by α/r.
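To make the arithmetic concrete, here is a tiny sketch (assuming, purely for illustration, that the first rank tried is 8):

```python
# Toy illustration of the alpha/r scaling (values are illustrative only).
first_r = 8        # suppose the first rank the authors try is 8
alpha = first_r    # "set alpha to the first r we try" -> alpha = 8, never retuned

for r in [4, 8, 16, 64]:
    scale = alpha / r
    print(f"r={r:3d}  scale = alpha/r = {scale:.3f}")

# r=  4  scale = alpha/r = 2.000
# r=  8  scale = alpha/r = 1.000
# r= 16  scale = alpha/r = 0.500
# r= 64  scale = alpha/r = 0.125
```

One common reading of why this helps: BA is a sum of r rank-one terms, so the magnitude of the update tends to grow with r, and dividing by r counteracts that. The learning rate that worked at the first r then keeps working as r changes.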