Hi Everyone!
I was reading the LoRA paper and the only thing I don’t understand is Section 4.1, where the weight update is scaled by alpha/r, with alpha a constant in r. The authors say alpha is set to the first r tried, and, if I understand correctly, that this makes it unnecessary to tune alpha the way one would tune a learning rate. I would really appreciate it if someone could explain this concept to me. To start with, I don’t understand why the weight update needs to be scaled by a constant at all, since all the weights of the update are already optimized in the fine-tuning process.
I also wanted to understand why A is initialized randomly and B to zero. Would it make a difference if it were the other way around (A zero, B random)? Also, what would go wrong if both were set to zero?
Best!
Jan
In Section 4.1 of the LoRA paper, the weight update ∆Wx is scaled by alpha/r, where alpha is a constant in r. When optimizing with Adam, tuning alpha has roughly the same effect as tuning the learning rate (assuming the initialization is scaled appropriately), so instead of treating alpha as another hyperparameter, the authors simply fix it to the first r they try. The point of the alpha/r scaling is that it keeps the magnitude of the update roughly comparable as r changes, so the learning rate does not have to be retuned for every new rank.
Regarding the initialization of A and B: B is initialized to zero so that the update BA is zero at the start of training, which means the adapted model begins exactly at the pretrained weights. A is initialized with random Gaussian values so that gradients can flow: the gradient with respect to B is proportional to Ax, which is nonzero when A is random. Swapping the roles (A zero, B random) would also give BA = 0 at initialization and should work essentially the same way. However, if both A and B were initialized to zero, the adapter could not learn anything: the gradient with respect to A is proportional to B and the gradient with respect to B is proportional to Ax, so both gradients start at zero and stay there.
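To make this concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer (the class name, sizes, and the 0.01 init scale are my own illustrative choices, not from the paper). Because B starts at zero, the adapter contributes nothing at initialization and the layer behaves exactly like the frozen pretrained one:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: y = Wx + (alpha/r) * BAx."""
    def __init__(self, in_features, out_features, r=8, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # A: random Gaussian, B: zeros -> BA = 0, so training starts at the
        # pretrained solution (the 0.01 scale is an arbitrary choice here)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r  # the alpha/r factor from Section 4.1

    def forward(self, x):
        # x @ A^T @ B^T is the batched form of (BA)x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 16)
x = torch.randn(2, 16)
# At initialization the adapter adds nothing: output == frozen base output.
assert torch.allclose(layer(x), layer.base(x))
```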
Thank you @Atharva_Divekar!
However, I still need more clarity
Usually, the learning rate has to be tuned during training. Here, however, they say they just set alpha to the first r tried, and this is the part I don’t understand. What is the process there, exactly?
The authors write:
“We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set α to the first r we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary r.”
I see at least a couple of ways of interpreting this:
-) They tune alpha with Adam once, when they use r=1, and for bigger r’s they use alpha/r
-) They set alpha to 1 and then for bigger r’s they use alpha/r
Is either of these correct?
Based on the text you provided, the authors scale the weight update by α/r (not α·r), where α is a constant in r, and they simply set α to the first r they try and do not tune it afterwards. This scaling reduces the need to retune hyperparameters when they vary r. So your second interpretation is essentially right, with one caveat: α equals 1 only if the first rank they happen to try is r=1. In general, α is pinned to whatever the first r was, and for every other rank the update is scaled by α/r.
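To make the arithmetic concrete, here is a tiny sketch (assuming, purely for illustration, that the first rank tried is 8):

```python
# Toy illustration of the alpha/r scaling (values are illustrative only).
first_r = 8        # suppose the first rank the authors try is 8
alpha = first_r    # "set alpha to the first r we try" -> alpha = 8, never retuned

for r in [4, 8, 16, 64]:
    scale = alpha / r
    print(f"r={r:3d}  scale = alpha/r = {scale:.3f}")

# r=  4  scale = alpha/r = 2.000
# r=  8  scale = alpha/r = 1.000
# r= 16  scale = alpha/r = 0.500
# r= 64  scale = alpha/r = 0.125
```

One common reading of why this helps: BA is a sum of r rank-one terms, so the magnitude of the update tends to grow with r, and dividing by r counteracts that. The learning rate that worked at the first r then keeps working as r changes.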