LoRA: intuition w.r.t. catastrophic forgetting

PEFT was discussed as one of the options for mitigating the catastrophic forgetting problem.
Intuitively I am completely in sync with the scaling benefits (computational efficiency…) of PEFT.
However, I couldn’t follow the intuition behind how or why LoRA should mitigate the catastrophic forgetting issue. Although we train two low-rank matrices with LoRA, we multiply them to get a matrix of the same shape as the original weight matrix, and then for self-attention we use the original weight matrix + (A * B). Now, A * B doesn’t have any sort of skew properties, so the resulting matrix that we apply at inference potentially has every entry changed. So why should it be any better than fully fine-tuned weights w.r.t. catastrophic forgetting?
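To make the question concrete, here is a toy sketch in plain Python (the matrices, the sizes d = 4 and r = 1, and the helper names are all made up for illustration). It follows the A * B convention used above, with A of shape d x r and B of shape r x d. It shows that the update does change every entry of the weight matrix, while the update itself is still restricted to rank r. It does not by itself show anything about forgetting.

```python
# Toy illustration (assumed values): a rank-1 update A * B touches every
# entry of W, yet the update is constrained to rank <= r.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1
# Stand-in for the frozen pretrained weights (identity, purely for illustration).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
A = [[1.0], [2.0], [3.0], [4.0]]   # d x r, trainable
B = [[0.5, -1.0, 2.0, 1.5]]        # r x d, trainable

delta = matmul(A, B)               # d x d, same shape as W
W_eff = add(W, delta)              # what is actually applied at inference

# Count how many entries differ from the original weights.
changed = sum(W_eff[i][j] != W[i][j] for i in range(d) for j in range(d))

# Rank-1 check: every 2x2 minor of the update vanishes.
minors_zero = all(
    abs(delta[i][j] * delta[k][l] - delta[i][l] * delta[k][j]) < 1e-12
    for i in range(d) for k in range(d) for j in range(d) for l in range(d)
)

lora_params = d * r + r * d        # trainable parameters in A and B
full_params = d * d                # trainable parameters in full fine-tuning

print(changed, minors_zero, lora_params, full_params)
```

So the intuition in the question holds in this toy case: nothing about A * B keeps individual entries of W unchanged. The only structural restriction is the low rank of the update.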

The A * B matrix is an add-on to the model weights; it does not permanently reside in the weights in memory. When you want to use the LLM for another task, you remove the LoRA adapter and the model returns to its original behavior.
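A minimal sketch of that idea in plain Python, assuming the base weights are kept frozen and the adapter is applied on the fly (the function name `effective_weights` and all values are hypothetical, not taken from any library):

```python
import copy

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def effective_weights(W, adapter=None):
    """Return W unchanged, or W + A * B when an adapter (A, B) is supplied.

    W itself is never modified, so dropping the adapter gives back the base model.
    """
    if adapter is None:
        return W
    A, B = adapter
    delta = matmul(A, B)
    return [[w + dlt for w, dlt in zip(rw, rd)] for rw, rd in zip(W, delta)]

# Frozen base weights and a rank-1 adapter (toy values).
W = [[0.2, -0.1], [0.4, 0.3]]
snapshot = copy.deepcopy(W)
adapter = ([[1.0], [2.0]], [[0.5, -0.25]])   # A is 2 x 1, B is 1 x 2

with_lora = effective_weights(W, adapter)    # task-specific behavior
without_lora = effective_weights(W)          # original behavior

print(with_lora != W)            # adapter changes the effective weights
print(without_lora == snapshot)  # base weights are untouched
```

The point of the sketch is only that the base matrix is never overwritten; whatever the base model knew is still there whenever the adapter is not applied.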

Thank you for your response!

I was thinking the same.
However, in a practical LLM application this (decision of LoRA / No-LoRA) would demand:

  • either a pre-classification LLM call to decide LoRA / No-LoRA
  • or a second (No-LoRA) call (on somehow realizing that the first call result wasn’t as expected)

This (pre-classification / post-No-LoRA-call) strategy is equally applicable to fully fine-tuned models. Doesn’t that reduce LoRA’s benefits to just scaling and computational efficiency?

If I understand correctly, you want to automate whether to use LoRA or the original model itself. I would suppose that could be done in some way.
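One possible way, sketched very roughly in Python: a small router decides per request whether to attach an adapter or use the base model. Everything here is an assumption for illustration. The adapter names, the keyword rule, and the functions `choose_adapter` and `handle` are hypothetical, and a real system would likely replace the keyword rule with a proper classifier.

```python
# Hypothetical routing sketch: pick an adapter name, or None for the base model,
# before making the main LLM call.

ADAPTER_KEYWORDS = {
    "summarize": "summarization-lora",   # hypothetical adapter names
    "translate": "translation-lora",
}

def choose_adapter(prompt):
    """Return the adapter to attach for this prompt, or None to use the base model."""
    lowered = prompt.lower()
    for keyword, adapter_name in ADAPTER_KEYWORDS.items():
        if keyword in lowered:
            return adapter_name
    return None

def handle(prompt):
    """Describe which weights would serve the request (no real model is called)."""
    adapter = choose_adapter(prompt)
    return "base" if adapter is None else f"base + {adapter}"

print(handle("Please summarize this report"))
print(handle("What is the capital of France?"))
```

Since the base weights are shared, switching between "base" and "base + adapter" only means attaching or detaching the small A and B matrices, rather than loading a separate full copy of the model.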

The benefits of LoRA are that you don’t change the original model, you don’t need heavy computational power to fine-tune it, and you don’t forget previously learned tasks. Fine-tuning is definitely a better way to tune for a specific task.

Other than that, you could devise other usage mechanisms.