Why do we need PEFT when trying to avoid the reward hacking that occurs during RLHF? Is it for retraining the reference model, or for the RL (PPO) step?
PEFT is applied to the LLM (the policy model), not to the RL algorithm itself: only the small adapter weights are trained while the base model stays frozen. PEFT also improves the model's performance as training goes on. A minimal setup sketch follows.
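Here is a rough sketch of what "PEFT on the LLM" looks like in code, assuming the Hugging Face `peft` and `transformers` libraries (the thread doesn't name a specific library, and `"gpt2"` plus the LoRA hyperparameters are just placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base LLM; "gpt2" is only a placeholder model name.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA adapter config; r, alpha, and target_modules are illustrative values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

# Wrap the model: all base weights are frozen, and only the small
# adapter matrices receive gradients during RLHF fine-tuning.
policy = get_peft_model(base_model, lora_config)
policy.print_trainable_parameters()  # typically well under 1% trainable
```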
Thanks @gent.spah. What confused me is the following diagram from our lecture notes; it's not clear what the PEFT adapter is for.
PPO steers the PEFT adapter weights in the proper direction so the model doesn't become biased. Because only the adapter is updated, the frozen base model can also serve as the reference model for the KL-divergence penalty that guards against reward hacking; see the sketch below.
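To make that second point concrete, here is a hypothetical sketch (not the exact internals of any RLHF library) of how the same weights can play both roles, continuing from the `policy` object above and using `peft`'s `disable_adapter()` context manager; the prompt text and the averaged KL form are illustrative:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("The movie was", return_tensors="pt").input_ids

# Policy distribution: base weights + trained LoRA adapter.
policy_logits = policy(input_ids).logits

# Reference distribution: the very same weights with the adapter
# switched off, i.e. the original frozen model -- no second full copy.
with policy.disable_adapter():
    with torch.no_grad():
        ref_logits = policy(input_ids).logits

policy_logp = F.log_softmax(policy_logits, dim=-1)
ref_logp = F.log_softmax(ref_logits, dim=-1)

# Per-token KL(policy || reference), averaged over positions. PPO adds a
# penalty like this to the reward so the policy can't drift arbitrarily
# far from the reference model in search of high reward (reward hacking).
kl_penalty = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1).mean()
```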
That makes sense. Thanks, @gent.spah!