PEFT during avoidance of reward hacking

Why do we need PEFT when trying to avoid the reward hacking that occurs during RLHF? Is it used for retraining the reference model, or for the RL (PPO) step?


PEFT is applied to the LLM itself, not to the RL algorithm. PEFT also improves the model's performance as training goes on.
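To make "applied to the LLM itself" concrete, here is a minimal sketch of attaching a LoRA adapter to a causal LM with the Hugging Face `peft` library. The model name and hyperparameters are placeholders, not values from the course:

```python
# Minimal sketch: wrap a causal LM with a LoRA adapter (peft library).
# "gpt2" and the LoRA hyperparameters below are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,             # rank of the low-rank update matrices
    lora_alpha=32,   # scaling factor applied to the adapter output
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```

The base weights stay frozen; only the small adapter is trained during fine-tuning.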

Thanks @gent.spah. What confused me is the following diagram in our lecture notes. It's not clear what the PEFT adapter is for.
[diagram from the lecture notes showing the PEFT adapter in the RLHF/PPO setup]

PPO steers the PEFT adapter weights in the proper direction so the model doesn't become biased. And because only the adapter weights are updated, the frozen base model can double as the reference model for the KL-divergence penalty that guards against reward hacking, so you don't need to keep a second full copy of the LLM in memory.
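Here is a rough sketch of that setup, assuming the pre-0.12 `trl` PPOTrainer API (the API has changed across versions); the model name and hyperparameters are again placeholders:

```python
# Sketch: PPO updates only the LoRA adapter; the frozen base model serves
# as the reference for the KL penalty. Assumes the older trl PPOTrainer API.
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from peft import LoraConfig

lora_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

# Value-head model wrapping the frozen base LLM plus the trainable adapter.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2", peft_config=lora_config
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(
    model_name="gpt2",
    learning_rate=1.41e-5,
    init_kl_coef=0.2,  # weight of the KL penalty against the reference model
)

# ref_model=None: with a PEFT model, trl disables the adapter to recover the
# frozen base weights and uses them as the reference model for the KL penalty.
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

# One PPO step would look like the line below; query/response tensors and
# rewards (from the reward model) are not shown here.
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Passing `ref_model=None` with a PEFT model is what saves the memory: the trainer toggles the adapter off to get the reference logits instead of loading a duplicate model.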

That makes sense. Thanks, @gent.spah!
