Reinforcement learning for LLMs

I have a question: in the attached image, what is the reference model?

The reference model is the original LLM with its weights frozen. Its outputs are compared against those of the active LLM, whose weights are being updated with PPO (reinforcement learning), possibly through PEFT.

The idea is to keep the active LLM from diverging too far from the original LLM. Otherwise it might start producing nonsensical text that still scores very well with the reward model!
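In case it helps, here is a toy sketch of how that "don't diverge too far" idea is commonly enforced: a KL-divergence penalty between the active and reference models' next-token distributions is subtracted from the reward. I can't see exactly what your image shows, so treat this as an assumption about the setup. The function names, the three-token vocabulary, and `beta=0.1` are all made up for illustration; real training code computes this per token over whole sequences.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, active_logits, reference_logits, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference.

    beta is a hypothetical penalty weight chosen for this example.
    """
    p_active = softmax(active_logits)
    p_ref = softmax(reference_logits)
    return reward - beta * kl_divergence(p_active, p_ref)

# Toy next-token logits over a three-token vocabulary (made-up numbers).
reference_logits = [2.0, 1.0, 0.1]  # frozen original model
close_logits = [2.1, 0.9, 0.1]      # active model, still close to the original
drifted_logits = [0.1, 0.1, 5.0]    # active model, drifted far away

same_reward = 1.0  # pretend the reward model scores both outputs equally
r_close = penalized_reward(same_reward, close_logits, reference_logits)
r_drift = penalized_reward(same_reward, drifted_logits, reference_logits)

print("close to reference:", r_close)
print("drifted from reference:", r_drift)
```

With the same raw reward, the drifted model ends up with the lower penalized reward, so PPO is pushed to please the reward model without wandering too far from the original LLM.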