Is there a problem with GRPO loss computation?

In lesson 7, for the loss computation, the instructor computes the probability ratio pi_theta/pi_ref, which differs from the original paper, where the clipped objective uses pi_theta/pi_old (pi_ref only appears in the KL penalty term).

In this setup we only need 2 models: the reference model (which is frozen) and the LoRA model. As a result, the LoRA model only optimizes over answers sampled from the reference model, which is limiting because we are only trying to get the best out of the base model's own outputs.

Wouldn't it be more reasonable to use pi_theta/pi_old, i.e. snapshot and refresh pi_old from the LoRA model at every iteration, so that sampling comes from the continually improving policy?
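To make the distinction concrete, here is a minimal sketch of the two ratio variants, computed from per-token log-probabilities. All values and names here are hypothetical/illustrative, not the course's actual code:

```python
import math

def importance_ratio(logp_theta, logp_sampler):
    # Per-token importance ratio pi_theta / pi_sampler,
    # computed in log space for numerical stability.
    return [math.exp(a - b) for a, b in zip(logp_theta, logp_sampler)]

# Hypothetical per-token log-probs for one sampled completion.
logp_theta = [-1.2, -0.8, -2.0]   # current (LoRA) policy
logp_ref   = [-1.5, -0.9, -1.8]   # frozen reference model

# Lesson-7 variant: ratio against the frozen reference.
# The sampling distribution never changes, so the policy only
# reweights completions the reference/base model produces.
ratio_vs_ref = importance_ratio(logp_theta, logp_ref)

# Paper variant: pi_old is a snapshot of the current policy taken
# before the update and refreshed each iteration, so later
# iterations sample from the improved policy.
logp_old = list(logp_theta)       # snapshot just before the update
ratio_vs_old = importance_ratio(logp_theta, logp_old)
# Immediately after the snapshot the ratio is exactly 1.0 per token;
# it drifts away from 1.0 as gradient steps are taken.
```

The practical trade-off is that the paper's variant requires re-sampling (or at least re-scoring) with an updated pi_old each iteration, while the lesson's variant can reuse the frozen reference for both sampling and the ratio.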