Is there a problem with GRPO loss computation?

In lesson 7, for the loss computation, the instructor computes the probability ratio pi_theta/pi_ref, which differs from the original paper, where the clipped objective uses pi_theta/pi_old (pi_ref only appears in the KL penalty term).

In this setup we only need 2 models: the reference model (which is frozen) and the LoRA model. As a result, the LoRA model only optimizes over answers sampled from the reference model, which is limiting because we are only trying to get the best out of the base model's own outputs.

Wouldn't it be more reasonable to use pi_theta/pi_old, i.e. snapshot and refresh pi_old from the LoRA model at every iteration, so that sampling comes from the continually improving policy?
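To make the distinction concrete, here is a minimal sketch of the two ratio variants, computed from per-token log-probabilities. All values and names here are hypothetical/illustrative, not the course's actual code:

```python
import math

def importance_ratio(logp_theta, logp_sampler):
    # Per-token importance ratio pi_theta / pi_sampler,
    # computed in log space for numerical stability.
    return [math.exp(a - b) for a, b in zip(logp_theta, logp_sampler)]

# Hypothetical per-token log-probs for one sampled completion.
logp_theta = [-1.2, -0.8, -2.0]   # current (LoRA) policy
logp_ref   = [-1.5, -0.9, -1.8]   # frozen reference model

# Lesson-7 variant: ratio against the frozen reference.
# The sampling distribution never changes, so the policy only
# reweights completions the reference/base model produces.
ratio_vs_ref = importance_ratio(logp_theta, logp_ref)

# Paper variant: pi_old is a snapshot of the current policy taken
# before the update and refreshed each iteration, so later
# iterations sample from the improved policy.
logp_old = list(logp_theta)       # snapshot just before the update
ratio_vs_old = importance_ratio(logp_theta, logp_old)
# Immediately after the snapshot the ratio is exactly 1.0 per token;
# it drifts away from 1.0 as gradient steps are taken.
```

The practical trade-off is that the paper's variant requires re-sampling (or at least re-scoring) with an updated pi_old each iteration, while the lesson's variant can reuse the frozen reference for both sampling and the ratio.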