Why doesn't anyone apply GRPO fine-tuning on an already GRPO fine-tuned model?

Why don't we apply RL fine-tuning for a new task to an already RL fine-tuned model? What are the caveats?

If you have already travelled from city A to city B and now want to travel to city C, but city C is closer to A than to B, then it's better to start from city A, right?

The same analogy applies to LLMs when fine-tuning: the base model is city A, the fine-tuned model is city B, and the new task is city C.

I would like to present a scenario:
Suppose I have a math-related problem and I RL-fine-tune a model on it. Now I have another math problem that is somewhat different. What makes sense to me is that, since the RL-fine-tuned model is now better at math problems, I should apply RL again on top of it and get better results.

In this case, city C would be closer to city B than to A, so it seems better to start from the already fine-tuned model.
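
Just to make the question concrete: mechanically, nothing stops you from running a second GRPO round on top of the first. Here is a minimal sketch using TRL's GRPOTrainer, assuming a plain-text (non-conversational) prompt dataset; the checkpoint path, dataset name, and reward function are made-up placeholders, not a recipe from any paper:

```python
# Minimal sketch of a second GRPO round on top of an already GRPO-fine-tuned
# checkpoint, using TRL. Paths, dataset name, and reward are hypothetical.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Checkpoint produced by the first GRPO run on "math task 1" (placeholder path).
PREV_RL_CHECKPOINT = "./checkpoints/math-task1-grpo"

# Prompts for the new, related math task; GRPOTrainer expects a "prompt"
# column (placeholder dataset name).
dataset = load_dataset("my-org/math-task2-prompts", split="train")

def reward_fn(completions, **kwargs):
    # Placeholder verifiable reward: 1 if the completion contains a boxed
    # answer, 0 otherwise. A real setup would check the answer itself.
    # With a plain-text prompt dataset, completions arrive as strings.
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

args = GRPOConfig(
    output_dir="./checkpoints/math-task2-grpo",
    num_generations=8,  # completions sampled per prompt (the "group")
    beta=0.04,          # KL penalty toward the reference policy, which is
                        # now initialized from the task-1 RL checkpoint
)

trainer = GRPOTrainer(
    model=PREV_RL_CHECKPOINT,  # nothing mechanically stops you from starting here
    reward_funcs=reward_fn,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```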

But if you examine research papers, nobody seems to be doing this. And if you think of an LLM as a search space, RL just narrows it down, so within that narrowed space a similar problem should be solvable even faster via another round of RLFT.
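
For what it's worth, the "narrowing" intuition has a concrete handle in the GRPO objective itself, shown here in its simplified sequence-level form as introduced in DeepSeekMath (some later variants drop the KL term). The clipped ratio and the KL penalty toward a frozen reference policy keep the updated policy close to wherever training starts, so a second RL round would be anchored to an already-narrowed policy, assuming the reference is re-initialized from the new starting checkpoint, as is typical:

```latex
% GRPO objective (simplified sequence-level form). For a prompt q, sample a
% group of G completions {o_1, ..., o_G} from the old policy, compute
% group-normalized advantages, and penalize divergence from pi_ref.
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q,\;\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
    \Biggl[ \frac{1}{G} \sum_{i=1}^{G}
      \min\!\Bigl( r_i(\theta)\,\hat{A}_i,\;
                   \mathrm{clip}\bigl(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_i \Bigr)
      \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr]
    \Biggr]
\]
where
\[
r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\qquad
\hat{A}_i = \frac{R_i - \mathrm{mean}(R_1,\dots,R_G)}{\mathrm{std}(R_1,\dots,R_G)}.
\]
```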

A counterexample I can think of: if this were true, reasoning models wouldn't be as limited as they are right now. If you could just keep stacking RLFT rounds and make a model better and better, wouldn't that be great?

I am just unable to map this back to LLMs: why is RL on top of RL bad, even for a similar problem? In practice, authors usually alternate, doing one SFT stage, then one RL stage, then another SFT stage, then another RL stage. The Qwen3 models were trained like this.