There might be an error in the loss function on the slide on page 21.

The loss function of the reward model seems to be missing a negative sign. I've checked the newer version of the paper, and it has been fixed there; see the v3 version of the paper.
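To show why the sign matters, here is a minimal sketch. It assumes the reward model is trained with the common pairwise (Bradley-Terry style) objective, loss = -log(sigmoid(r_chosen - r_rejected)); the thread does not quote the exact formula, so that form and the names `pairwise_reward_loss`, `r_chosen`, and `r_rejected` are my own assumptions, not taken from the slide or the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Uses the identity -log(sigmoid(d)) = softplus(-d).
    The leading minus sign turns a log-likelihood (to be maximized)
    into a loss (to be minimized).
    """
    return softplus(-(r_chosen - r_rejected))


if __name__ == "__main__":
    # Correct ranking (chosen scored higher) should give the smaller loss.
    print(pairwise_reward_loss(2.0, 0.0))
    # Wrong ranking (rejected scored higher) should give the larger loss.
    print(pairwise_reward_loss(0.0, 2.0))
```

With the negative sign, the loss is non-negative and shrinks as the preferred response is scored higher. Without it, minimizing the expression would push the model to rank the rejected response above the chosen one.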

I agree with you.