The loss function of the reward model seems to be missing a negative sign. I checked the newer version of the paper, and it has been fixed there. See v3 of the paper: https://arxiv.org/pdf/2009.01325.pdf
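To make the sign issue concrete, here is a minimal sketch, assuming the pairwise form from the v3 paper: loss = -E[log σ(r(x, y_i) - r(x, y_{1-i}))], where y_i is the human-preferred summary. The function names and the scalar reward values are made up for illustration; a real reward model would output these scores from a network.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; fine for the small illustrative values used here."""
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_reward_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-probability that the preferred summary wins the comparison.

    The leading minus sign is what turns the log-likelihood into a loss:
    log(sigmoid(.)) is always <= 0, so without the sign, minimizing it would
    push the model to score the rejected summary higher.
    """
    return -math.log(sigmoid(r_preferred - r_rejected))


# Hypothetical reward scores for one comparison pair.
loss_when_ranking_is_right = pairwise_reward_loss(2.0, 0.0)
loss_when_ranking_is_wrong = pairwise_reward_loss(0.0, 2.0)
loss_when_tied = pairwise_reward_loss(0.0, 0.0)

print(loss_when_ranking_is_right, loss_when_ranking_is_wrong, loss_when_tied)
```

With the negative sign, the loss is positive and gets smaller as the preferred summary's score pulls ahead, which is the behavior you want when minimizing.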
I agree with you.