Is RLHF like fine-tuning a classification model at the end of each generation?

RLHF essentially selects one output for a prompt from multiple available options. Can this be treated like a classification model applied at the end of each generation?

As with prompt fine-tuning, is it possible to update the model using this classification model?

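Yes, to a good approximation. The reward model in RLHF is trained much like a classifier: given several candidate outputs for the same prompt, it learns to predict which one human raters prefer. It is also possible to update the model with it. The language model is then tuned using the reward model's scores, typically with a reinforcement learning algorithm such as PPO, so the classifier acts as a reward signal rather than as a source of supervised labels.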
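To make the classification view concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model. It treats "the chosen output beats the rejected one" as a binary classification problem. Everything in it is an illustrative assumption: the linear `reward` function, the two-dimensional feature vectors, and the toy `pairs` stand in for a real neural reward model and real human comparisons. It covers only the reward-model stage, not the later policy update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(w, x):
    # Hypothetical linear reward model: a score for one output's feature vector.
    return sum(wi * xi for wi, xi in zip(w, x))

def preference_loss(w, chosen, rejected):
    # Binary cross-entropy on "chosen is preferred over rejected".
    margin = reward(w, chosen) - reward(w, rejected)
    return -math.log(sigmoid(margin))

def sgd_step(w, chosen, rejected, lr=0.1):
    # Gradient of -log(sigmoid(margin)) w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected),
    # so gradient descent adds lr * (1 - sigmoid(margin)) * (chosen - rejected).
    margin = reward(w, chosen) - reward(w, rejected)
    coeff = 1.0 - sigmoid(margin)
    return [wi + lr * coeff * (c - r) for wi, c, r in zip(w, chosen, rejected)]

# Toy comparisons (made up): each pair is (features of preferred output, features of rejected output).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
]

w = [0.0, 0.0]
initial_loss = sum(preference_loss(w, c, r) for c, r in pairs) / len(pairs)

for _ in range(100):
    for c, r in pairs:
        w = sgd_step(w, c, r)

final_loss = sum(preference_loss(w, c, r) for c, r in pairs) / len(pairs)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

With untrained weights both outputs get the same score, so the loss starts at log 2, the same as a coin-flip classifier. In a full RLHF pipeline the trained scores would then serve as the reward when tuning the language model.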
