RLHF essentially amounts to selecting one output for a prompt from several candidate outputs. Can this selection be treated as a classification problem, with a classification model applied at the end of each generation?
Just as we did with prompt fine-tuning, is it possible to update the model using this classification model?
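
To make the question concrete, the "classification" view could be sketched roughly as follows: each candidate output gets a scalar score, a softmax turns the scores into a distribution over candidates, and the human's pick is the class label. This is only an illustration under assumptions; the scores, `choice_loss`, and `choice_loss_grad` are hypothetical names and values, and this is not necessarily how actual RLHF pipelines are implemented.

```python
import math

def softmax(scores):
    """Turn raw candidate scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def choice_loss(scores, chosen_index):
    """Cross-entropy loss with the human-chosen candidate as the label."""
    probs = softmax(scores)
    return -math.log(probs[chosen_index])

def choice_loss_grad(scores, chosen_index):
    """Gradient of the loss w.r.t. each score: softmax minus one-hot label."""
    probs = softmax(scores)
    return [p - (1.0 if i == chosen_index else 0.0) for i, p in enumerate(probs)]

# Hypothetical scores for three candidate outputs of one prompt.
candidate_scores = [0.2, 1.5, -0.3]
human_choice = 1  # index of the output the annotator preferred (assumed)

loss = choice_loss(candidate_scores, human_choice)
grad = choice_loss_grad(candidate_scores, human_choice)
print(f"loss={loss:.4f}, grad={grad}")
```

In this sketch, a gradient step on the scores raises the score of the chosen output and lowers the others. Whether and how that signal can be pushed back into the generating model itself is exactly what the question is asking.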