Hi All,
I was able to understand the training process of the reward model. What I want to know is how we feed (Prompt X, Y_j) and (Prompt X, Y_k), together with the labels [0, 1], to the data loaders, and how these are preprocessed before being passed to the reward model.
Since we have roughly 3 sets of pairwise completions for each prompt X, I cannot work out how they are given to the reward model as inputs and labels. Please explain in detail, if possible with a code snippet.
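To make the question concrete, here is a rough sketch of what I imagine happens. It may well be wrong. It assumes each prompt comes with completions ranked best to worst, that every pair becomes its own training example with the preferred completion stored first (so the label is implied by position rather than passed separately), and that the loss is the common pairwise form -log(sigmoid(r_chosen - r_rejected)). The names `build_pairs`, `pairwise_loss` and `toy_reward` are made up for illustration, and `toy_reward` is only a stand-in for a real reward model.

```python
import math
from itertools import combinations


def build_pairs(prompt, ranked_completions):
    """Expand completions ranked best-to-worst into one example per pair."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_completions, 2)
    ]


def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


def toy_reward(prompt, completion):
    # Hypothetical stand-in for the reward model: scores by completion length.
    return float(len(completion))


# Three ranked completions for one prompt give three pairwise examples.
pairs = build_pairs("Prompt X", ["Y_best", "Y_mid", "Y_bad"])

# Each example is scored twice (chosen and rejected) and the two scalar
# rewards are compared by the loss; a batch would average these values.
losses = [
    pairwise_loss(
        toy_reward(p["prompt"], p["chosen"]),
        toy_reward(p["prompt"], p["rejected"]),
    )
    for p in pairs
]
batch_loss = sum(losses) / len(losses)
```

If this is roughly right, I assume the list of pairs would be wrapped in a dataset, with the prompt and each completion concatenated and tokenized before going to the model, but that is the part I would like confirmed.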
Thank you