In your code for preparing a dataset for fine-tuning a language model with PPO (Proximal Policy Optimization) for Reinforcement Learning from Human Feedback (RLHF), you tokenize the prompt and then immediately decode it, which looks redundant. Why not just write `sample["query"] = prompt`?
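For reference, here is a minimal sketch of the pattern being asked about, assuming a TRL-style dataset-preparation step with a Hugging Face tokenizer (the model name, field names, and token budget are illustrative, not from the original code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

def prepare_sample(sample, max_tokens=16):
    # Tokenize the raw prompt, truncating it to a fixed token budget.
    sample["input_ids"] = tokenizer.encode(
        sample["prompt"], truncation=True, max_length=max_tokens
    )
    # Decode the (possibly truncated) ids back to text and store as "query".
    # Note: if truncation kicked in, this string is shorter than the prompt.
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample
```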
Maybe it's redundant. Try it with just the input text and see whether the PPO trainer accepts it that way, and also check whether the decoded output is the same as the original text!
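One quick way to run that round-trip check, again assuming the same GPT-2 tokenizer as above (the sample prompt is made up for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # same illustrative tokenizer

prompt = "The quick brown fox jumps over the lazy dog."  # made-up sample text

# Round-trip: encode the text to token ids, then decode back to a string.
ids = tokenizer.encode(prompt)
round_trip = tokenizer.decode(ids)
print(round_trip == prompt)  # True for this prompt with GPT-2's BPE tokenizer

# With truncation, the round-trip visibly diverges from the original text.
short_ids = tokenizer.encode(prompt, truncation=True, max_length=5)
print(tokenizer.decode(short_ids))  # only the first few words survive
```

Even without truncation, the round-trip is not guaranteed to be lossless for every tokenizer: some normalize whitespace or unicode, and some add special tokens unless you pass `skip_special_tokens=True` to `decode`. So equality holding for one prompt does not prove it holds across the whole dataset.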