Clarification on Optional video: Proximal policy optimization

The MDP formulation introduced in this video seems a bit different from my understanding of the InstructGPT paper. In that paper, I believe the reward model never evaluates the quality of an incomplete response: a reward is assigned only once the full response has been generated. This reading also matches how the reward model is trained, i.e., it is trained to score complete responses, not partial ones, so every intermediate step would effectively receive zero reward.
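
To make that sparse-reward reading concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in for illustration (`DummyPolicy`, `DummyRewardModel`, the token strings); it is not the actual InstructGPT or video implementation, just the structure I have in mind.

```python
import random

class DummyPolicy:
    """Hypothetical stand-in for the LM policy."""
    def sample_next(self, tokens):
        # Placeholder sampling; a real policy would condition on `tokens`.
        return random.choice(["hello", "world", "<eos>"])

class DummyRewardModel:
    """Hypothetical stand-in for the trained reward model.

    It is only ever called on a *complete* response, mirroring how the
    reward model is trained in this reading of the paper.
    """
    def score(self, prompt, response):
        return float(len(response))  # placeholder scalar reward

def rollout_rewards(prompt, policy, reward_model, eos_token="<eos>", max_len=8):
    """Roll out a response token by token under the sparse-reward view:
    every intermediate step gets reward 0, and only the terminal step
    (when the response is complete) is scored by the reward model."""
    response, rewards = [], []
    for step in range(max_len):
        token = policy.sample_next(prompt + response)
        response.append(token)
        done = token == eos_token or step == max_len - 1
        if done:
            # Terminal step: reward model scores the complete response.
            rewards.append(reward_model.score(prompt, response))
            break
        # Incomplete response: no reward signal at this step.
        rewards.append(0.0)
    return response, rewards

response, rewards = rollout_rewards(["Prompt:"], DummyPolicy(), DummyRewardModel())
print(response, rewards)  # e.g. ['hello', '<eos>'] [0.0, 2.0]
```

Under this view, the per-token reward sequence is all zeros except at the end, which is what I understood to differ from the per-step formulation in the video.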