If reinforcement learning works on a reward-based system, which is all we wanted to do with the lunar lander, why do we need to map X (current state and action) to Y (rewards)? What exactly would learning the parameters of this mapping give us?
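To make the question concrete, here is a toy sketch of the mapping I mean (my own minimal example, not the course's lunar-lander code, assuming X = (state, action) and Y = R + γ·max Q(s′, a′) as in the lectures):

```python
import numpy as np

# Toy 1-D "lander": states 0..4, state 4 is the goal.
# We learn Q(s, a), the mapping from X = (state, action)
# to Y = expected return, stored here as a simple table.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
gamma, alpha = 0.9, 0.5

def step(s, a):
    """Hypothetical environment: reward 1 only on reaching the goal."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)
for _ in range(2000):
    s = int(rng.integers(0, n_states - 1))   # sample a non-goal state
    a = int(rng.integers(0, n_actions))      # sample an action
    s_next, r = step(s, a)
    # Target Y = R + gamma * max_a' Q(s', a'); learning the parameters
    # (here, the table entries) makes Q(s, a) approximate this target.
    y = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (y - Q[s, a])

# The learned mapping is what lets the agent act: in each state,
# pick a = argmax_a Q(s, a).
policy = Q.argmax(axis=1)
print(policy[:-1])
```

The rewards alone only score individual transitions; the learned Q mapping is what tells the agent, in any state, which action leads to the most reward in the long run.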
Hello @Mohd_Farhan_Hassan, thanks for posting your query. Could you please specify the video your question is about? That will help us address your doubt more effectively.