Hello Kaitian,
Congratulations on making it to the last lab. This RL lab is my favourite of the specialization, so I also read the underlying code. Indeed, the reward returned from the .step function considers both the current and the next state, so I agree that it’s more like a reward from the next state.
However, this also brings up an interesting point: in this case, a state doesn’t always have the same reward, because we always need to know two consecutive states to calculate it. How would we assign the reward? To the current state, or to the next state? Sounds like it could be controversial, doesn’t it?
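To make that concrete, here is a tiny toy sketch. It is not the lab’s actual reward code: `potential` and `step_reward` are hypothetical names I made up, and the "closer to zero is better" rule is just an assumption for illustration. It only shows how a reward computed from two consecutive states can differ even when the next state is the same.

```python
def potential(state):
    # Hypothetical "goodness" of a state: closer to 0 is better.
    return -abs(state)


def step_reward(state, next_state):
    # The reward depends on the transition, not on either state alone.
    return potential(next_state) - potential(state)


# Same next state (1), but different rewards depending on where we came from.
reward_from_far = step_reward(3, 1)   # moved closer to 0
reward_from_near = step_reward(0, 1)  # moved away from 0
print(reward_from_far, reward_from_near)
```

So "the reward of state 1" is not a single number here; it only makes sense per transition.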
But let’s put this aside for a while and look at another fact: the loss function we use to train the DQN doesn’t have to be the Bellman equation, no matter how much our lab’s loss function looks like it. Once we relax things and allow any form of loss function, the inconsistency should be gone, right? I personally like the loss function the lab is using because I want my DQN to learn what rewards to get by taking this action at this state. That’s it. That’s my rationale for accepting the loss function while still being happy with the Bellman equation.
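For reference, this is roughly how I picture a Bellman-style target and loss for a single transition. It is a minimal sketch in plain Python, not the lab’s implementation: the function names, the default `gamma`, and the use of a squared error are my assumptions.

```python
def td_target(reward, next_q_values, done, gamma=0.995):
    # y = r if the episode ended, otherwise r + gamma * max_a' Q(s', a').
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, reward, next_q_values, done, gamma=0.995):
    # Squared gap between the target and the current estimate Q(s, a).
    target = td_target(reward, next_q_values, done, gamma)
    return (target - q_value) ** 2


# One transition: the reward is whatever .step returned for taking the action.
loss = squared_td_error(q_value=1.5, reward=1.0,
                        next_q_values=[0.5, 2.0], done=False, gamma=0.5)
print(loss)
```

Notice that the reward enters only as the number observed for this (state, action) transition, so the loss never needs to decide which state "owns" it.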
Raymond