In the lecture notes, the Bellman equation is defined as Q(s,a) = R_s + \gamma \max_{a'} Q(s',a'), where s' is the next state after taking action a at state s.
In the practice lab, Exercise 2 says to set y_j = R_j if the episode terminates at step j+1. However, if the episode terminates at step j+1 (e.g. we land successfully), we still receive a reward on that last step, so taking y_j = R_j seems to ignore the reward we get at the end.
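For reference, the full target in the lab is of the standard DQN form (my paraphrase; \hat{Q} denotes the target network):

y_j = R_j                                          if the episode terminates at step j+1
y_j = R_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a')  otherwise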
Then I looked more closely at how R_j is defined in the code. In the agent-environment loop, the ".step" function returns the reward together with the next state, so R_j should instead be read as the reward for taking action a_j in state s_j.
In Section 9, "Train the Agent", the code has:
next_state, reward, done, _ = env.step(action)
# Store the experience tuple (S, A, R, S') in the memory buffer.
# We store the done variable as well for convenience.
memory_buffer.append(experience(state, action, reward, next_state, done))
So the (S,A,R,S') tuples in the memory_buffer are not exactly what the lecture defines: the third element R is not the reward of the current state S, but the reward of taking action A in the current state S.
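To make that concrete, here is a minimal sketch of how such tuples are typically turned into the target y_j. This is not the lab's actual compute_loss; q_target, the toy Q-values, and the gamma value are placeholders of my own:

import numpy as np

gamma = 0.995  # discount factor (value chosen only for this sketch)

def q_target(next_state):
    # Stand-in for the target Q-network: returns one Q-value per action.
    # The name and the zero values are mine, not the lab's.
    return np.zeros(4)

def td_target(reward, next_state, done):
    # y_j = R_j if the episode terminates at step j+1,
    # else y_j = R_j + gamma * max_a' Q_hat(s_{j+1}, a').
    # Either way R_j, the reward env.step returned for (s_j, a_j),
    # enters the target undiscounted.
    return reward + (1.0 - done) * gamma * np.max(q_target(next_state))

Written this way, a terminal step gives y_j = R_j, and that R_j is already the reward env.step returned for the final transition, so the landing reward does end up in the target.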
So under the lecture's definition, Q(s,a) is the reward of state s plus the discounted value of acting optimally after taking action a (with discount factor \gamma). In the code, Q(s,a) is instead the reward of taking action a in state s plus the discounted value of acting optimally afterwards, and the immediate reward itself is not discounted.
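Writing the two readings side by side (my notation: R(s) for a reward attached to the state, R(s,a) for the reward env.step returns):

Lecture reading:  Q(s,a) = R(s) + \gamma \max_{a'} Q(s',a')
Code reading:     Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s',a')

In both cases only the future term is discounted; the immediate reward is not multiplied by \gamma.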
I don't think this changes the structure or validity of the model, but it did cause me some confusion.