Inconsistent definition for the Bellman equations

In the lecture notes, the Bellman equation is defined as Q(s,a) = R_s + \gamma \max_{a'} Q(s',a'), where s' is the next state after taking action a in state s.

Exercise 2 of the practice lab says to define y_j to be R_j if the episode ends at step j+1. However, if the episode ends at step j+1 (e.g. we land successfully), then we can get a reward at the last step. Taking y_j = R_j seems to ignore the reward we can get in that last step.
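
For reference, the target rule from Exercise 2 can be sketched like this. It is a minimal NumPy sketch, not the lab's code: `compute_targets`, the argument names, and the default `gamma=0.995` are assumptions made for illustration.

```python
import numpy as np

def compute_targets(rewards, max_next_q, done, gamma=0.995):
    """y_j = R_j if the episode ended at step j+1, else R_j + gamma * max_a' Q(s', a')."""
    rewards = np.asarray(rewards, dtype=float)
    max_next_q = np.asarray(max_next_q, dtype=float)
    done = np.asarray(done, dtype=float)  # 1.0 where the transition ended the episode
    # (1 - done) zeroes out the bootstrap term for terminal transitions.
    return rewards + gamma * max_next_q * (1.0 - done)

# Hypothetical batch of three transitions; the last one ends the episode.
y = compute_targets(
    rewards=[1.0, -0.5, 100.0],
    max_next_q=[10.0, 20.0, 30.0],
    done=[0, 0, 1],
    gamma=0.5,
)
print(y)
```

Under this rule the terminal transition keeps its own R_j and only drops the bootstrap term, so what matters is which reward R_j actually holds.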

I then looked closely at how R_j is defined in the code. It seems that in the agent-environment model, the “.step” function returns the reward of the next state. So R_j should instead be read as the reward for taking action a in state j.

In Section 9, Train the agent, the code has

    next_state, reward, done, _ = env.step(action)
    # Store experience tuple (S,A,R,S') in the memory buffer.
    # We store the done variable as well for convenience.
    memory_buffer.append(experience(state, action, reward, next_state, done))

So the (S,A,R,S’) tuple in the memory_buffer is not exactly the same as the definition in the lecture. The third element R is not the reward of the current state S, but the reward for taking action A in the current state S.

So the original definition is Q(s,a) = reward of state S + the return from acting optimally after taking action a (discounted by \gamma). In the code, it seems that Q(s,a) = reward for taking action a in state S + the return from acting optimally afterwards (discounted by \gamma, with no discount on the immediate reward).
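
Written out, the two readings differ only in what the reward is indexed by. The subscript notation R_{s,a,s'} below is an assumption made for illustration; it does not come from the lecture notes.

```latex
% Lecture reading: the reward belongs to the current state s
Q(s,a) = R_s + \gamma \max_{a'} Q(s',a')

% Code reading: the reward belongs to the transition (s, a, s')
Q(s,a) = R_{s,a,s'} + \gamma \max_{a'} Q(s',a')
```

In both forms the discount \gamma applies only to the term after the immediate reward.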

I don’t think this changes the structure or validity of the model, but it does cause some confusion.

Hello Kaitian,

Congratulations on making it to the last lab. This RL lab is my favourite lab of the specialization, so I also read the underlying code. Indeed, the reward returned from the .step function considers both the current and the next state, so I agree that it’s more like a reward from the next state.

However, this also brings up an interesting point: in this case, a state doesn’t always have the same reward, because we always need to know two consecutive states to calculate the reward. How would we assign the reward? To the current state, or to the next state? Sounds like it could be controversial, doesn’t it?
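
To make the point concrete, here is a toy sketch of a reward computed from two consecutive states. It is not the Lunar Lander source: `potential`, `transition_reward`, and the distance-based scoring are hypothetical, chosen only to show that the same state can come with different rewards depending on the state before it.

```python
def potential(state):
    """Hypothetical score for a 1-D state: closer to the pad at x = 0 is better."""
    return -abs(state)

def transition_reward(state, next_state):
    """Reward for a transition: the change in potential between consecutive states."""
    return potential(next_state) - potential(state)

# Arriving at the same state x = 2.0 from two different predecessors.
r_from_far = transition_reward(3.0, 2.0)   # moved towards the pad
r_from_near = transition_reward(1.0, 2.0)  # moved away from the pad
print(r_from_far, r_from_near)
```

With a reward like this, neither "reward of the current state" nor "reward of the next state" is quite right; it is a reward of the transition.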

But let’s put this aside for a while and look at another fact: the loss function we use to train the DQN doesn’t have to be the Bellman equation, however much our lab’s loss function looks like it. With that relaxation, that any form of loss function may be used, the inconsistency should be gone, right? I personally like the loss function the lab is using because I want my DQN to learn what rewards to get by taking this action at this state. That’s it. That’s my rationale for accepting the loss function while being happy with the Bellman equation.
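
As a rough illustration of "learning what rewards to get by taking this action at this state", here is a sketch of a regression-style loss between targets and the Q-values of the actions actually taken. It assumes a mean-squared-error loss; `mse_loss` and its argument names are hypothetical and not the lab's function.

```python
import numpy as np

def mse_loss(q_values, actions, targets):
    """Mean squared error between targets y_j and Q(s_j, a_j) for the actions taken."""
    q_values = np.asarray(q_values, dtype=float)  # shape (batch, num_actions)
    actions = np.asarray(actions, dtype=int)      # shape (batch,)
    targets = np.asarray(targets, dtype=float)    # shape (batch,)
    # Pick, for each row, the Q-value of the action that was taken.
    q_taken = q_values[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))

# Hypothetical batch of two transitions with two possible actions each.
loss = mse_loss(q_values=[[1.0, 2.0], [0.5, 3.0]], actions=[1, 0], targets=[2.0, 1.5])
print(loss)
```

Nothing in this loss requires the targets to come from one particular equation; it only pulls Q(s,a) towards whatever targets are supplied.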


Thanks Raymond. I agree with you that a state doesn’t always have the same reward. The reward is more likely to be related to both the current state and the next state.

I don’t understand how DQN can have other loss functions. Can you elaborate, for example by listing any papers on DQN that use other loss functions?

Hello Kaitian,

I think one counterexample is enough. More generally speaking, the loss function (sometimes called the objective function) has to be tied to your objective rather than fixed to any particular equation, agree? Designing the right objective function is a job in itself.


This might be pedantic but the lectures said that the future only depends on the current state:

The term Markov in the MDP or Markov decision process refers to that the future only depends on the current state and not on anything that might have occurred prior to getting to the current state. In other words, in a Markov decision process, the future depends only on where you are now, not on how you got here.

source: Review of Key Concepts

Given this “Markov property”, is it correct to call the reinforcement learning setup in the Lunar Lander a Markov Decision Process?

I believe this is a Markov process. The next state of the lunar lander only depends on the current state. Say we are at position (x,y) now. Then the next position depends only on the current position (x,y), not on how we arrived at the current position (x,y).

The code actually reflects that. The experience tuple (S,A,R,S’) only records the current state, current action, future state, and rewards. It doesn’t record any information about the route from the origin to the current state.
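
A small sketch of what such a record looks like. The field names follow the `experience(state, action, reward, next_state, done)` call quoted earlier in the thread; the buffer capacity, the 1-D states, and the `"right"` action are hypothetical.

```python
from collections import deque, namedtuple

# One record per transition; there is no field for earlier states or actions.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

memory_buffer = deque(maxlen=1000)  # hypothetical capacity

# Hypothetical 1-D trajectory: states 0 -> 1 -> 2 -> 3, reward 0.0 at each step.
states = [0, 1, 2, 3]
for s, s_next in zip(states[:-1], states[1:]):
    memory_buffer.append(Experience(s, "right", 0.0, s_next, s_next == 3))

last = memory_buffer[-1]
print(last)
```

The last record only knows that the agent went from state 2 to state 3. The steps from 0 to 2 sit in other records and are not needed to interpret this one.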

Agree that it is a Markov process. I think the focus here is about the state, instead of the reward. The next state depends only on the current state and action.

Would the model’s accuracy improve if it took the past actions into account?

I think we will need to test it to find out, and see whether there is any improvement and whether the improvement is worthwhile. Expanding into the past always comes with some computational cost.


P.S. For the lunar lander’s case, the state vector has velocity and angular velocity. Since they are time derivative quantities (e.g. velocity = displacement/time_interval), from the physics point of view, they also carry a bit of the history. I think this may be why they are helpful state values to be included in the state vector. :wink:
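
A tiny sketch of that idea, under a simple finite-difference assumption. The function names are hypothetical, and the real simulator gets its velocities from its physics engine rather than from this formula.

```python
def finite_difference_velocity(prev_position, position, dt):
    """velocity = displacement / time_interval"""
    return (position - prev_position) / dt

def recover_prev_position(position, velocity, dt):
    """Invert the formula above: where was the lander one step earlier?"""
    return position - velocity * dt

# Hypothetical numbers: the lander dropped from height 10.0 to 8.0 in 0.5 time units.
v = finite_difference_velocity(prev_position=10.0, position=8.0, dt=0.5)
prev = recover_prev_position(position=8.0, velocity=v, dt=0.5)
print(v, prev)
```

So a state holding both position and velocity already encodes, approximately, where the lander was one step ago.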

Thank you for all your answers Raymond. I too noticed that velocity might carry history but modeling it as a state variable seemed satisfactory.

You are welcome Michael!
