Lunar lander reward


With the discrete state version, the reward only depends on s. But with the continuous version, the reward depends on the action taken.
Could you comment on that?
Also in Q(s, a) = R(s) + Gamma * max Q(s’, a’), does this reward apply to s’ or s?



Hi @ljb1706,

I sense two potential misunderstandings in your question.

  1. We always model the Q-value, not the reward. They are not the same thing - the former is what we anticipate, while the latter is what we actually get.

  2. The lectures use the mars rover examples to convey how we compute Q-values, not to demonstrate that rewards and states correspond one-to-one (even though, in those examples, they do).

I am afraid the mars rover examples have left too strong an impression on you, so let’s step away from them and think about a different example first:

A candidate may score full marks by cheating, or by working hard. Either way they get full marks, but we know the Q-values for these two actions should differ if the reward system encourages good understanding and use of knowledge. Here, both inputs s and a in Q(fresh student, working hard) matter, and that has nothing to do with being discrete or continuous.

Now, back to our mars rover:

Here, let’s not focus on how the rewards are fixed at each state - they are fixed, but that should never be the focus.

The focus is this: suppose the rewards are hidden from us (which is very realistic). How do we estimate our Q-values, how do we FIGURE OUT the rewards by exploration, and how do we improve our Q-value model with those rewards?

Some do’s and don’ts:

  1. we don’t say the Q-value of being in state 6 is 40
  2. we don’t generally know in advance what the rewards are
  3. we say the reward in state 6 is 40
  4. we say the Q-value of being in state 5 and taking the action “right” is 20.
  5. we model the Q-value, because we want to know what we should do when we are in state 5.
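To make the distinction in this list concrete, here is a minimal Python sketch that computes Q-values for a six-state rover. The reward of 40 in state 6 and the value Q(5, right) = 20 come from the list above; the reward of 100 in state 1, the zero rewards elsewhere, the discount factor of 0.5, and every name in the code are assumptions made for illustration.

```python
# Assumed setup: six states in a row, terminal states at both ends.
GAMMA = 0.5  # assumed discount factor
REWARDS = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}  # R(s)
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}  # each action moves one state over


def compute_q(n_sweeps=50):
    """Repeatedly apply Q(s, a) = R(s) + GAMMA * max Q(s', a')."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(n_sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    # Nothing follows a terminal state, so Q is just R(s).
                    q[(s, a)] = REWARDS[s]
                else:
                    s_next = s + step
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


q = compute_q()
print("R(6) =", REWARDS[6])
print("Q(5, right) =", q[(5, "right")])
print("Q(5, left) =", q[(5, "left")])
```

Under these assumptions, R is a lookup on the state alone, while Q needs both the state and the action: from state 5, going right and going left give different Q-values even though R(5) is the same in both cases.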

The formula speaks for itself, with no ambiguity: Q(s, a) is a function of s and a, whereas R is a function of s. Q and R are different.

Ask yourself: when you say “reward”, which ONE of Q(s, a), R(s), and Q(s’, a’) do you SPECIFICALLY mean? Are any concepts getting mixed up? Are Q and R really distinct in your understanding? Did you know we are modeling Q, and not R? (No need to tell us here, unless you want to :wink: )



I was wondering whether having R independent of a was a required assumption for the applicability of the approach.
Thanks for clarifying.
I guess a more detailed definition of Q would be Q(s, a) = R(s, a) + Gamma * max Q(s’, a’).

No, it is not required. If you have an example where R(s, a) is unavoidable, please share.
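As a rough illustration of why the approach does not hinge on this assumption, here is a small Python sketch of the right-hand side of the formula discussed in this thread. It only needs the reward that was actually observed for a step, so it works the same way whether that reward is thought of as R(s) or R(s, a). The function name, its parameters, and the `done` flag are hypothetical and chosen just for this example.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Return reward + gamma * max Q(s', a') for one observed step.

    reward:        the reward observed for this step (however it was defined)
    next_q_values: current Q estimates for every action a' in the next state s'
    gamma:         discount factor
    done:          True if the episode ended, so there is no next state to value
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Reward that depends only on the state (rover-style): zero now, 40 available next.
print(bellman_target(0.0, [40.0, 40.0], gamma=0.5))

# Reward that also reflects the action taken (a fuel penalty of -0.3).
print(bellman_target(-0.3, [1.0, 2.0], gamma=0.5))
```

The Q model is what gets learned in both cases; the reward is just a number fed into the target.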

So I think we agree: the reward R may, and in most cases does, depend on the chosen action a.
So, at each step, we are maximizing the return Q over a. I guess this is the recursive nature of the Bellman equation: to solve max Q(s, a), we need to solve max Q(s’, a’).

No, s’ depends on s and a. R depends on s.

I still feel that there is some mixing up between Q and R.

The Q-value function is the function we create to model how good an action is. R does not play this role.

OK. Well now I am confused.
In the continuous lunar lander example Andrew says: “To encourage it not to waste too much fuel and fire thrusters than necessary, every time it fires the main engine we give it a -0.3 rewards and every time it fires the left or the right side thrusters we give it a -0.03 reward.”

You can treat the number of times each of those engines is fired as components of the state vector. Alternatively, you might treat the fuel level or fuel consumption as one component of the state vector, with firing the main engine consuming 10 times as much fuel as firing either side thruster.
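The second alternative can be sketched in Python as follows. This is only an illustration covering the fuel term: the penalties 0.3 and 0.03 come from the quote above, while the `State` class, the `last_burn` component, the action names, and the starting fuel level are assumptions invented for this example.

```python
from dataclasses import dataclass

# Fuel burned per action; the values reuse the penalties quoted from the lecture.
BURN = {"nothing": 0.0, "left": 0.03, "main": 0.3, "right": 0.03}


@dataclass(frozen=True)
class State:
    fuel_left: float  # hypothetical fuel-level component
    last_burn: float  # hypothetical component: fuel used by the latest action


def reward_from_action(action):
    """Action-based view: the penalty is read off the action taken."""
    return -BURN[action]


def transition(state, action):
    """Move to the next state, recording the fuel burned inside the state."""
    burn = BURN[action]
    return State(fuel_left=state.fuel_left - burn, last_burn=burn)


def reward_from_state(state):
    """State-based view: the penalty is read off the state alone."""
    return -state.last_burn


def compare_rewards(actions, initial_fuel=10.0):
    """Total fuel penalty under both views for the same action sequence."""
    state = State(fuel_left=initial_fuel, last_burn=0.0)
    total_by_action = 0.0
    total_by_state = 0.0
    for action in actions:
        total_by_action += reward_from_action(action)
        state = transition(state, action)
        total_by_state += reward_from_state(state)
    return total_by_action, total_by_state


by_action, by_state = compare_rewards(["main", "left", "nothing", "right", "main"])
print(by_action, by_state)
```

Both views charge the same penalty at every step, so the choice between them is about how the state is defined, not about what the agent is rewarded for.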


Alright. If you have more examples, we can discuss them.

No one can stop you from designing a reward system that rewards the action in addition to the outcome, if you think it is justifiable. Just be careful not to lead the agent to think that merely taking the action is good enough.

Also, when your environment is probabilistic, meaning that taking the action of going left does not necessarily result in going left, the case for an action-based reward system may be even weaker.

Thanks. I also like the alternative of including this “reward” in the state.