Lunar lander reward

ljb1706 · November 11, 2023, 2:38pm

Hi,

With the discrete state version, the reward only depends on s. But with the continuous version, the reward depends on the action taken.
Could you comment on that?
Also in Q(s, a) = R(s) + Gamma * max Q(s’, a’), does this reward apply to s’ or s?

Thanks,

Laurent.

rmwkwok · November 12, 2023, 12:43am

Hi @ljb1706,

I sense two potential problems in your question.

We always model the Q-value, not the reward. They are not identical: the former is what we anticipate, while the latter is what we get.
The lectures use the Mars rover examples to convey how we compute Q-values, not to demonstrate that rewards and states correspond one-to-one (even though they do in those examples).

I am afraid the Mars rover examples have given you too strong an impression, so let’s step away from them and think about a different example first:

A candidate may score full marks by cheating, or by working hard. Either way they get full marks, but we know the Q-values for these two actions should differ if the reward system encourages good understanding and use of knowledge. Here, we see that both inputs, s and a, in Q(fresh student, hard-working) matter, and it has nothing to do with being discrete or continuous.

Now, back to our Mars rover:

Here, let’s not focus on how the rewards are fixed at each state. They are fixed, but that should never be the focus.

The focus is this: suppose the rewards are hidden from us (which is very realistic). How do we anticipate our Q-values, how do we FIGURE OUT the rewards by exploration, and how do we improve our Q-value model with those rewards?

Some do’s and don’ts:

we don’t say the Q-value of being in state 6 is 40
we don’t generally know the rewards in advance
we say the reward in state 6 is 40
we say the Q-value of being in state 5 and taking the action “right” is 20
we model the Q-value, because we want to know what we should do when we are in state 5

The formula speaks for itself - no ambiguity - Q(s, a) is a function of s and a, whereas R is a function of s. Q and R are different.
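To make the difference between R and Q concrete, here is a minimal sketch of the formula applied to a six-state chain. The reward table, the discount factor of 0.5, the terminal states, and all names in the code are assumptions chosen for illustration, not values stated in this thread. Note that R is evaluated at the current state s, as the formula is written.

```python
# Toy Mars-rover-style chain. Every number and name here is an illustrative assumption.
GAMMA = 0.5
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # R(s): depends on the state only
TERMINAL = {1, 6}
ACTIONS = {"left": -1, "right": +1}


def q_table(sweeps=50):
    """Repeatedly apply Q(s, a) = R(s) + GAMMA * max_a' Q(s', a') until it settles."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(sweeps):
        for s in REWARDS:
            for a, step in ACTIONS.items():
                if s in TERMINAL:
                    # Nothing follows a terminal state, so Q is just R(s).
                    q[(s, a)] = float(REWARDS[s])
                else:
                    s_next = s + step
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


Q = q_table()
print(REWARDS[6])       # the reward in state 6
print(Q[(5, "right")])  # the Q-value of being in state 5 and going right
print(Q[(5, "left")])   # the Q-value of being in state 5 and going left
```

Under these assumptions the reward in state 5 is 0 for both actions, yet the two Q-values for state 5 differ, which is the sense in which Q depends on s and a while R depends on s.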

Ask yourself these questions: when you say “reward”, which ONE of Q(s, a), R(s), and Q(s’, a’) do you SPECIFICALLY mean? Is there any mixing up of concepts? Are Q and R really distinct in your understanding? Did you know we are modeling Q, and not R? (No need to tell us here, unless you want to.)

Raymond

ljb1706 · November 12, 2023, 6:41am

I was wondering whether having R independent of a was a required assumption for the applicability of the approach.
Thanks for clarifying.
I guess a more detailed definition of Q would be Q(s, a) = R(s, a) + Gamma * max Q(s’, a’).
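
For reference, the two forms can be written side by side. This is a sketch that assumes deterministic transitions, with s' the state reached from s by taking a. The first line is the form from the lectures and the second is the variant proposed in this post; the second reduces to the first whenever R(s, a) takes the same value for every a.

```latex
\begin{align}
Q(s, a) &= R(s) + \gamma \max_{a'} Q(s', a') \\
Q(s, a) &= R(s, a) + \gamma \max_{a'} Q(s', a')
\end{align}
```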

rmwkwok · November 12, 2023, 6:49am

No, it is not required. If you have an example where R(s, a) is unavoidable, please share.

ljb1706 · November 12, 2023, 7:00am

So I think we agree: the reward R may, and in most cases does, depend on the chosen action a.
So, at each step, we are maximizing the return Q over a. I guess this is the recursive nature of the Bellman equation: To solve max Q(s, a), we need to solve max of Q(s’, a’).

rmwkwok · November 12, 2023, 7:01am

No, s’ depends on s and a. R depends on s.

rmwkwok · November 12, 2023, 7:11am

I still feel that there is some mixing up between Q and R.

The Q-value function is the function we create to model how good an action is. R does not take this role.

ljb1706 · November 12, 2023, 7:12am

OK. Well now I am confused.
In the continuous lunar lander example Andrew says: “To encourage it not to waste too much fuel and fire thrusters than necessary, every time it fires the main engine we give it a -0.3 rewards and every time it fires the left or the right side thrusters we give it a -0.03 reward.”

rmwkwok · November 12, 2023, 7:12am

You can consider the number of times each of those engines has been fired as components of the state vector. Alternatively, you might consider the fuel level or fuel consumption as one component of the state vector, with firing the main engine using 10 times as much fuel as firing either side thruster.
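
The idea of folding the engine firings into the state can be sketched as follows. This is a toy illustration, not the actual Lunar Lander environment: the penalties are the ones from the lecture quote above, while the function names, the "last engine fired" state component, and the omitted physics are all hypothetical. It only shows that an action-dependent reward R(s, a) and a state-only reward read off the augmented next state give the same totals.

```python
# Penalties from the lecture quote; the action names are hypothetical labels.
FUEL_PENALTY = {"nothing": 0.0, "left": -0.03, "main": -0.3, "right": -0.03}


def reward_sa(state, action):
    """Action-dependent form R(s, a): penalize the engine being fired."""
    return FUEL_PENALTY[action]


def step_augmented(state, action):
    """Toy transition: the next state records which engine was just fired."""
    physical, _last_fired = state
    return (physical, action)  # physical dynamics are omitted in this sketch


def reward_s(state):
    """State-only form: read the penalty off the augmented state."""
    _physical, last_fired = state
    return FUEL_PENALTY[last_fired]


# Accumulate both reward forms over the same sequence of actions.
state = ("some physical state", "nothing")
total_sa = 0.0
total_s = 0.0
for action in ["main", "left", "main", "nothing"]:
    total_sa += reward_sa(state, action)
    state = step_augmented(state, action)
    total_s += reward_s(state)

print(total_sa == total_s)
```

The trade-off in this sketch is a larger state: the agent's input now carries the extra component, in exchange for keeping the reward a function of state alone.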

rmwkwok · November 12, 2023, 7:25am

Alright. If you have more examples, we can discuss them.

No one can stop you from designing a reward system that rewards the action in addition to the outcome, if you think it is justifiable, but be careful not to lead the agent to think that just taking the action is good enough.

Also, when your environment is probabilistic, meaning that taking the action of going left doesn’t necessarily result in going left, the motivation for designing an action-based reward system may be even weaker.
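
One common way to write the update for such a probabilistic environment is to average over the possible next states. This is a sketch in which P(s' | s, a) is an assumed notation for the transition probabilities; it is not a formula given in this thread.

```latex
\begin{equation}
Q(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')
\end{equation}
```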

ljb1706 · November 12, 2023, 7:28am

Thanks. I also like the alternative to include this “reward” in the state.
