Hi @ljb1706,
I can sense two potential problems in your question.
- We always model the Q-value, not the reward. They are not identical - the former is what we anticipate, while the latter is what we actually get.
- The lectures use the mars rover examples to deliver the idea of how we compute Q-values, not to demonstrate that rewards and states correspond one-to-one (even though, in those examples, they do).
I am afraid the mars rover examples have left you with too strong an impression, so let’s step away from them and think about a different example first:
A candidate may score full marks by cheating, or by working hard. Either way they get full marks, but we know the Q-values for these two actions should be different if the reward system encourages good understanding and use of knowledge. Here, we see that both inputs, s and a, in Q(fresh student, hard-working) matter, and that has nothing to do with the states being discrete or continuous.
Now, back to our mars rover:
Here, let’s not focus on how the rewards are fixed at each state - they are fixed, but that should never be the focus.
The focus is this: assume the rewards are hidden from us (which is very realistic). How do we anticipate our Q-values, how do we FIGURE OUT the rewards by exploration, and how do we improve our Q-value model with those rewards?
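To illustrate that loop, here is a minimal sketch, not the course's implementation. It uses a tabular Q-learning style update with purely random exploration. The six-state layout, the terminal states 1 and 6, the reward of 100 in state 1, the discount factor of 0.5, the learning rate, and all function names are my assumptions for illustration; only the reward of 40 in state 6 comes from this thread. The agent never reads the reward table directly; it only sees a reward when it visits a state.

```python
import random

GAMMA = 0.5   # assumed discount factor
ALPHA = 0.5   # assumed learning rate
ACTIONS = ("left", "right")
TERMINAL = {1, 6}  # assumed terminal states

# Hidden from the agent: it only learns R(s) by actually visiting s.
# Every value except R(6) = 40 is an assumption made for this sketch.
_HIDDEN_REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}


def observe_reward(state):
    """The environment reveals the reward of the state we are currently in."""
    return _HIDDEN_REWARDS[state]


def move(state, action):
    """Deterministic transition: one step to the left or to the right."""
    return state - 1 if action == "left" else state + 1


def explore(n_episodes=2000, seed=0):
    rng = random.Random(seed)
    # Our initial anticipation of every Q(s, a) is simply 0.
    q = {(s, a): 0.0 for s in range(1, 7) for a in ACTIONS}
    seen_rewards = {}  # what we have figured out about R(s) so far
    for _ in range(n_episodes):
        s = rng.choice([2, 3, 4, 5])  # start in a random non-terminal state
        while True:
            r = observe_reward(s)
            seen_rewards[s] = r
            if s in TERMINAL:
                # Nothing follows a terminal state, so Q(s, a) is just R(s).
                for a in ACTIONS:
                    q[(s, a)] = float(r)
                break
            a = rng.choice(ACTIONS)  # explore with a random action
            s_next = move(s, a)
            # Improve the Q estimate using the reward we just observed.
            target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            s = s_next
    return q, seen_rewards


q, seen_rewards = explore()
print("discovered R(6):", seen_rewards[6])
print("learned Q(5, right):", round(q[(5, "right")], 2))
```

Note that the code keeps two separate tables: `seen_rewards` holds what we get, and `q` holds what we anticipate. Only `q` tells us which action to take.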
Some do’s and don’ts:
- we don’t say the Q-value of being in state 6 is 40
- we don’t generally know what the rewards are in advance
- we say the reward in state 6 is 40
- we say the Q-value of being in state 5 and taking the action “right” is 20
- we model the Q-value, because we want to know what we should do when we are in state 5
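To make the list concrete, here is a small sketch that computes the Q-values by repeatedly applying the Bellman equation. Only the reward of 40 in state 6 and the Q-value of 20 for (state 5, “right”) come from this thread; the reward of 100 in state 1, the terminal states 1 and 6, the discount factor of 0.5, and the names used are assumptions based on how I recall the lecture example.

```python
GAMMA = 0.5  # assumed discount factor
# R(s): the reward depends on the state only. Values other than R(6) are assumed.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}  # assumed terminal states
ACTIONS = ("left", "right")


def next_state(state, action):
    """Deterministic transition: one step to the left or to the right."""
    return state - 1 if action == "left" else state + 1


def compute_q(n_sweeps=50):
    """Apply Q(s, a) = R(s) + GAMMA * max_a' Q(s', a') until the table settles."""
    q = {(s, a): 0.0 for s in REWARDS for a in ACTIONS}
    for _ in range(n_sweeps):
        for s in REWARDS:
            for a in ACTIONS:
                if s in TERMINAL:
                    # Nothing follows a terminal state, so Q(s, a) is just R(s).
                    q[(s, a)] = float(REWARDS[s])
                else:
                    s2 = next_state(s, a)
                    q[(s, a)] = REWARDS[s] + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    return q


q = compute_q()
print("R(6) =", REWARDS[6])                 # a reward: one number per state
print("Q(5, right) =", q[(5, "right")])     # a Q-value: one number per (state, action)
print("Q(5, left) =", q[(5, "left")])
```

The reward table has one entry per state, while the Q table has one entry per (state, action) pair. Comparing Q(5, right) with Q(5, left) is what tells us what to do in state 5; R(5) alone cannot.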
The formula speaks for itself, with no ambiguity: Q(s, a) is a function of s and a, whereas R is a function of s alone. Q and R are different.
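For reference, the formula I have in mind is the Bellman equation, written here in the notation I believe the lectures use, where γ is the discount factor and s’ is the state we reach after taking action a in state s:

$$
Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')
$$

As a worked example, assuming γ = 0.5 and R(5) = 0 (both are my assumptions, not stated above), and using the fact that the terminal state 6 contributes only its own reward:

$$
Q(5, \text{right}) = R(5) + \gamma \, R(6) = 0 + 0.5 \times 40 = 20
$$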
Ask yourself these questions: when you say “reward”, which ONE of Q(s, a), R(s), and Q(s’, a’) do you SPECIFICALLY mean? Are you mixing up any concepts? Are Q and R really very different in your understanding? Did you know we are modeling Q, and not R? (No need to tell us here, unless you want to.)
Raymond