Hello,
There is something I cannot understand about the state-action value function. Why, for the terminal states, are the values of Q always just the rewards of those states? It does not seem to match the definition of the Q function (take the action ‘a’ first and then behave optimally).
Hello @abdou_brk,
In your understanding, what do you think those Q values should be instead? Please show the calculation steps if any.
Cheers,
Raymond
Well, if I follow the steps shown in the videos, Q(6,←) in the first picture, for example, will be:
Q(6,←) = 40 + 0.5x0 + (0.5²)x40 = 50, because the reward is 40 in our current state, then we take the action “go left”, which corresponds to 0.5x0, and then we behave optimally, i.e. we get back to state 6, but now with the discount factor squared (0.5²x40).
Hello @abdou_brk,
I see, so we leave the terminal state 6 and get to state 5, and then go back to the terminal state 6.
However, if we can leave the terminal state once, then, after going back to the terminal state, can we leave it and go back again, and repeat this infinitely many times?
Mathematically, repeating that won’t be a problem, but do we need it? The answer is no, because reaching the terminal state here means that the journey has ended and there is no more moving.
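To put that in symbols, here is a quick sketch using the lecture’s form of the Bellman equation (γ is the discount factor and s' the next state):

```latex
% Non-terminal state s: collect R(s), then behave optimally from the next state s'
Q(s,a) = R(s) + \gamma \max_{a'} Q(s',a')

% Terminal state s: the episode ends, so there is no future term
Q(s,a) = R(s)
```

So, for the terminal state 6, Q(6, a) = R(6) = 40 for every action a.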
Agree?
Cheers,
Raymond
OK, so if I understand, by convention when we are at a terminal state we are done and we don’t do anything, and only in that case do we ignore the point in the definition of the Q function which says “take action a”. I will add another point: there is a kind of recursion in the Bellman equation, and it fully makes sense, I guess, because we need an evident case for the Q function. Is that correct?
Yes!
What do you mean by that - evident case for Q function?
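If you mean that the terminal Q values act as the base case that anchors the recursion, then here is a minimal sketch of that idea (just my illustration, assuming the 6-state rover example with rewards 100 and 40 at the two ends and γ = 0.5):

```python
# Minimal sketch: repeatedly apply the Bellman update until the Q values settle.
# States are 1..6; states 1 and 6 are terminal; actions are "left" and "right".
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
terminal = {1, 6}
gamma = 0.5
actions = {"left": -1, "right": +1}

Q = {(s, a): 0.0 for s in rewards for a in actions}
for _ in range(50):  # plenty of sweeps for this tiny problem to converge
    for s in rewards:
        for a, step in actions.items():
            if s in terminal:
                # "Evident" / base case: the episode ends here, so Q is just the reward.
                Q[(s, a)] = rewards[s]
            else:
                # Recursive case: reward now, then behave optimally from the next state.
                s_next = s + step
                Q[(s, a)] = rewards[s] + gamma * max(Q[(s_next, b)] for b in actions)

print(Q[(6, "left")])   # 40.0 -- terminal state: just its reward
print(Q[(2, "right")])  # 12.5 -- the hand-written example on the slide
```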
Cheers,
Raymond
@rmwkwok
I hope you don’t mind me jumping in and reviving an old post, but I have a question about the exact same slide.
In the lecture screenshot above, in the bottom-right blue hand-written examples, we see that Q = the immediate reward for being in the starting state + the discounted sum of future rewards when acting optimally.
So Q(2,right) = reward in state 2 + 0.5 * reward in state 3 + 0.5^2 * reward in state 2 + 0.5^3 * reward in state 1.
But the definition I see in most other material is that the immediate reward R(s,a) is the reward for starting in state s and taking action a. This means that almost all other definitions I see (other than the ones quoting this course) would write Q(2,right) as
Q(2,right) = reward in state 3 + 0.5 * reward in state 2 + 0.5^2 * reward in state 1
In other words, the first reward you account for is the one you get for taking that specific action (so in this case, it’s the zero in state 3).
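To make the two versions concrete, here is a quick sketch of how I would compute Q(2, right) both ways (assuming the 6-state rover from the slide, with rewards 100 and 40 at the two ends, γ = 0.5, and the trajectory 2 → 3 → 2 → 1; the variable names are just mine):

```python
gamma = 0.5
reward = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}

# States visited when starting in state 2, going right once, then acting optimally.
trajectory = [2, 3, 2, 1]

# Course convention: the first reward counted is the one of the starting state.
q_course = sum(gamma**t * reward[s] for t, s in enumerate(trajectory))

# "Reward after action" convention: the first reward counted is the one received
# after the first action, i.e. the rewards from the next state onwards.
q_other = sum(gamma**t * reward[s] for t, s in enumerate(trajectory[1:]))

print(q_course)  # 12.5  (0 + 0.5*0 + 0.25*0 + 0.125*100)
print(q_other)   # 25.0  (0 + 0.5*0 + 0.25*100)
```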
Can you help me understand where the inconsistency is coming from, or if I’m seriously misunderstanding something? I’m at the point where these details matter to me, especially when moving on to more advanced methods like PPO, where you’re calculating advantage functions, rewards-to-go, etc.
Thank you so much!
Hello, @BikerS,
I would like to bring us to a discussion rather than to an answer. Here are the two sides from which I will consider this:
- The meaning of R(s = 1) = 100: (1) you get 100 for taking whatever action from there? Or, as you said, (2) you get 100 for being there? In these two interpretations, we have action-taking versus being. I believe the former interpretation fits “those most other definitions” you mentioned. Here are two pieces of info to consider:
- In this lecture, from 2:10, Andrew explains the number as “reward at state S is XX”. This explanation does not rule out either of the two interpretations.
- The lectures write the reward R(s) as a function of s. This seems to suggest “being”. However, we also see R(s, a), which implies that the reward is the result of an action; but since a = \pi(s), R(s, \pi(s)) is still a function of s. In this sense, the use of R(s) does not rule out either interpretation, either.
From the above, we can interpret the lecture in a way that is consistent with “those other definitions”, by calling those reward values the rewards received after taking any action from the respective states.
- A rather practical side is to consider how the environment gives us rewards.
Let me quote from Sutton and Barto’s “Reinforcement Learning: An Introduction” (btw, it is a nice book):
Apparently, the description sides with “those other definitions”: the reward comes after the action. Additionally, the figure shows an R_t for each S_t, but clearly, to be consistent with the description, R_0 should not exist. Now, what I want to ask is: would there be an environment that always gives us an R_t for each S_t, including t = 0? In other words, would initial rewards matter? If they do, and R_0 can’t be an immediate reward, then we lose this information forever.
So this second side depends on the environment itself. For problems where the agent always starts from the same state with the same initial reward, R_0 is unnecessary, so everything can be “reward after action”. However, even if the initial reward can differ, if it is not the result of an action taken by the agent, is it still necessary, given that we are modeling the agent’s value function? Now it seems that R_0 is unnecessary either way, but I prefer to keep things open, as I may just not be experienced enough to have come across such a situation.
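Just to relate the two bookkeepings in symbols (a sketch; I write Q_lecture for the course’s convention, Q_other for the “reward after action” one, and s, s_1, s_2, … for the states along the same trajectory):

```latex
Q_\text{lecture}(s,a) = R(s) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots
Q_\text{other}(s,a)   =        R(s_1) + \gamma  R(s_2) + \dots
% Hence the two differ only by the action-independent R(s) and one factor of \gamma:
Q_\text{lecture}(s,a) = R(s) + \gamma \, Q_\text{other}(s,a)
```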
All in all, in the spirit of a discussion (and not as an answer), I hope the first side is sufficient for us to feel at ease with both the lecture and the common definition (as in Sutton and Barto), whereas the second side might leave us some room for R_0.
Cheers,
Raymond
Very nice discussion. Thank you!
I guess my takeaway is that the answer comes down to a “matter of self-consistent definitions” when applying these. So as long as we’re consistent with our definitions in a particular implementation, then, due to the recursive nature of the Bellman equations, the algorithms will work out.
Basically, whether or not we start the accounting at state zero, with the “cake” the agent had right from the start, will not change the agent’s behavior when seeking the rewards from t = 1 onwards.
It’s like knocking one power of gamma off everywhere in the expected returns. The gradient still pushes the agent toward the other returns.
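In symbols (just a sketch, reusing the Q_lecture / Q_other notation from above): since the starting reward does not depend on the action taken, the greedy action comes out the same either way:

```latex
\arg\max_a Q_\text{lecture}(s,a)
  = \arg\max_a \big[ R(s) + \gamma \, Q_\text{other}(s,a) \big]
  = \arg\max_a Q_\text{other}(s,a)
```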
The Bellman equation takes all usable rewards into account, only discounting them differently.
For me, to think about it in terms of “self-consistency”, I would add that “useful rewards are not missed out”. I would ask myself: if “self-consistency” is the top priority, does that mean I could knock off some R_t consistently?
I am sure you didn’t mean to drop useful rewards, or you may say no to my question above, but I want to use this post to emphasize the part about “useful rewards”. This is also why I said that if the initial reward is the same for every episode, then we could “knock it off”, or, if the initial reward is not the result of the agent’s action, it is questionable whether we need to keep it.
We want as much information as possible to be injected into the model.
Happy to have discussed with you, @BikerS
Cheers,
Raymond