There is something I cannot understand about the state-action value function. Why, for the terminal states, are the Q values always just the rewards of those states? That does not match the definition of the Q function (take the action a first and then behave optimally).
In your understanding, what do you think those Q values should be instead? Please show the calculation steps if any.
Well, if I follow the steps shown in the videos, then Q(6,←) in the first picture, for example, would be:
Q(6,←) = 40 + 0.5×0 + 0.5²×40 = 50, because the reward in our current state is 40; then we take the action "go left", which gives the 0.5×0 term; then we behave optimally, i.e. we get back to state 6, but now with the discount factor squared (0.5²×40).
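Just to make the arithmetic of that hypothetical return explicit (assuming the lecture's discount factor γ = 0.5, terminal reward 40 at state 6, and reward 0 at state 5):

```python
# Hypothetical return if we could leave terminal state 6, land in
# state 5 (reward 0), and then come back to state 6 (reward 40).
gamma = 0.5
reward_state_6 = 40   # terminal reward in the lecture example
reward_state_5 = 0    # intermediate state reward

q = reward_state_6 + gamma * reward_state_5 + gamma**2 * reward_state_6
print(q)  # 50.0
```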
I see, so we leave the terminal state 6 and get to state 5, and then go back to the terminal state 6.
However, if we can leave the terminal state once, then, after going back to the terminal state, can we leave it and go back again, and repeat this infinitely many times?
Mathematically, repeating that wouldn't be a problem, but do we need it? The answer is no: reaching a terminal state means the journey has ended and there is no more moving.
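You can see this convention in action in a small tabular Q-value iteration sketch. I'm assuming the rover-style setup from the videos: six states in a row, state 1 terminal with reward 100, state 6 terminal with reward 40, all other rewards 0, γ = 0.5, deterministic left/right moves. The key line is the terminal check: once the episode ends, no further action is taken, so Q(terminal, a) is just the terminal reward for every action a.

```python
gamma = 0.5
rewards = [100, 0, 0, 0, 0, 40]   # R(s) for states 1..6 (indices 0..5)
terminal = {0, 5}                 # states 1 and 6 are terminal
actions = {"left": -1, "right": +1}

Q = {(s, a): 0.0 for s in range(6) for a in actions}

for _ in range(50):               # plenty of sweeps to converge here
    new_Q = {}
    for s in range(6):
        for a, step in actions.items():
            if s in terminal:
                # Episode is over: no more moving, so the value of
                # taking any action "here" is just the terminal reward.
                new_Q[(s, a)] = rewards[s]
            else:
                s2 = min(max(s + step, 0), 5)   # deterministic move
                new_Q[(s, a)] = rewards[s] + gamma * max(
                    Q[(s2, b)] for b in actions
                )
    Q = new_Q

print(Q[(0, "left")], Q[(0, "right")])  # prints: 100 100
print(Q[(5, "left")], Q[(5, "right")])  # prints: 40 40
```

If you delete the terminal check and let the agent bounce back and forth forever, the recursion still converges (because γ < 1), but it no longer describes episodes that actually end, which is what gives values like 50 instead of 40.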
OK, so if I understand correctly: by convention, once we are at a terminal state we are done and we don't do anything, and only in that case do we ignore the part of the Q function's definition that says "take action a". I will add another point: there is a kind of recursion in the Bellman equation, and it fully makes sense, I guess, because we need an evident case for the Q function. Is that correct?
What do you mean by that, an "evident case" for the Q function?