Thanks for the response @Alireza_Saei. I understand the concept but don’t understand how to compute this.
Let’s take an example of 4 states:
State 1: Reward = 100 (terminal)
State 2: Reward = 0
State 3: Reward = 0
State 4: Reward = 40 (terminal)
| States | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Reward | 100 | 0 | 0 | 40 |
Also, assume that the discount factor is 0.5 and the mis-step probability is 0.1.
So,
Q(1, left) = 100
Q(1, right) = 100
Q(4, left) = 40
Q(4, right) = 40
Q(2, left) = 0 + 0.5 * [0.9 * max(Q(1, left), Q(1, right)) + 0.1 * max(Q(3, left), Q(3, right))]
Q(2, right) = 0 + 0.5 * [0.9 * max(Q(3, left), Q(3, right)) + 0.1 * max(Q(1, left), Q(1, right))]
Q(3, left) = 0 + 0.5 * [0.9 * max(Q(2, left), Q(2, right)) + 0.1 * max(Q(4, left), Q(4, right))]
Q(3, right) = 0 + 0.5 * [0.9 * max(Q(4, left), Q(4, right)) + 0.1 * max(Q(2, left), Q(2, right))]
Now,
max(Q(1, left), Q(1, right)) = 100
max(Q(4, left), Q(4, right)) = 40
So,
Q(2, left) = 45 + 0.05 * max(Q(3, left), Q(3, right))
Q(2, right) = 5 + 0.45 * max(Q(3, left), Q(3, right))
Q(3, left) = 2 + 0.45 * max(Q(2, left), Q(2, right))
Q(3, right) = 18 + 0.05 * max(Q(2, left), Q(2, right))
How do I calculate the remaining values?
Please let me know if I’m completely wrong about this and there’s an easy way to find the values.
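One thing that might work, assuming the equations above are set up correctly, is to start every unknown Q value at 0 and keep re-applying the four equations until the numbers stop changing (repeated substitution). Below is a minimal sketch of that idea. The function name `solve_q`, the tolerance, and the iteration cap are my own placeholders, not something from the course material, and I'm not sure this is the intended method.

```python
def solve_q(gamma=0.5, p_misstep=0.1, tol=1e-9, max_iters=1000):
    """Iterate the Q equations for states 2 and 3 until they stop changing.

    Assumptions (taken from the example above): states 1 and 4 are terminal
    with values 100 and 40, and the reward in states 2 and 3 is 0.
    """
    v1, v4 = 100.0, 40.0          # terminal values: max(Q(1, .)) and max(Q(4, .))
    p_ok = 1.0 - p_misstep        # probability the move goes as intended

    # Initial guess: all unknown Q values start at 0.
    q = {(2, "left"): 0.0, (2, "right"): 0.0,
         (3, "left"): 0.0, (3, "right"): 0.0}

    for _ in range(max_iters):
        # Best value currently achievable from states 2 and 3.
        v2 = max(q[(2, "left")], q[(2, "right")])
        v3 = max(q[(3, "left")], q[(3, "right")])

        # Re-apply the four equations using the current estimates.
        new_q = {
            (2, "left"):  gamma * (p_ok * v1 + p_misstep * v3),
            (2, "right"): gamma * (p_ok * v3 + p_misstep * v1),
            (3, "left"):  gamma * (p_ok * v2 + p_misstep * v4),
            (3, "right"): gamma * (p_ok * v4 + p_misstep * v2),
        }

        # Largest change in any Q value during this sweep.
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break

    return q


q = solve_q()
for (state, action), value in sorted(q.items()):
    print(f"Q({state}, {action}) = {value:.4f}")
```

With these numbers the iteration seems to settle at roughly Q(2, left) ≈ 46.14, Q(2, right) ≈ 15.24, Q(3, left) ≈ 22.76 and Q(3, right) ≈ 20.31. That would also match solving the equations by hand: if "left" is the better action in both states 2 and 3, the max terms drop out and only two linear equations in two unknowns remain.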