Thanks for the response @Alireza_Saei. I understand the concept but don’t understand how to compute this.

Let’s take an example of 4 states:

State 1: Reward = 100 (terminal)

State 2: Reward = 0

State 3: Reward = 0

State 4: Reward = 40 (terminal)

| State  | 1   | 2 | 3 | 4  |
|--------|-----|---|---|----|
| Reward | 100 | 0 | 0 | 40 |

Also, assume that the discount factor is 0.5 and the mis-step probability is 0.1.

So,

Q(1, left) = 100

Q(1, right) = 100

Q(4, left) = 40

Q(4, right) = 40

Q(2, left) = 0 + 0.5 * [0.9 * max(Q(1, left), Q(1, right)) + 0.1 * max(Q(3, left), Q(3, right))]

Q(2, right) = 0 + 0.5 * [0.9 * max(Q(3, left), Q(3, right)) + 0.1 * max(Q(1, left), Q(1, right))]

Q(3, left) = 0 + 0.5 * [0.9 * max(Q(2, left), Q(2, right)) + 0.1 * max(Q(4, left), Q(4, right))]

Q(3, right) = 0 + 0.5 * [0.9 * max(Q(4, left), Q(4, right)) + 0.1 * max(Q(2, left), Q(2, right))]

Now,

max(Q(1, left), Q(1, right)) = 100

max(Q(4, left), Q(4, right)) = 40

So,

Q(2, left) = 45 + 0.05 * max(Q(3, left), Q(3, right))

Q(2, right) = 5 + 0.45 * max(Q(3, left), Q(3, right))

Q(3, left) = 2 + 0.45 * max(Q(2, left), Q(2, right))

Q(3, right) = 18 + 0.05 * max(Q(2, left), Q(2, right))

How do I calculate the remaining values?

Please let me know if I’m completely wrong about this and there’s an easy way to find the values.
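
For what it's worth, one thing I could try is to start the unknown values at zero and keep re-applying the four equations above until the numbers stop changing. This is only a minimal sketch, assuming that this kind of repeated substitution is valid here; the function name `solve_q` and the variable names are my own, not from the course.

```python
GAMMA = 0.5      # discount factor from the example
P_MISSTEP = 0.1  # mis-step probability from the example


def solve_q(tol=1e-9, max_iters=1000):
    """Repeatedly apply the four equations until the values settle."""
    # Terminal states: Q equals the reward regardless of the action.
    v1, v4 = 100.0, 40.0
    # v2 and v3 stand for max(Q(s, left), Q(s, right)); start them at zero.
    v2, v3 = 0.0, 0.0
    p_ok = 1.0 - P_MISSTEP
    q = {}
    for _ in range(max_iters):
        q = {
            (2, "left"): GAMMA * (p_ok * v1 + P_MISSTEP * v3),
            (2, "right"): GAMMA * (p_ok * v3 + P_MISSTEP * v1),
            (3, "left"): GAMMA * (p_ok * v2 + P_MISSTEP * v4),
            (3, "right"): GAMMA * (p_ok * v4 + P_MISSTEP * v2),
        }
        new_v2 = max(q[(2, "left")], q[(2, "right")])
        new_v3 = max(q[(3, "left")], q[(3, "right")])
        # Stop once neither state value moves by more than tol.
        if abs(new_v2 - v2) < tol and abs(new_v3 - v3) < tol:
            break
        v2, v3 = new_v2, new_v3
    return q


if __name__ == "__main__":
    for (state, action), value in sorted(solve_q().items()):
        print(f"Q({state}, {action}) = {value:.4f}")
```

If I also guess that `left` is the better action in both states 2 and 3, the equations become linear (Q(3, left) = 2 + 0.45 * (45 + 0.05 * Q(3, left))), which gives about 22.76 for Q(3, left) and about 46.14 for Q(2, left) by hand, so the two approaches can be checked against each other.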