# How to calculate Q(s, a) in a stochastic environment?

The video explains that when there is a small chance of a mis-step, you calculate Q(s, a) = R(s) + (gamma) * avg(Q(s’, a’)).

How is avg(Q(s’, a’)) calculated?

The average is the expected value of Q(s’, a’) over all possible next states and actions, each weighted by its probability of occurring given the current state and action:

$$\text{avg}(Q(s', a')) = \sum_{s', a'} P(s', a' \mid s, a) \, Q(s', a')$$

This weighted sum gives the average Q-value for the next state-action pairs.
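
To connect this to the form used in the lecture: if we assume the agent always takes the best action from the next state onward (my reading of the setup, not something stated explicitly here), then all of the probability over a' sits on the best action and only the randomness over s' remains. Under that assumption the sum can be written as:

```latex
\text{avg}\big(Q(s', a')\big) = \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
\qquad\Longrightarrow\qquad
Q(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
```

So in practice you list the states you could land in, take the best Q-value available in each, and weight those by the landing probabilities.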

Hope this helps!

Thanks for the response @Alireza_Saei. I understand the concept but don’t understand how to compute this.

Let’s take an example of 4 states:

State 1: Reward = 100 (terminal)
State 2: Reward = 0
State 3: Reward = 0
State 4: Reward = 40 (terminal)

| State  | 1   | 2 | 3 | 4  |
|--------|-----|---|---|----|
| Reward | 100 | 0 | 0 | 40 |

Also, assume that the discount factor is 0.5 and the mis-step probability is 0.1, where a mis-step moves the agent in the direction opposite to the one chosen.

So,
Q(1, left) = 100
Q(1, right) = 100

Q(4, left) = 40
Q(4, right) = 40

Q(2, left) = 0 + 0.5 * [0.9 * max(Q(1, left), Q(1, right)) + 0.1 * max(Q(3, left), Q(3, right))]
Q(2, right) = 0 + 0.5 * [0.9 * max(Q(3, left), Q(3, right)) + 0.1 * max(Q(1, left), Q(1, right))]

Q(3, left) = 0 + 0.5 * [0.9 * max(Q(2, left), Q(2, right)) + 0.1 * max(Q(4, left), Q(4, right))]
Q(3, right) = 0 + 0.5 * [0.9 * max(Q(4, left), Q(4, right)) + 0.1 * max(Q(2, left), Q(2, right))]

Now,

max(Q(1, left), Q(1, right)) = 100
max(Q(4, left), Q(4, right)) = 40

So,

Q(2, left) = 45 + 0.05 * max(Q(3, left), Q(3, right))
Q(2, right) = 5 + 0.45 * max(Q(3, left), Q(3, right))

Q(3, left) = 2 + 0.45 * max(Q(2, left), Q(2, right))
Q(3, right) = 18 + 0.05 * max(Q(2, left), Q(2, right))

How do I calculate the remaining values?
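
One way to get the remaining values is to treat the four equations for states 2 and 3 as a fixed-point problem: start from any guess, plug the current values into the right-hand sides, and repeat until nothing changes. Here is a minimal Python sketch of that idea, assuming (as in the equations above) that a mis-step sends the agent in the opposite direction. The names `solve_q`, `next_state`, and `opposite` are just illustrative helpers, not anything from the course.

```python
# Illustrative encoding of the 4-state example from this thread.
GAMMA = 0.5        # discount factor
P_MISSTEP = 0.1    # assumed: a mis-step moves opposite to the chosen direction
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 40.0}
TERMINAL = {1, 4}
ACTIONS = ("left", "right")


def next_state(s, a):
    """State reached when action a is carried out as intended."""
    return s - 1 if a == "left" else s + 1


def opposite(a):
    return "right" if a == "left" else "left"


def solve_q(tol=1e-10, max_sweeps=1000):
    # Terminal states keep Q = R(s); non-terminal states start at 0.
    q = {(s, a): (REWARD[s] if s in TERMINAL else 0.0)
         for s in REWARD for a in ACTIONS}

    def best(state):
        # max over a' of the current estimate Q(state, a')
        return max(q[(state, b)] for b in ACTIONS)

    for _ in range(max_sweeps):
        new_q = dict(q)
        for s in REWARD:
            if s in TERMINAL:
                continue
            for a in ACTIONS:
                intended = next_state(s, a)
                slipped = next_state(s, opposite(a))
                expected = ((1 - P_MISSTEP) * best(intended)
                            + P_MISSTEP * best(slipped))
                new_q[(s, a)] = REWARD[s] + GAMMA * expected
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q  # best() reads the updated table on the next sweep
        if delta < tol:
            break
    return q


q = solve_q()
for s in (2, 3):
    for a in ACTIONS:
        print(f"Q({s}, {a}) = {q[(s, a)]:.2f}")
# Q(2, left) = 46.14
# Q(2, right) = 15.24
# Q(3, left) = 22.76
# Q(3, right) = 20.31
```

You can also check this by hand. Guess that left is the best action in both state 2 and state 3, so the two max terms become Q(3, left) and Q(2, left). Substituting Q(3, left) = 2 + 0.45 * Q(2, left) into Q(2, left) = 45 + 0.05 * Q(3, left) gives Q(2, left) = 45.1 / 0.9775 ≈ 46.14, and then Q(3, left) ≈ 22.76. The other two values come out as Q(2, right) ≈ 15.24 and Q(3, right) ≈ 20.31, both smaller than their left counterparts, so the guess was consistent.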