How to calculate Q(s,a) in a stochastic environment?

Here’s the link to the video I’m talking about: https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning/lecture/rL525/random-stochastic-environment-optional

The video explains that when there is a small chance of a mis-step, you calculate Q(s, a) = R(s) + (gamma) * avg(Q(s’, a’)).

How is avg(Q(s’, a’)) calculated?

Hi @devashishdxt

The average Q-value for the next state-action pairs is calculated by considering the expected value of Q(s’, a’) over all possible next states and actions, weighted by their probabilities of occurrence:

avg(Q(s', a')) = \sum_{s', a'} P(s', a' | s, a) \times Q(s', a')

This weighted sum gives the average Q-value for the next state-action pairs.
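For reference, here is a sketch of the same expectation written out, assuming the agent picks the best action once it lands in the next state, so the average runs over next states s' only and a' is chosen by a max. The transition probability P(s' | s, a) is notation assumed here for illustration, not something taken from the video:

```latex
Q(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```

With a mis-step probability p, the sum has just two terms: (1 - p) times the max at the intended next state, plus p times the max at the state you end up in after a mis-step.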

Hope this helps!

Thanks for the response @Alireza_Saei. I understand the concept but don’t understand how to compute this.

Let’s take an example of 4 states:

State 1: Reward = 100 (terminal)
State 2: Reward = 0
State 3: Reward = 0
State 4: Reward = 40 (terminal)

| State  | 1   | 2 | 3 | 4  |
|--------|-----|---|---|----|
| Reward | 100 | 0 | 0 | 40 |

Also, assume that the discount factor is 0.5 and the mis-step probability is 0.1.

So,
Q(1, left) = 100
Q(1, right) = 100

Q(4, left) = 40
Q(4, right) = 40

Q(2, left) = 0 + 0.5 * [0.9 * max(Q(1, left), Q(1, right)) + 0.1 * max(Q(3, left), Q(3, right))]
Q(2, right) = 0 + 0.5 * [0.9 * max(Q(3, left), Q(3, right)) + 0.1 * max(Q(1, left), Q(1, right))]

Q(3, left) = 0 + 0.5 * [0.9 * max(Q(2, left), Q(2, right)) + 0.1 * max(Q(4, left), Q(4, right))]
Q(3, right) = 0 + 0.5 * [0.9 * max(Q(4, left), Q(4, right)) + 0.1 * max(Q(2, left), Q(2, right))]

Now,

max(Q(1, left), Q(1, right)) = 100
max(Q(4, left), Q(4, right)) = 40

So,

Q(2, left) = 45 + 0.05 * max(Q(3, left), Q(3, right))
Q(2, right) = 5 + 0.45 * max(Q(3, left), Q(3, right))

Q(3, left) = 2 + 0.45 * max(Q(2, left), Q(2, right))
Q(3, right) = 18 + 0.05 * max(Q(2, left), Q(2, right))

How do I calculate the remaining values?

Please let me know if I’m completely wrong about this and there’s an easier way to find the values.

You are correct. To compute the remaining values, you apply the Bellman equation repeatedly until the Q-values stop changing (converge).

In each iteration, you update the Q-value of every state-action pair using the Bellman equation:

Q(s, a) = R(s) + γ × avg(Q(s', a'))

In the first iteration, we start from the initialized Q-values (any initial guess works, e.g. zeros for the non-terminal states). Each pass then plugs the current values into the right-hand side of the Bellman equation to refine the estimates. I can show you how the calculation is done if you want!
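As a minimal sketch of that iteration for the 4-state example above (rewards 100, 0, 0, 40, discount factor 0.5, mis-step probability 0.1): the names `q_iteration`, `REWARD`, `P_MISSTEP` and the stopping tolerance are assumptions made for illustration, and a mis-step is assumed to move the agent in the opposite direction, as in the equations earlier in the thread.

```python
# Assumed setup, taken from the example in this thread:
# states 1..4, states 1 and 4 are terminal.
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 40.0}
TERMINAL = {1, 4}
ACTIONS = {"left": -1, "right": +1}
GAMMA = 0.5
P_MISSTEP = 0.1  # probability of moving opposite to the chosen action


def q_iteration(tol=1e-9, max_iters=1000):
    """Repeat the Bellman update until no Q-value changes by more than tol."""
    # Start from zeros; terminal states just keep their reward.
    q = {(s, a): 0.0 for s in REWARD for a in ACTIONS}
    for s in TERMINAL:
        for a in ACTIONS:
            q[(s, a)] = REWARD[s]

    for _ in range(max_iters):
        new_q = dict(q)
        for s in REWARD:
            if s in TERMINAL:
                continue
            for a, step in ACTIONS.items():
                intended, slipped = s + step, s - step
                # Best achievable value from each possible next state.
                v_intended = max(q[(intended, b)] for b in ACTIONS)
                v_slipped = max(q[(slipped, b)] for b in ACTIONS)
                # Probability-weighted average over the two next states.
                expected = (1 - P_MISSTEP) * v_intended + P_MISSTEP * v_slipped
                new_q[(s, a)] = REWARD[s] + GAMMA * expected
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break
    return q


if __name__ == "__main__":
    for (s, a), value in sorted(q_iteration().items()):
        print(f"Q({s}, {a}) = {value:.2f}")
```

Under these assumptions the values settle at roughly Q(2, left) ≈ 46.14, Q(2, right) ≈ 15.24, Q(3, left) ≈ 22.76 and Q(3, right) ≈ 20.31. You can check them against the simplified equations: for instance, Q(2, left) = 45 + 0.05 × 22.76 ≈ 46.14.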