How to calculate Q(s,a) in stochastic environment?

devashishdxt · May 9, 2024, 1:41am

Here’s the link to the video I’m talking about: https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning/lecture/rL525/random-stochastic-environment-optional

The video explains that when there is a small chance of a mis-step, you calculate Q(s, a) = R(s) + (gamma) * avg(Q(s’, a’)).

How is avg(Q(s’, a’)) calculated?

Alireza_Saei · May 9, 2024, 8:29am

Hi @devashishdxt

The avg(Q(s’, a’)) term is the expected value, over all possible next states s’, of the best Q-value available in s’. Each next state is weighted by its probability of occurring given the current state and action:

avg(Q(s', a')) = \sum_{s'} P(s' | s, a) \times \max_{a'} Q(s', a')

This probability-weighted sum is the average that appears in the Bellman equation.
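To make the weighted sum concrete, here is a minimal Python sketch. The function name `expected_next_value`, the state labels "A" and "B", and all the numbers are made up for illustration; they are not from the course.

```python
def expected_next_value(transitions, q):
    """Probability-weighted average of the best Q-value in each next state.

    transitions: dict mapping next_state -> probability of landing there
    q:           dict mapping state -> {action: Q-value}
    """
    return sum(p * max(q[s_next].values()) for s_next, p in transitions.items())


# Hypothetical Q-values for two possible next states (illustrative only).
q = {
    "A": {"left": 100.0, "right": 100.0},
    "B": {"left": 10.0, "right": 20.0},
}

# Assumed outcome of one action: 90% chance of reaching A, 10% chance of B.
transitions = {"A": 0.9, "B": 0.1}

avg = expected_next_value(transitions, q)
print(avg)  # 0.9 * 100 + 0.1 * 20
```

The result is then plugged into Q(s, a) = R(s) + γ × avg.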

Hope this helps!

devashishdxt · May 10, 2024, 1:18am

Thanks for the response @Alireza_Saei. I understand the concept but don’t understand how to compute this.

Let’s take an example of 4 states:

State 1: Reward = 100 (terminal)
State 2: Reward = 0
State 3: Reward = 0
State 4: Reward = 40 (terminal)

States	1	2	3	4
Reward	100	0	0	40

Also, assume that discount factor is 0.5 and mis-step probability is 0.1.

So,
Q(1, left) = 100
Q(1, right) = 100

Q(4, left) = 40
Q(4, right) = 40

Q(2, left) = 0 + 0.5 * [0.9 * max(Q(1, left), Q(1, right)) + 0.1 * max(Q(3, left), Q(3, right))]
Q(2, right) = 0 + 0.5 * [0.9 * max(Q(3, left), Q(3, right)) + 0.1 * max(Q(1, left), Q(1, right))]

Q(3, left) = 0 + 0.5 * [0.9 * max(Q(2, left), Q(2, right)) + 0.1 * max(Q(4, left), Q(4, right))]
Q(3, right) = 0 + 0.5 * [0.9 * max(Q(4, left), Q(4, right)) + 0.1 * max(Q(2, left), Q(2, right))]

Now,

max(Q(1, left), Q(1, right)) = 100
max(Q(4, left), Q(4, right)) = 40

So,

Q(2, left) = 45 + 0.05 * max(Q(3, left), Q(3, right))
Q(2, right) = 5 + 0.45 * max(Q(3, left), Q(3, right))

Q(3, left) = 2 + 0.45 * max(Q(2, left), Q(2, right))
Q(3, right) = 18 + 0.05 * max(Q(2, left), Q(2, right))

How do I calculate the remaining values?

Please let me know if I’m completely wrong about this and there’s an easy way to find the values.

Alireza_Saei · May 10, 2024, 7:18am

You are correct. To compute the remaining values, you update the Q-values iteratively with the Bellman equation until they stabilize (converge).

In each iteration, you update the Q-value for every state-action pair using the Bellman equation:

Q(s, a) = R(s) + γ × avg(Q(s', a'))

In the first iteration, we start by using the initialized Q-values. Then, we update these values based on the Bellman equation to refine our estimates. I can show you how the calculation is done if you want!
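For the 4-state example above, the iteration could be sketched in Python roughly as follows. This is only a sketch under a few assumptions: Q-values start at zero, a mis-step means moving in the opposite direction (as in your equations), terminal states keep Q = R, and the names (`q_iteration`, `step`, `opposite`) are my own.

```python
GAMMA = 0.5      # discount factor from the example
MISSTEP = 0.1    # probability of moving in the opposite direction
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 40.0}
TERMINAL = {1, 4}
ACTIONS = ("left", "right")


def step(state, action):
    """Neighbouring state reached by moving left or right."""
    return state - 1 if action == "left" else state + 1


def opposite(action):
    return "right" if action == "left" else "left"


def q_iteration(tol=1e-10, max_iters=1000):
    # Assumed initialization: every Q-value starts at zero.
    q = {s: {a: 0.0 for a in ACTIONS} for s in REWARD}
    for _ in range(max_iters):
        new_q = {}
        for s in REWARD:
            new_q[s] = {}
            for a in ACTIONS:
                if s in TERMINAL:
                    # Terminal states: Q is just the reward.
                    new_q[s][a] = REWARD[s]
                    continue
                intended = step(s, a)
                slipped = step(s, opposite(a))
                # Probability-weighted average of the best next Q-values.
                avg = ((1 - MISSTEP) * max(q[intended].values())
                       + MISSTEP * max(q[slipped].values()))
                new_q[s][a] = REWARD[s] + GAMMA * avg
        # Largest change in any Q-value during this sweep.
        delta = max(abs(new_q[s][a] - q[s][a])
                    for s in REWARD for a in ACTIONS)
        q = new_q
        if delta < tol:
            break
    return q


q = q_iteration()
for s in sorted(q):
    print(s, {a: round(v, 4) for a, v in q[s].items()})
```

With these assumptions the values settle at approximately Q(2, left) = 46.14, Q(2, right) = 15.24, Q(3, left) = 22.76 and Q(3, right) = 20.31. You can check them against your simplified equations, for example Q(2, left) = 45 + 0.05 × 22.76 ≈ 46.14.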
