Optimal policy and the Q left, Q right values


I am quite confused with the State-action-value lab of Reinforcement Learning.

In the lab, the Mars rover example described in the lecture was implemented.

As per my understanding, for a discrete state space:

  1. num_states, number_actions, rewards are initialized
  2. The policy is initialized with an arbitrary action for every state. (In the lab, it was the left action for all states.)
  3. The Q values for all the states are also initialized (to all zeros in the lab).
  4. With this initial setup, the Q value is calculated for each state using the initial policy.
  5. From here, the policy is improved by calculating the Q value for both actions at each state and taking the better of the two. The Q value at that state is also updated.
  6. This process is repeated until the difference between the previous and current Q values at each state is negligible.
  7. This is the process of optimizing the policy and the Q value at each state.
  8. Finally, the optimal policy and Q values are returned.
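The steps above can be sketched as a minimal policy-iteration loop. Everything below is illustrative, not the lab's actual code: a 6-state Mars-rover-style chain with gamma = 0.5, terminal rewards 100 and 40, and deterministic transitions are all assumptions for the sketch.

```python
import numpy as np

# Illustrative setup (step 1): 6 states, 2 actions, terminal rewards at the ends.
num_states, num_actions, gamma = 6, 2, 0.5
rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])

# Deterministic transitions: action 0 moves left, action 1 moves right;
# states 0 and 5 are terminal (no outgoing transitions).
transition_prob = np.zeros((num_states, num_actions, num_states))
for s in range(1, num_states - 1):
    transition_prob[s, 0, s - 1] = 1.0   # action 0: left
    transition_prob[s, 1, s + 1] = 1.0   # action 1: right

policy = np.zeros(num_states, dtype=int)  # step 2: "left" everywhere
V = np.zeros(num_states)                  # step 3: zero-initialized values

for _ in range(100):
    # Steps 4-6: evaluate the values under the current policy until they settle...
    P_pi = transition_prob[np.arange(num_states), policy]  # P(s' | s) under pi
    for _ in range(100):
        V = rewards + gamma * (P_pi @ V)
    # ...then improve: compute Q(s, a) for both actions, keep the better one.
    Q = rewards[:, None] + gamma * (transition_prob @ V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                             # steps 7-8: policy stable, done
    policy = new_policy

print(policy)  # optimal action per state
print(V)       # state values under the optimal policy
```

With these numbers the loop settles after two improvement rounds, matching the lecture's gamma = 0.5 answer (go left except in the state adjacent to the smaller terminal reward).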

After watching the lecture, I was under the impression that finding Q_left and Q_right at each state, and then applying the Bellman equation at each state, would determine the optimal policy. But looking into the lab, the optimal policy is calculated first, and after that, q_left and q_right are calculated using the optimal policy. I couldn’t wrap my head around this. Why is there a need to calculate Q_left and Q_right using the already-found optimal_policy?

Is it just for illustration purposes?

How is it usually done in practice?

Also, in calculate_Q_value():
q_sa = rewards[state] + gamma * sum([transition_prob[state, action, sp] * V_states[sp] for sp in range(num_states)])

But in a random, stochastic environment, according to the Bellman equation, shouldn’t the Q value be the expected/average value of Q(s’, a’)? Why are we only performing a sum here?

Kindly help me with this.


Best Regards,

You didn’t explain why it is a problem to compute Q_left and Q_right.

Was it that the code never used the computed Q_left and Q_right, and so you were questioning the need?

Or were you questioning the need because, instead of computing Q_left and Q_right, there was some other real need?

Yes, it should be the expected value. It seems to me you were wondering why we didn’t divide the sum by anything. If you think one step further, what should we have divided the sum by?


Hi Raymond,

I have come to understand that q_left and q_right are actually calculated in improve_policy(). So, the policy is first initialized to an arbitrary action, and then, based on this initial policy, the Q values are calculated at each state in the first step.

From then on, the policy is improved/optimized in improve_policy() based on the Q values of action 0 (left) and action 1 (right) at every state.

Since the variables named q_left and q_right are assigned after the optimal policy has been found, in generate_visualization(), I got confused. But now everything is clear: this is just to show the q_left and q_right values from which the optimal_policy is actually found.

Regarding the expected value of Q(s’, a’): my understanding was that the expected value of Q(s’, a’) is the sum of all Q(s’, a’) divided by the number of policies there are.

But in the code, what we are actually doing is iterating through each policy and getting the Q value under that policy.

Thinking it through further: if we had all possible policies, then for any starting state, the return Q(s, a) could be calculated by taking Q(s’, a’) from all these policies and averaging them.

But in the code, we find Q(s’, a’) using the policy we have at the moment, then we update the previous policy, and we also update Q(s, a) and Q(s’, a’). So there is no point in averaging the Q(s’, a’) values.

Is my understanding correct? Kindly correct me if anything is wrong.

Thanks as always!

Hello @bhavanamalla,

The expectation operation is not over all possible policies. If you watch the following lecture again:

Each row of the blue numbers represents one possible path, and the expectation (or average) operation is taken over those paths (not over policies).

However, the lab didn’t simulate the paths, and it doesn’t have to, because transition_prob already contains all the information about how the bot transitions. In other words, if we are interested in how the bot moves, looking at transition_prob is sufficient; we don’t have to generate the paths.
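To make the "average over paths" idea concrete, here is a small sketch with made-up numbers: sampling many next states (the path-based view) and taking the probability-weighted sum over transition_prob give the same answer, so the paths never need to be generated explicitly. The probabilities and state values below are illustrative assumptions, not the lab's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed P(s' | s, a) for one (state, action) pair, over three next states:
# e.g. 10% misstep, 90% intended move, 0% otherwise.
p_next = np.array([0.1, 0.9, 0.0])
V_states = np.array([25.0, 12.5, 6.25])   # illustrative state values

# "Expectation over paths": sample many next states, then average their values.
samples = rng.choice(len(p_next), size=100_000, p=p_next)
mc_estimate = V_states[samples].mean()

# Using transition_prob directly: the probability-weighted sum IS the exact
# expectation -- no simulation, and no extra division, required (the
# probabilities already sum to 1).
exact = (p_next * V_states).sum()

print(exact, mc_estimate)  # both are approximately 13.75
```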

The expected return is calculated in line 33. And what it does is the following:
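As a hedged sketch of what that weighted sum amounts to (made-up numbers and assumed array shapes, with variable names following the calculate_Q_value() line quoted earlier in this thread): because the transition probabilities for a fixed (state, action) pair sum to 1, the sum is already the expected value of V(s’), with nothing left to divide by.

```python
import numpy as np

# Illustrative 3-state setup; values are assumptions for the sketch.
num_states, gamma = 3, 0.5
rewards = np.array([0.0, 0.0, 40.0])
V_states = np.array([25.0, 12.5, 40.0])

# P(s' | s, a): for a fixed (state, action), the row over s' sums to 1,
# so the weighted sum below is already the expectation of V(s').
transition_prob = np.zeros((num_states, 2, num_states))
transition_prob[1, 1] = [0.1, 0.0, 0.9]   # e.g. 10% misstep, 90% intended

state, action = 1, 1
# The same form as the lab's line: reward plus the discounted,
# probability-weighted sum of next-state values.
q_sa = rewards[state] + gamma * sum(
    transition_prob[state, action, sp] * V_states[sp]
    for sp in range(num_states)
)
print(q_sa)  # rewards[1] + 0.5 * (0.1 * 25 + 0.9 * 40), i.e. about 19.25
```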


Thanks a lot for clearing out my confusion @rmwkwok :raised_hands: