Unsupervised Learning: Bellman Equation example looks incorrect

I am trying to understand why the returns of states 3 and 2 are 0, while the return of state 1 is 100.

My understanding is that the formula should be:
R(4) + 0.5(25) + 0.5^2(50) + 0.5^3(100)
which would be the same as
R(4) + 0.5(25 + 0.5*50 + 0.5^2*100)

Could someone confirm whether my understanding is correct? If not, what am I missing? Should I disregard the return values in red font?

Hi @jeshwang

As you know, the red values in the table are the maximum Q-values for those states. State 1 is terminal, so its return is directly 100. For states 2 and 3, the immediate reward R(s) is 0, and they don't lead directly to the terminal state with any higher rewards, so the discounted future rewards are also 0. You can ignore the red values and instead focus on the formula Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a').
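Written out with \gamma = 0.5 from your example, the recursion bottoms out at the terminal state:

Q(1, a) = R(1) = 100 \quad \text{(terminal state, so no discounted future term)}

Q(s, a) = R(s) + 0.5 \, \max_{a'} Q(s', a') \quad \text{(non-terminal states)}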

Hope it helps! Feel free to ask if you need further assistance.

Your formula is not quite right for this problem: it adds up discounted rewards over several future steps, which is not how the Bellman equation uses the discounted values.

You were suggesting:

R(4) + 0.5(25) + 0.5^2(50) + 0.5^3(100)

This is not correct because the Bellman equation takes the maximum of the next state's Q-values (not the sum of all of them). The discount is applied to the value of the best action from the next state. The formula should instead evaluate:

R(4) + \gamma \max_{a'} Q(3, a')

And since \max_{a'} Q(3, a') = 0, the result is just R(4), which is 0.
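If it helps to see that single backup as code, here is a tiny Python sketch (the helper name bellman_backup is just for illustration, not the course's implementation), plugging in the values from this example: R(4) = 0, \gamma = 0.5, and every Q(3, a') = 0.

```python
def bellman_backup(reward, gamma, next_state_q_values):
    """One Bellman update: Q(s, a) = R(s) + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_state_q_values)

# Values taken from the explanation above (treated as given, not derived here):
# R(4) = 0, gamma = 0.5, and every Q(3, a') = 0.
q_4_left = bellman_backup(reward=0, gamma=0.5, next_state_q_values=[0, 0])
print(q_4_left)  # 0.0 -> Q(4, <-) reduces to just R(4)
```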


The clarification of the formula makes sense. From what I am reading, the Bellman equation is inherently recursive in that the value of the current step depends on the values of future states. Am I on the right path here?

Yes, you are right! The Bellman equation is inherently recursive because it breaks the value of a state-action pair into two parts: the immediate reward R(s) for being in the current state s, and the expected future reward \gamma \max_{a'} Q(s', a'), which depends on the values of future states. This means that the value of being in a state depends not only on the immediate reward but also on future rewards that are themselves computed with the same Bellman equation.

The key aspect of this recursion is that the value of a current state s is tied to the values of future states s'. To compute the value of taking action a in state s, Q(s, a), one must know \max_{a'} Q(s', a'), the maximum future reward from the next state s' after choosing the best action a'. However, Q(s', a') is computed with the same equation and therefore depends on even more future states. For example (a short code sketch of this chain follows the list):

  • To calculate Q(4, \leftarrow), you need to know \max_{a'} Q(3, a').
  • To calculate Q(3, a'), you need to know \max_{a'} Q(2, a'), and so on.
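Here is a minimal recursive sketch of that chain in Python. The setup is made up for illustration (a 4-state chain where state 1 is terminal with reward 100, states 2 to 4 have reward 0, and the only action modeled is "move left" so the recursion terminates); it is only meant to show the recursive structure, not the exact numbers from the lecture.

```python
GAMMA = 0.5
# Hypothetical rewards for a simple 4-state chain (not the lecture's exact example).
REWARD = {1: 100, 2: 0, 3: 0, 4: 0}

def q_left(state):
    """Q(state, left) computed recursively from the Bellman equation."""
    if state == 1:
        # Terminal state: base case of the recursion, no future term.
        return REWARD[1]
    # Q(s, left) = R(s) + gamma * max_a' Q(s', a');
    # with a single action, the max is just the next q_left value.
    return REWARD[state] + GAMMA * q_left(state - 1)

print(q_left(4))  # 0 + 0.5 * (0 + 0.5 * (0 + 0.5 * 100)) = 12.5
```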

This recursive structure is what allows algorithms such as Value Iteration and Q-Learning to iteratively refine the value estimates for each state-action pair by repeatedly applying the Bellman equation until the values converge. The recursion stops when it reaches a terminal state, such as state 1 with its fixed reward of 100 and no future states to consider; that terminal state serves as the base case and lets the values propagate backward through the rest of the states.
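And here is a minimal value-iteration sketch on a similar made-up chain (again not the course's assignment code), just to show how repeatedly applying the Bellman equation makes the values converge and propagate backward from the terminal state.

```python
GAMMA = 0.5
STATES = [1, 2, 3, 4]
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0}  # hypothetical rewards, state 1 is terminal
TERMINAL = {1}

def next_states(s):
    # Two hypothetical actions: step left (s - 1) or step right (s + 1), clipped to the chain.
    return [max(s - 1, 1), min(s + 1, 4)]

# Start every estimate at 0 and repeatedly apply the Bellman equation.
V = {s: 0.0 for s in STATES}
for sweep in range(50):
    new_V = {}
    for s in STATES:
        if s in TERMINAL:
            new_V[s] = REWARD[s]  # base case: terminal state, no future term
        else:
            new_V[s] = REWARD[s] + GAMMA * max(V[n] for n in next_states(s))
    if all(abs(new_V[s] - V[s]) < 1e-9 for s in STATES):
        break  # the values have converged
    V = new_V

print(V)  # {1: 100.0, 2: 50.0, 3: 25.0, 4: 12.5} -- values flow backward from state 1
```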