Unsupervised Learning: Bellman Equation example looks incorrect

I am trying to understand why the returns of states 3 and 2 are 0, while the return of state 1 is 100.

My understanding is that the formula should be:
R(4) + 0.5(25) + 0.5^2(50) + 0.5^3(100)
which would be the same as
R(4) + 0.5(25 + 0.5*50 + 0.5^2*100)

Could someone confirm whether my understanding is correct? If not, what am I missing? Should I disregard the return values in red font?

Hi @jeshwang

As you know, the red values in the table are the maximum Q-values for those states. State 1 is terminal, so its return is directly 100. For states 2 and 3, the immediate reward R(s) is 0, and they don't lead directly to the terminal state with any higher rewards, so the discounted future rewards are also 0. You can ignore the red values and instead focus on the formula Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a').
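Written out with \gamma = 0.5 from your example, the recursion bottoms out at the terminal state:

Q(1, a) = R(1) = 100 \quad \text{(terminal state, so no discounted future term)}

Q(s, a) = R(s) + 0.5 \, \max_{a'} Q(s', a') \quad \text{(non-terminal states)}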

Hope it helps! Feel free to ask if you need further assistance.

Your formula is not quite right for this problem: it adds up discounted rewards over several future steps, which is not how the Bellman equation uses the discounted values.

You were suggesting:

R(4) + 0.5(25) + 0.5^2(50) + 0.5^3(100)

This is not correct because the Bellman equation takes the maximum of the next state's Q-values (not the sum of all of them). The discount is applied to the value of the best action from the next state. The formula should instead evaluate:

R(4) + \gamma \max_{a'} Q(3, a')

And since \max_{a'} Q(3, a') = 0, the result is just R(4), which is 0.
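If it helps to see that single backup as code, here is a tiny Python sketch (the helper name bellman_backup is just for illustration, not the course's implementation), plugging in the values from this example: R(4) = 0, \gamma = 0.5, and every Q(3, a') = 0.

```python
def bellman_backup(reward, gamma, next_state_q_values):
    """One Bellman update: Q(s, a) = R(s) + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_state_q_values)

# Values taken from the explanation above (treated as given, not derived here):
# R(4) = 0, gamma = 0.5, and every Q(3, a') = 0.
q_4_left = bellman_backup(reward=0, gamma=0.5, next_state_q_values=[0, 0])
print(q_4_left)  # 0.0 -> Q(4, <-) reduces to just R(4)
```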


The clarification of the formula makes sense. From what I am reading, the Bellman equation is inherently recursive in that the value of the current step depends on the values of future states. Am I on the right path here?

Yes, you are right! The Bellman equation is inherently recursive because it breaks the value of a state-action pair into two parts: the immediate reward R(s) for being in the current state s, and the expected future reward \gamma \max_{a'} Q(s', a'), which depends on the values of future states. This means that the value of being in a state depends not only on the immediate reward but also on future rewards that are themselves computed with the same Bellman equation.

The key aspect of this recursion is that the value of a current state s is tied to the values of future states s'. To compute the value of taking action a in state s, Q(s, a), one must know \max_{a'} Q(s', a'), the maximum future reward from the next state s' after choosing the best action a'. However, Q(s', a') is computed with the same equation and therefore depends on even more future states. For example (a short code sketch of this chain follows the list):

  • To calculate Q(4, \leftarrow), you need to know \max_{a'} Q(3, a').
  • To calculate Q(3, a'), you need to know \max_{a'} Q(2, a'), and so on.
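Here is a minimal recursive sketch of that chain in Python. The setup is made up for illustration (a 4-state chain where state 1 is terminal with reward 100, states 2 to 4 have reward 0, and the only action modeled is "move left" so the recursion terminates); it is only meant to show the recursive structure, not the exact numbers from the lecture.

```python
GAMMA = 0.5
# Hypothetical rewards for a simple 4-state chain (not the lecture's exact example).
REWARD = {1: 100, 2: 0, 3: 0, 4: 0}

def q_left(state):
    """Q(state, left) computed recursively from the Bellman equation."""
    if state == 1:
        # Terminal state: base case of the recursion, no future term.
        return REWARD[1]
    # Q(s, left) = R(s) + gamma * max_a' Q(s', a');
    # with a single action, the max is just the next q_left value.
    return REWARD[state] + GAMMA * q_left(state - 1)

print(q_left(4))  # 0 + 0.5 * (0 + 0.5 * (0 + 0.5 * 100)) = 12.5
```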

This recursive structure is what allows algorithms such as Value Iteration and Q-Learning to iteratively refine the value estimates for each state-action pair by repeatedly applying the Bellman equation until the values converge. The recursion stops when it reaches a terminal state, such as state 1 with its fixed reward of 100 and no future states to consider; that terminal state serves as the base case and lets the values propagate backward through the rest of the states.
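And here is a minimal value-iteration sketch on a similar made-up chain (again not the course's assignment code), just to show how repeatedly applying the Bellman equation makes the values converge and propagate backward from the terminal state.

```python
GAMMA = 0.5
STATES = [1, 2, 3, 4]
REWARD = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0}  # hypothetical rewards, state 1 is terminal
TERMINAL = {1}

def next_states(s):
    # Two hypothetical actions: step left (s - 1) or step right (s + 1), clipped to the chain.
    return [max(s - 1, 1), min(s + 1, 4)]

# Start every estimate at 0 and repeatedly apply the Bellman equation.
V = {s: 0.0 for s in STATES}
for sweep in range(50):
    new_V = {}
    for s in STATES:
        if s in TERMINAL:
            new_V[s] = REWARD[s]  # base case: terminal state, no future term
        else:
            new_V[s] = REWARD[s] + GAMMA * max(V[n] for n in next_states(s))
    if all(abs(new_V[s] - V[s]) < 1e-9 for s in STATES):
        break  # the values have converged
    V = new_V

print(V)  # {1: 100.0, 2: 50.0, 3: 25.0, 4: 12.5} -- values flow backward from state 1
```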