Notes on the Bellman equation

I wonder if I am the only one who found the part about the Bellman equation difficult. My impression is that the lessons make a logical jump, moving from the Bellman optimality equation:

Q(s,a) = R(s) + γ · max_{a'} Q(s',a')

to the Q-learning update rule used in the Lunar Lander lab:

Q(s,a) ← Q(s,a) + α · [ R(s) + γ · max_{a'} Q(s',a') − Q(s,a) ]
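
For concreteness, here is how I would write that update in code. This is only a tabular sketch with made-up sizes and hyperparameters (the lab itself approximates Q with a neural network and a replay buffer), so take it as an illustration of the update rule, not as the lab's implementation:

```python
import numpy as np

# Illustrative tabular setting: a table of Q estimates, one entry per (s, a).
# Sizes and hyperparameters below are arbitrary, not taken from the lab.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99          # learning rate and discount factor

def q_learning_update(Q, s, a, reward, s_next):
    """One Q-learning step: move Q[s, a] toward the TD target."""
    td_target = reward + gamma * np.max(Q[s_next])   # R(s) + γ·max_a' Q(s',a')
    td_error = td_target - Q[s, a]                   # the quantity in square brackets
    Q[s, a] += alpha * td_error                      # nudge the estimate by α·(TD error)
    return td_error

# Example step with made-up transition data:
q_learning_update(Q, s=0, a=1, reward=1.0, s_next=2)
```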

After a long conversation with the Mistral LLM (and confirmation from authoritative sources), I finally understood that the first form holds only once Q is known exactly, i.e. once it equals the optimal action-value function.
Q-learning starts from a random guess of Q and converges to its optimal value. During the learning process the optimality equation does not yet hold, so the quantity in the square brackets is non-zero. In fact, it is called the temporal-difference (TD) error, and it is exactly the error that Q-learning tries to drive to zero.
Once the algorithm converges to the exact Q, the optimality equation becomes valid and the TD error becomes zero for all (s,a) pairs.
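
To convince myself, I tried a tiny one-state, one-action example where the optimal Q can be computed in closed form (the numbers are mine, not from the course): the optimality equation gives Q* = R + γ·Q*, i.e. Q* = R / (1 − γ), and the update rule indeed drives the estimate there and the TD error to zero:

```python
# One state, one action, constant reward R, discount γ.
# Bellman optimality: Q* = R + γ·Q*  =>  Q* = R / (1 - γ) = 10.0 here.
R, gamma, alpha = 1.0, 0.9, 0.1
Q_star = R / (1 - gamma)

Q = 0.0                               # arbitrary initial guess
for _ in range(2000):
    td_error = R + gamma * Q - Q      # TD error for the single (s, a) pair
    Q += alpha * td_error             # Q-learning update

print(round(Q, 6), Q_star)            # ~10.0 vs. 10.0: the estimate has converged
print(round(R + gamma * Q - Q, 6))    # ~0.0: the TD error has vanished
```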

Maybe I have just written a series of banalities; if so, please forgive me. If not, perhaps these notes will help some fellow students understand this part better.


Thank you for sharing, @kalbun. Your interpretation is correct: the Bellman equations are, in a way, a restatement of dynamic programming. The trade-off between continuing to learn a better policy and stopping to act on what has already been learned is the same as the 'explore vs. exploit' question. At the start of the learning process the policy is a random guess.

There is one key aspect most people overlook: from the comparison of greedy algorithms with dynamic programming, it is known that taking the greedy action at every step may not lead to the optimal solution; this is the reason for adding randomness to the choice of the next action. During the learning process the policy (for simplicity, assuming all states and actions are finite) is guaranteed to keep improving in terms of total reward.

Finally, once the optimal policy is known, the fundamental principle behind dynamic programming applies: from any given point A, the 'path' to the optimal end solution Z (usually over an infinite horizon) is unique; moreover, for any point B on that path, the optimal path from B to Z coincides with the portion of the optimal path from A to Z that passes through B. Therefore, once the optimal policy is fully known, the next action should simply be the 'greedy' action determined by that policy.
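
As a small illustration of 'adding randomness to the next action': an ε-greedy rule explores with probability ε while learning and becomes purely greedy once the (near-)optimal Q is known. This is only a sketch with made-up values; the actual lab may use a different rule or schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, epsilon):
    """q_values holds the current Q estimates for every action in one state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))      # explore: pick a random action
    return int(np.argmax(q_values))                  # exploit: pick the greedy action

q_values = np.array([0.2, 1.5, 0.7])                 # made-up estimates for one state
print(epsilon_greedy_action(q_values, epsilon=0.3))  # during learning: sometimes random
print(epsilon_greedy_action(q_values, epsilon=0.0))  # once optimal: always greedy (action 1)
```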