I wonder if I am the only one who found difficulties in the part about the Bellman equation. My impression is that the lessons do a logical jump, moving from the Bellman optimality equation:
Q(s,a) = R(s) + γ * max in a’ of Q(s’,a’)
to Q-Learning equation found in the Lunar Lander lab:
Q(s,a) = Q(s,a) + α * [ R(s) + γ * max in a’ of Q(s’,a’) - Q(s,a) ]
After a long conversation with Mistral LLM (and a confirmation on authoritative sources), I finally understood that the first form applies only once you know exactly Q.
Q-learning starts from a random guess of Q and converges to its optimal value. During the learning process, the optimality equation doesn’t apply, thus the quantity in the square brackets is non zero. In facts, it is called temporal difference, TD, and represents an error that Q-learning attempts to zero.
Once the algorithm converges to the exact Q, the optimality equation becomes valid and TD becomes zero for all (a,s) pairs.
Maybe I just wrote a series of banalities, in that case please forgive me. If not, maybe these notes may help some fellow students to better understand this part.
Thank you for sharing, @kalbun. Your interpretation is correct, Bellman equation(s) are in some way a restatement of ‘dynamic programming’. Learning a more optimal policy vs. stopping the learning process is same as the ‘explore vs. exploit’ question. At the start of the learning process the policy is a random guess. There’s one key aspect most people overlook: from greedy vs. dynamic programming it’s known that taking a greedy action at each step forward may not lead to the most optimal solution - this is the reason behind adding randomness to the next action. During the learning process the policy (for simplicity, if all actions and states are finite) is guaranteed to keep improving in terms of total reward. Finally, once the most optimal policy is known, the fundamental principal behind dynamic programming applies: at any given point A the ‘path’ to the most optimal end solution Z (usually till perpetuity) is unique; also, for any given point B in the path between A-Z, the most optimal path from B to Z coincides with the most optimal path from A to Z that passes through B. Therefore, once the optimal policy is fully known the next action should be the ‘greedy’ action determined by the optimal policy.