Setting Q(s', a') to a Universal Value During Initialization

This is about the Bellman equation discussed in Week 3 of the Unsupervised Learning Course. Given Q(s, a) = R(s) + γ max_{a'} Q(s', a'), is Q(s', a') set to one value for all training examples during the first run?

Perhaps it is calculated backward after reaching the terminal state, and the estimate of Q(s', a') for each sample is then updated before a new cycle of training examples is generated. Or maybe it is a bet that the rewards will eventually wash out the randomness in the initial Q(s', a') values.
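Here is a minimal sketch of what I think happens on the first pass, assuming Q is a small function approximator like the neural network in the week-3 lab (the random linear "model", the feature size, and γ below are placeholders I made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "model": a random linear map from state features to one
# Q-value per action. Nothing has been learned yet -- the weights are random.
n_state_features, n_actions = 8, 4
W = rng.normal(scale=0.1, size=(n_state_features, n_actions))

def q_values(state):
    """Q(s, .) for every action, as produced by the untrained model."""
    return state @ W

gamma = 0.995  # placeholder discount factor

def target(reward, next_state, done):
    """y = R + gamma * max_a' Q(s', a'), or just R at a terminal state."""
    if done:
        return reward
    return reward + gamma * np.max(q_values(next_state))

# Each training example gets its own Q(s', a') -- whatever the random model
# outputs for that particular s' -- rather than one shared universal value.
batch = [(rng.normal(size=n_state_features), -0.3, False) for _ in range(3)]
print([target(r, s_next, d) for s_next, r, d in batch])
```

If that is right, then on the first run Q(s', a') is not set to one universal value; it is just whatever the untrained model happens to output for each particular s', and those arbitrary estimates get corrected over repeated training cycles.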

A Stack Exchange post suggested aggregating all of the rewards, including the reward at the terminal state, and using that aggregate to set Q(s', a').
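If I understand that suggestion correctly, it amounts to computing the discounted return backward from the terminal state, roughly like this (the episode rewards and γ are made up):

```python
# Sketch of the "aggregate the rewards" idea: compute the discounted return
# backward from the terminal state and use it to seed the Q(s', a') estimates.
gamma = 0.995
rewards = [-0.1, -0.1, -0.1, 100.0]  # made-up episode; last entry is the terminal reward

returns = [0.0] * len(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running  # G_t = R_t + gamma * G_{t+1}
    returns[t] = running

print(returns)  # returns[t] could seed the Q estimate for the state visited at step t
```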

Another article suggested setting all Q(s', a') values to zero or to -1.
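In the tabular case that would look something like this (the sizes and hyperparameters are made up):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # or np.full((n_states, n_actions), -1.0)

alpha, gamma = 0.1, 0.99              # made-up learning rate and discount factor

def q_learning_update(s, a, r, s_next, done):
    """One tabular Q-learning step. Q[s_next] is just the current table entry,
    which stays at the initialization value (0 or -1) until it gets updated."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_learning_update(s=3, a=1, r=-1.0, s_next=4, done=False)
```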

This article from Purdue University suggested building a Q-table with arbitrary initial values and then iteratively updating it:

In tabular RL, the action-value function Q(s, a) is initialized arbitrarily, often to zeros or small random values for all (s, a) pairs. This provides a starting point for the iterative process. For the optimal action-value function Q^∗(s, a), we can express the Bellman optimality equation as follows:

Q^∗(s, a) = R + \gamma \max_{a'} Q^∗(s', a'),

where s' is the next state. This identity is called the “principle of dynamic programming” and suggests that the remainder of an optimal trajectory is also optimal. This principle can be turned into an algorithm for finding the optimal action-value function called value iteration. The key idea behind value iteration is to think of this identity as a set of constraints that tie together Q^∗ across states and actions. At the i-th iteration, the algorithm updates the action-value function as:

Q_{i + 1}(s, a) = R + \gamma \max_{a'} Q_i(s', a').

This algorithm guarantees that the estimated action-value function converges to the optimal action-value function irrespective of the initialization Q_0: \displaystyle Q^∗(s, a) = \lim_{i \rightarrow \infty} Q_i(s, a) for all (s, a).
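To convince myself that the "irrespective of the initialization" claim makes sense, I tried a small sketch of that value-iteration update on a toy MDP (the transition table, rewards, and γ below are entirely made up):

```python
import numpy as np

# Toy deterministic MDP, purely invented: next_state[s, a] and reward[s].
n_states, n_actions = 4, 2
next_state = np.array([[1, 2],
                       [3, 0],
                       [3, 1],
                       [3, 3]])          # state 3 acts as an absorbing state
reward = np.array([0.0, 0.0, 0.0, 1.0])  # R(s), as in the form of the equation above
gamma = 0.9

Q = np.zeros((n_states, n_actions))      # arbitrary initialization Q_0
for i in range(200):
    # Q_{i+1}(s, a) = R(s) + gamma * max_{a'} Q_i(s', a')
    Q_new = reward[:, None] + gamma * np.max(Q[next_state], axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-9:
        break
    Q = Q_new

print(np.round(Q, 3))  # same fixed point whether Q started at zeros, -1, or random values
```

Whatever Q_0 I start from, the loop settles on the same table, which seems to be the point the Purdue article is making.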