Setting Q(s', a') to a Universal Value During Initialization

This is about the Bellman equation discussed in Week 3 of the Unsupervised Learning Course. Given Q(s, a) = R(s) + γ max_{a'} Q(s', a'), is Q(s', a') set to one value for all training examples during the first run?

Perhaps it is calculated backward after reaching the terminal state, and the estimate of Q(s', a') for each sample is then updated before a new cycle of training examples is generated. Or maybe it is a bet that the rewards will eventually wash out the randomness in the initial Q(s', a') values.
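Here is a minimal sketch of what I think happens on the first pass, assuming Q is a small function approximator like the neural network in the week-3 lab (the random linear "model", the feature size, and γ below are placeholders I made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny "model": a random linear map from state features to one
# Q-value per action. Nothing has been learned yet -- the weights are random.
n_state_features, n_actions = 8, 4
W = rng.normal(scale=0.1, size=(n_state_features, n_actions))

def q_values(state):
    """Q(s, .) for every action, as produced by the untrained model."""
    return state @ W

gamma = 0.995  # placeholder discount factor

def target(reward, next_state, done):
    """y = R + gamma * max_a' Q(s', a'), or just R at a terminal state."""
    if done:
        return reward
    return reward + gamma * np.max(q_values(next_state))

# Each training example gets its own Q(s', a') -- whatever the random model
# outputs for that particular s' -- rather than one shared universal value.
batch = [(rng.normal(size=n_state_features), -0.3, False) for _ in range(3)]
print([target(r, s_next, d) for s_next, r, d in batch])
```

If that is right, then on the first run Q(s', a') is not set to one universal value; it is just whatever the untrained model happens to output for each particular s', and those arbitrary estimates get corrected over repeated training cycles.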

A Stack Exchange post suggested aggregating all of the rewards, including the reward at the terminal state, and using that aggregate to set Q(s', a').
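If I understand that suggestion correctly, it amounts to computing the discounted return backward from the terminal state, roughly like this (the episode rewards and γ are made up):

```python
# Sketch of the "aggregate the rewards" idea: compute the discounted return
# backward from the terminal state and use it to seed the Q(s', a') estimates.
gamma = 0.995
rewards = [-0.1, -0.1, -0.1, 100.0]  # made-up episode; last entry is the terminal reward

returns = [0.0] * len(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running  # G_t = R_t + gamma * G_{t+1}
    returns[t] = running

print(returns)  # returns[t] could seed the Q estimate for the state visited at step t
```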

Another article suggested setting all Q(s', a') values to zero or to -1.
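In the tabular case that would look something like this (the sizes and hyperparameters are made up):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # or np.full((n_states, n_actions), -1.0)

alpha, gamma = 0.1, 0.99              # made-up learning rate and discount factor

def q_learning_update(s, a, r, s_next, done):
    """One tabular Q-learning step. Q[s_next] is just the current table entry,
    which stays at the initialization value (0 or -1) until it gets updated."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_learning_update(s=3, a=1, r=-1.0, s_next=4, done=False)
```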

This article from Purdue University suggested building a Q-table with arbitrary initial values and then iteratively updating it:

In tabular RL, the action-value function Q(s, a) is initialized arbitrarily, often to zeros or small random values for all (s, a) pairs. This provides a starting point for the iterative process. For the optimal action-value function Q^∗(s, a), we can express the Bellman optimality equation as follows:

Q^∗(s, a) = R + \gamma \max_{a'} Q^∗(s', a'),

where s' is the next state. This identity is called the “principle of dynamic programming” and suggests that the remainder of an optimal trajectory is also optimal. This principle can be turned into an algorithm for finding the optimal action-value function called value iteration. The key idea behind value iteration is to think of this identity as a set of constraints that tie together Q^∗ across states and actions. At the i-th iteration, the algorithm updates the action-value function as:

Q_{i + 1}(s, a) = R + \gamma \max_{a'} Q_i(s', a').

This algorithm guarantees that the estimated action-value function converges to the optimal action-value function irrespective of the initialization Q_0: \displaystyle Q^∗(s, a) = \lim_{i \rightarrow \infty} Q_i(s, a) for all (s, a).
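To convince myself that the "irrespective of the initialization" claim makes sense, I tried a small sketch of that value-iteration update on a toy MDP (the transition table, rewards, and γ below are entirely made up):

```python
import numpy as np

# Toy deterministic MDP, purely invented: next_state[s, a] and reward[s].
n_states, n_actions = 4, 2
next_state = np.array([[1, 2],
                       [3, 0],
                       [3, 1],
                       [3, 3]])          # state 3 acts as an absorbing state
reward = np.array([0.0, 0.0, 0.0, 1.0])  # R(s), as in the form of the equation above
gamma = 0.9

Q = np.zeros((n_states, n_actions))      # arbitrary initialization Q_0
for i in range(200):
    # Q_{i+1}(s, a) = R(s) + gamma * max_{a'} Q_i(s', a')
    Q_new = reward[:, None] + gamma * np.max(Q[next_state], axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-9:
        break
    Q = Q_new

print(np.round(Q, 3))  # same fixed point whether Q started at zeros, -1, or random values
```

Whatever Q_0 I start from, the loop settles on the same table, which seems to be the point the Purdue article is making.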