Setting Q(s', a') to a Universal Value During Initialization

This is about the Bellman equation discussed in Week 3 of the Unsupervised Learning course. Given Q(s, a) = R(s) + γ max_a' Q(s', a'), is Q(s', a') set to one value for all training examples during the first run?
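For what it's worth, here is how I picture that first pass: every training example's target is built from the same current Q estimate, whatever arbitrary values it happens to hold at that point. A minimal numpy sketch (the sizes and numbers are made up for illustration; only the target formula follows the lecture):

```python
import numpy as np

# Toy setup: 6 states, 2 actions. These names and sizes are placeholders.
num_states, num_actions = 6, 2
gamma = 0.5

# Current Q estimate: on the very first pass this is whatever the
# initialization gives (random here), and it is the same table for
# every training example.
Q = np.random.randn(num_states, num_actions)

def bellman_target(reward, s_next, terminal):
    """y = R(s) + gamma * max_a' Q(s', a'), using the current Q estimate."""
    if terminal:
        return reward                      # no future term at a terminal state
    return reward + gamma * np.max(Q[s_next])

# One example training tuple (s, a, R(s), s'):
y = bellman_target(reward=100.0, s_next=4, terminal=False)
print(y)
```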

One possibility is that it is calculated backward after reaching the terminal state, and that the estimate of Q(s', a') for each sample is then updated before a new cycle of training examples is generated. Perhaps the bet is that the rewards will eventually wash out the randomness in the initial Q(s', a') values.
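If the values really were filled in backward from the terminal state, one episode's discounted returns could be computed like this (my own toy helper, not something from the course):

```python
def discounted_returns(rewards, gamma=0.5):
    """Work backward from the terminal state: G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# One episode's rewards, ending with the terminal state's reward:
print(discounted_returns([0.0, 0.0, 100.0]))   # [25.0, 50.0, 100.0]
```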

A Stack Exchange answer suggested taking the aggregate of all rewards, including the terminal state's reward, and using that value to set Q(s', a').

Another article suggested setting all Q(s', a') to zero or -1.
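Either suggestion is a one-line initialization. A quick numpy sketch covering both the constant values and the aggregated-rewards idea from the Stack Exchange answer (the episode below is made up):

```python
import numpy as np

num_states, num_actions = 6, 2

# Option 1: start every Q(s', a') at a constant (zero or -1).
Q_zero = np.zeros((num_states, num_actions))
Q_neg1 = np.full((num_states, num_actions), -1.0)

# Option 2: start every Q(s', a') at the total reward collected in a
# sample episode, terminal reward included (the Stack Exchange
# suggestion, as I read it).
episode_rewards = [0.0, 0.0, 100.0]          # made-up episode
Q_agg = np.full((num_states, num_actions), sum(episode_rewards))
```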

This article from Purdue University suggested building a Q-table and then looking up the reward.
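I don't have the article's code, but I imagine the Q-table approach roughly like this: build the table by sweeping the Bellman equation over a small deterministic problem, then look up values for the current state. All names and numbers below are my own placeholders (loosely modelled on the Mars rover example from the lectures), not the Purdue article's code:

```python
import numpy as np

num_states, num_actions = 6, 2
gamma = 0.5

# Rewards for this toy problem: R(s) depends only on the state.
R = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])

# Q-table: states 0 and 5 are terminal, so their value is just R(s);
# everything else starts at zero and is filled in by Bellman sweeps.
Q = np.zeros((num_states, num_actions))
Q[0, :] = R[0]
Q[-1, :] = R[-1]

# Deterministic transitions: action 0 moves left, action 1 moves right.
def next_state(s, a):
    return s - 1 if a == 0 else s + 1

for _ in range(50):                        # sweep until the values settle
    for s in range(1, num_states - 1):     # skip the terminal states
        for a in range(num_actions):
            Q[s, a] = R[s] + gamma * np.max(Q[next_state(s, a)])

# "Looking up the reward": read the table row for the current state.
print(Q[2])        # roughly [25.0, 6.25] for this toy setup
```

The terminal rows are fixed to R(s) because the Bellman equation reduces to Q(s, a) = R(s) when there is no next state; the rest of the table is whatever the repeated sweeps converge to.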