Reinforcement learning - initialization of Q

In the lesson “Learning the state-value function” we talk about the random initialization of Q, which is then improved in subsequent iterations (with either gradient descent or its mini-batch variant). I would like to understand what is meant by random initialization of Q.
Does it mean that it is:

  • a purely random number drawn from a distribution of type … with parameters …?
  • a random number that somehow depends on s’ and/or a’?
    A small worked example would help me!

In this example you would randomly assign Q a value, most likely generating it with np.random.randn(). This helps ensure that you’re likely to explore all possible paths instead of always following one set path and ignoring another that may be better.

Once you have assigned a random number, you start exploring different paths and update the Q value to a new, more accurate estimate.
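
As a minimal sketch of what "randomly assign Q a value" can look like for a small discrete problem, assume a table with one entry per state-action pair. The table size (6 states, 2 actions) and the `scale` factor are assumptions made for illustration, not something taken from the lesson.

```python
import numpy as np

np.random.seed(0)            # seeded only so the sketch is reproducible

n_states, n_actions = 6, 2   # assumed sizes, e.g. a small grid world
scale = 1.0                  # hypothetical knob: std. dev. of the initial guesses

# One independent draw from N(0, scale^2) per (state, action) pair.
# The initial values do not depend on s, a, s' or a'.
Q = scale * np.random.randn(n_states, n_actions)

print(Q.shape)
```

The starting values are only a first guess; the later updates are what pull them toward something meaningful.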

Thanks @RyanCarr.
But in the first iteration (or whenever Q has just been assigned randomly), how can I be sure that I explore rather than exploit a specific policy, given that:

  1. the values of s and a are the inputs (X) of the neural network, so I can’t choose strange or abnormal values myself
  2. the value of Q is random, so it does not depend on a, except through R(s), which is another input of the NN.

Finally, I’d like to understand whether there are guidelines/strategies for choosing:

  • gamma
  • the parameters of np.random.randn() (Is it really necessary that the distribution has zero mean and unit variance, or does it make sense to change them? If so, what does that depend on?)

Thanks for your help, I’m fascinated by reinforcement learning

Hi @Gaetano_Caira ,

I’ve had exactly the same confusion about the random initialization of Q and, inspired by @RyanCarr’s response, I decided to try out the algorithm by hand on the Mars Rover example with gamma = 0.5. It cleared up my confusion about what is meant by random initialization of Q. It seems you can indeed initialize Q to whatever values you want. Apologies for the long post… please bear with me.

Mars Rover example:

First iteration
- You will see that I went out of my way to initialize Q(2,–>) and Q(3,–>) to something entirely different from the rest, just to see what would happen.
- Note that Mars Rover has only a small number of possible state-action combinations (8, excluding combinations with terminal states), so I decided to try them all in every iteration.

Second iteration
- I think that because we’re in a discrete environment, we can just set Q_orig equal to the training set values, but we probably need a NN to do this with continuous states (??). If someone could let me know, that would be great.

Third iteration

Fourth iteration

Fifth… bear with me

Sixth and convergence

If you compare with the lecture video on the Mars Rover optimal policy for gamma = 0.5, you get the same results.

Now in a continuous environment, there’s no way we can store infinitely many Q(s,a) values, so this algorithm won’t work (I suspect the Neural Network step will probably fix this?), but I still find this illustration helpful for understanding the random initialization of Q.
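
For anyone who wants to reproduce the by-hand iterations, here is a rough sketch of the same idea in code. It assumes the Mars Rover layout as I recall it from the lectures (states 1 to 6, reward 100 in terminal state 1, reward 40 in terminal state 6, zero elsewhere) and the update Q(s,a) = R(s) + gamma * max over a' of Q(s',a'). The helper names `state_value` and `sweep` are made up for this example.

```python
import numpy as np

gamma = 0.5
# Assumed Mars Rover layout: states 1..6, states 1 and 6 are terminal.
rewards = {1: 100.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 40.0}
terminal = {1, 6}
actions = {"left": -1, "right": +1}

np.random.seed(0)
# Random initial guess for the 8 non-terminal (state, action) pairs.
Q = {(s, a): float(np.random.randn()) for s in range(2, 6) for a in actions}

def state_value(Q, s):
    """Best return from state s according to the current estimate Q."""
    if s in terminal:
        return rewards[s]  # the episode ends here, so only R(s) is collected
    return max(Q[(s, a)] for a in actions)

def sweep(Q):
    """One synchronous update of every (state, action) pair."""
    return {
        (s, a): rewards[s] + gamma * state_value(Q, s + step)
        for s in range(2, 6)
        for a, step in actions.items()
    }

for iteration in range(1, 51):
    new_Q = sweep(Q)
    if all(np.isclose(new_Q[k], Q[k]) for k in Q):
        break  # nothing changed in this sweep: converged
    Q = new_Q

for s in range(2, 6):
    print(s, round(Q[(s, "left")], 2), round(Q[(s, "right")], 2))
```

The random starting values are halved on every sweep, while the terminal rewards keep being fed in through the update, so the initial guess washes out and the reward information takes over.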

In any sort of complex system, finding the Q values is a big challenge.

Typically (for a Deep Q Network), the Q values themselves are learned by an NN that’s supported by a training set.

Hello @Kelvin_Yim,
I’m really grateful that you shared your reasoning and insight with me; it is of great value. I’ll try to give you my humble feedback on your doubt, plus one consideration.

  1. In the second iteration you update Q without the NN, and you assume this works because you are in the discrete case. Actually, what makes this (correct) step possible is that you have only 8 possible s-a combinations; with more of them you would need an analytical model (NN) to go from the exploration done in step 1 to the new initialization in step 2. In the continuous case the combinations certainly increase massively, but they can also grow in the discrete case, and then you have to decide which (s, a) combinations to investigate, since you cannot investigate them all (exploration vs. exploitation trade-off).
  2. Convergence took just 6 steps because the initialization of Q was close to the final values, but at this point I doubt that there are heuristics for initializing Q cleverly.
    P.S. The iterative sequence described here is not in the course slides:
    a) initialize/update Q
    b) perform random actions and record (s, a, R(s), s’)
    c) create the training set: x = (s, a) and y = R(s) + gamma * max over a’ of Q(s’, a’), the new estimate of Q(s, a)
    d) build the NN model
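
A rough sketch of steps a) to c) in code, under assumptions of my own: states are indexed 0 to 5 (so the terminal states are 0 and 5), the rewards are 100 and 40 in the terminal states and zero elsewhere, and the target for Q(s, a) is R(s) + gamma * max over a' of Q(s', a'). Step d) is only indicated by a comment.

```python
import numpy as np

gamma = 0.5
n_states, n_actions = 6, 2      # assumed sizes; states 0 and 5 are terminal
rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])  # assumed R(s)

np.random.seed(1)
Q = np.random.randn(n_states, n_actions)   # a) random initial guess

# b) take random actions and record (s, a, R(s), s') tuples
experiences = []
for _ in range(10):
    s = np.random.randint(1, 5)            # random non-terminal state (1..4)
    a = np.random.randint(n_actions)       # 0 = left, 1 = right
    s_next = s - 1 if a == 0 else s + 1
    experiences.append((s, a, rewards[s], s_next))

# c) build the training set: x = (s, a), y = R(s) + gamma * max_a' Q(s', a')
X, y = [], []
for s, a, r, s_next in experiences:
    is_terminal = s_next in (0, n_states - 1)
    future = rewards[s_next] if is_terminal else Q[s_next].max()
    X.append((s, a))
    y.append(r + gamma * future)

X, y = np.array(X), np.array(y)
# d) a model (a neural network in the course) would now be fit to map X to y
print(X.shape, y.shape)
```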

@Kelvin_Yim you are great :handshake:

Hi Kelvin,

I’m having a hard time understanding why the Q estimates will get better if we are training the NN to learn random rewards.

In your examples, when you update the training set in Step 2, you don’t seem to use any real-life “learning”; that is, you did not even need to know the reward function here. The resulting convergence seems to have come about purely from the randomly initialized values.

I agree, the DQN training process is confusing. I’m re-watching the lectures and the lunar lander lab to see if I can pick up the clues.

Here’s what I got from the notebook.

The ‘y’ values that we want to train the Q_network on don’t entirely exist. They exist partially, because we know the R values (from the system reward). So we estimate the full values using a second neural network. This second network generates “y target”, and we use the “soft update” method to update its weights, so that it doesn’t move very quickly.

So essentially, the Q-network weights are constantly chasing the much-more-stable “y targets”, and eventually the whole thing settles down into a set of Q-network weights that will minimize the cost.
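
Here is a small sketch of how I understand the soft update, written with plain NumPy arrays instead of real network weights. The value of `TAU` and the helper name `soft_update` are assumptions for illustration; the idea is just that the target weights move a small fraction of the way toward the Q-network weights at each step.

```python
import numpy as np

TAU = 1e-3  # assumed interpolation factor; smaller means a slower-moving target

def soft_update(q_weights, target_weights, tau=TAU):
    """Move each target weight array a small step toward the Q-network's."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(q_weights, target_weights)]

# Hypothetical weights: one 2x2 layer each, just to show the effect.
q_weights = [np.ones((2, 2))]
target_weights = [np.zeros((2, 2))]

target_weights = soft_update(q_weights, target_weights)
print(target_weights[0])  # each entry has moved 0.1% of the way from 0 toward 1
```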

I think the information in the Notebook is easier to comprehend than the lectures, but that’s just my personal learning style.

I’m still studying the details, as I don’t feel I grasp the whole process yet.


But here is where the NN is able to also estimate the rewards for an action that has not happened.

Consider a much larger set of states and actions… say in a 30x30 maze. Even if we explore 40%, 50%, or an even higher share of the time, there could still be states and actions that are never visited, or visited just a few times (not enough to average out and give an accurate reward value). This is where the NN, with its estimation power, gives us a leg up.

However, we do not let the NN take over entirely with its estimates; rather, we keep correcting it by also using the most recent set of rewards actually obtained for the various actions.