Question about the state-value function learning algorithm

Hello, I’m unclear about how the NN learns to predict Q(s, a). In the lecture, Prof Ng states that:

" After creating these 10,000 training examples we’ll have training examples x1, y1 through x10,000, y10,000. We’ll train a neural network and I’m going to call the new neural network Q new, such that Q new of s, a learns to approximate y."

The NN’s parameters are randomly initialized, so the network outputs some arbitrary value Q(s,a). To train the network, we need to calculate the cost, which will be the sum of squared differences, (y - Q(s,a))^2. But how do we know what the value of y is for each training example?


After listening to the lecture again, I gather that the target y is calculated using the Bellman equation: y = R(s) + gamma * max_a’[Q(s’, a’)]. So y is essentially R(s) plus a random number (rand1), and Qnew(s,a) is also a random number (rand2). Now we calculate the loss as the sum of squared differences, (y - Qnew)^2, which is essentially (R(s) + rand1 - rand2)^2. Is my understanding correct?

If so, I’m still not clear on how this actually brings us closer to the true value of Q(s,a) over multiple training iterations. How does knowing R(s) lead us toward the actual target value?

Dear karko2tr,
Welcome to the Discourse community and thank you for asking this question. In my reply, I will do my best to help you figure out your issue.

Your understanding is partially correct. The target y is calculated using the Bellman equation, which estimates the expected future reward for taking action a in state s. The Bellman equation is defined as:
y = R(s) + γ * max_a’ Q(s’, a’)
where R(s) is the immediate reward obtained by taking action a in state s, s’ is the next state, a’ is the next action, γ is the discount factor, and Q is the Q-value function. The Q-value function is approximated by a neural network, which takes the state-action pair (s, a) as input and outputs the corresponding Q-value. The neural network is trained using stochastic gradient descent to minimize the mean squared error between the predicted Q-value and the target Q-value y.
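
For concreteness, here is a minimal sketch, assuming a small Keras network that takes (s, a) as input and outputs a single Q-value, of how such a target could be computed. The network shape, gamma, and every name below are illustrative assumptions, not the course's code:

```python
# Minimal sketch: Bellman targets for a Q-network that maps (state, one-hot
# action) to a single Q-value. Sizes, gamma, and names are illustrative only.
import numpy as np
import tensorflow as tf

gamma = 0.995                  # assumed discount factor
state_dim, n_actions = 8, 4    # assumed problem sizes

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          input_shape=(state_dim + n_actions,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # the scalar Q(s, a)
])
q_net.compile(optimizer="adam", loss="mse")  # gradient descent on squared error

def one_hot(a):
    v = np.zeros(n_actions)
    v[a] = 1.0
    return v

def target_y(r, s_next):
    """y = R(s) + gamma * max_a' Q(s', a'), where Q is the *current* network."""
    xs = np.array([np.concatenate([s_next, one_hot(a)]) for a in range(n_actions)])
    q_next = q_net.predict(xs, verbose=0).ravel()  # Q(s', a') for every a'
    return r + gamma * q_next.max()
```

At the very first iteration, q_net is randomly initialized, so the gamma * max term really is "R plus something fairly arbitrary", which is exactly the situation described in the question.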

However, the value of Qnew(s,a) is not random. It is the output of the neural network for the state-action pair (s, a). The loss function is the mean squared error between the predicted Q-value Qnew(s,a) and the target Q-value y. By minimizing the loss function, the neural network learns to approximate the Q-value function, which represents the expected future reward for taking action a in state s.

Knowing R(s) alone does not lead us close to the actual target value. The Bellman equation takes into account the expected future reward, which depends on the Q-value function. By iteratively updating the Q-value function using the Bellman equation and training the neural network to approximate the Q-value function, the agent learns to make optimal decisions in the environment. Over multiple training iterations, the Q-value function becomes more accurate, and the agent learns to make better decisions.
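
To make the "iteratively updating" part concrete, here is a sketch of the outer loop, reusing the hypothetical q_net / target_y / one_hot helpers from the sketch above. collect_experience is a made-up stand-in for however the agent gathers (s, a, r, s') tuples from the environment:

```python
def collect_experience(num_tuples):
    """Stand-in for running the agent in the environment; here it just
    fabricates random (s, a, r, s') tuples so the sketch is self-contained."""
    return [(np.random.randn(state_dim),
             np.random.randint(n_actions),
             float(np.random.randn()),
             np.random.randn(state_dim))
            for _ in range(num_tuples)]

for _ in range(100):                          # arbitrary number of outer iterations
    batch = collect_experience(num_tuples=10_000)

    # Targets come from the *current* network, so each y is R plus a (still
    # rough) estimate of the discounted future return. Written one tuple at a
    # time for clarity; a real implementation would batch these predictions.
    X = np.array([np.concatenate([s, one_hot(a)]) for (s, a, r, s_next) in batch])
    Y = np.array([target_y(r, s_next) for (s, a, r, s_next) in batch],
                 dtype=np.float32).reshape(-1, 1)

    # Fit Q_new to those targets; it then supplies the targets for the next
    # iteration, which is how the estimates keep improving.
    q_net.fit(X, Y, epochs=5, verbose=0)
```

Each pass fits a new network to targets built from the previous one, so the immediate rewards gradually propagate backwards through the Q estimates.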

I am hoping that my reply has been helpful to you. Feel free to reply, I am here to help you to the best of my abilities.
Best,
Can Koz


Hi @karko2tr,

Besides @canxkoz’s excellent explanation, let’s look at a specific example.

Before we start, let’s restate your setup and make an assumption on top of it. A training sample includes a state, an action, a reward, and a next state, or S, A, R, S’ in symbols.

  1. y_target for S, A = R + gamma * max_A’_Q(S’, A’), as you described, or R + rand1
  2. y_predict for S, A = Q(S, A) = rand2.
  3. The assumption is that, after training, the y_predict on the same training sample (S, A, R, S’) will equal R + rand1. OK?

Now comes the specific example.

Let’s say we obtained a series of samples through a simulated robot:

S1, A1, R1, S2, A2, R2, S3, A3, R3, S4, A4, R4, S5.

Here, we have 4 samples (a small code sketch of how to slice them out follows the list):

  1. S1, A1, R1, S2
  2. S2, A2, R2, S3
  3. S3, A3, R3, S4
  4. S4, A4, R4, S5
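
Purely as bookkeeping, here is a tiny sketch of how those 4 tuples can be sliced out of the trajectory; the S/A/R symbols are placeholder strings so the snippet runs on its own:

```python
# Slicing (S, A, R, S') samples out of one trajectory. Placeholder strings
# stand in for the real states, actions, and rewards.
states  = ["S1", "S2", "S3", "S4", "S5"]   # S5 is the last state reached
actions = ["A1", "A2", "A3", "A4"]
rewards = ["R1", "R2", "R3", "R4"]

samples = [(states[t], actions[t], rewards[t], states[t + 1])
           for t in range(len(actions))]
# -> [('S1', 'A1', 'R1', 'S2'), ..., ('S4', 'A4', 'R4', 'S5')]
```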

By our assumption, after training, the following predictions are more realistic:

  1. y_predict for S1, A1 = Q(S1, A1) = R1 + gamma * Q(S2, A2) (assume A2 is still the best action)
  2. y_predict for S2, A2 = Q(S2, A2) = R2 + gamma * Q(S3, A3) (assume A3 is still the best action)
  3. y_predict for S3, A3 = Q(S3, A3) = R3 + gamma * Q(S4, A4) (assume A4 is still the best action)
  4. y_predict for S4, A4 = Q(S4, A4) = R4 + gamma * max_A’_Q(S5, A’) = R4 + gamma * rand

Now, what can these equations tell us? If we substitute them into one another, we get

Q(S1, A1) = R1 + gamma * ( R2 + gamma * (R3 + gamma * (R4 + gamma * rand) ) ), or,

Q(S1, A1) = R1 + (gamma * R2) + (gamma^2 * R3) + (gamma^3 * R4) + (gamma^4 * rand)


The Take-away

If all of our assumptions are valid, our Q-network can now produce Q(S1, A1) as a sum of 4 concrete values and one rather random value. Moreover, that single random value is heavily discounted by gamma^4, which makes it quite INSIGNIFICANT. In other words, our Q-network now produces a prediction that contains only an insignificant portion of randomness.
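
To see this numerically, here is a throwaway check of the expansion above with made-up numbers (gamma = 0.5 and arbitrary rewards, chosen purely for illustration):

```python
# Numeric check of Q(S1, A1) = R1 + gamma*R2 + gamma^2*R3 + gamma^3*R4 + gamma^4*rand,
# with made-up values, to see how little the random tail contributes.
import random

gamma = 0.5                            # assumed discount factor for illustration
R1, R2, R3, R4 = 1.0, 2.0, 1.5, 0.5    # made-up rewards
rand = random.uniform(-10.0, 10.0)     # the untrained, essentially random Q(S5, ·)

q_s1_a1 = R1 + gamma * R2 + gamma**2 * R3 + gamma**3 * R4 + gamma**4 * rand
print(q_s1_a1)
# gamma**4 = 0.0625 here, so even a wildly wrong Q(S5, ·) in [-10, 10] shifts
# Q(S1, A1) by at most ~0.63, while the 4 real rewards contribute 2.4375.
# Longer chains push the random term down by even higher powers of gamma.
```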


Of course, you can argue against my assumptions. For example, I said,

assume A2 is still the best action

and you might ask: what if A2 is no longer the best action? Then I would say that even if some A2’ were the actual best action, it only means we need another sample S2, A2’, R2’, S3’ to train the Q-network to get a good Q(S2, A2’). Therefore, the more samples we have to train the Q-network, the stronger my assumption becomes.

You might also argue another assumption of mine:

after training, the y_predict on the same training sample (S, A, R, S’) will equal R + rand1

This is a matter of having a good architecture and training algorithm for our Q-network.

Look, in the above I am only providing a way to rationalize why it can work, not why it must work (we can keep attacking the assumptions). I also suggest you finish the week’s assignment to see for yourself that it can work.

Cheers,
Raymond

Thank you @canxkoz and @rmwkwok, I appreciate your detailed responses. Reading through both of your explanations and @rmwkwok’s intuitive example has helped me understand this better. I will continue with the rest of the videos and do the assignment.