# Unsupervised Learning: Week 3: Learning the state-value function

I don't understand how the value predicted by the first neural network (with guessed parameters) can be taken as a proper output for a given state. It feels unreliable, and the fact that we use these predicted values to train another model in order to estimate the optimal Q function confuses me. I want to get some ideas on why this approach makes sense.

y = R(s) + γ * max_a' Q(s', a')

And I can't make sense of how the network gets max_a' Q(s', a') while it is still trying to estimate Q(s, a).


I'm with you; the details of this method are highly mysterious. I don't understand it well enough to answer questions on it.

I’ll check around for some other useful information links.


The "magic" works because we optimize the loss function to minimize the error (MSE). It is pure math plus the algorithm.

The first function is the loss function (compute_loss), which calculates the difference between:

- the y_targets from the Bellman equation:
  y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

  where 'max_qsa' is the max of the 'q_target_values' that come from the 'target_q_network'

and

- the q_values from the 'q_network'.

The second function, 'agent_learn', applies gradient descent (the derivative of the loss function) to minimize the cost.
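To make the target computation concrete, here is a minimal plain-Python sketch with made-up numbers (no TensorFlow; the batch values are illustrative only):

```python
# Numeric sketch of y = R + gamma * max_a' Q(s', a') * (1 - done).
gamma = 0.995

# One entry per experience in a tiny batch of 3:
rewards   = [1.0, -0.5, 100.0]
max_qsa   = [2.0,  3.0,   7.0]   # max over actions of the target network's output
done_vals = [0.0,  0.0,   1.0]   # 1.0 marks a terminal transition

y_targets = [r + gamma * q * (1 - d)
             for r, q, d in zip(rewards, max_qsa, done_vals)]

# The (1 - done) factor zeroes out the bootstrap term, so the
# terminal third entry keeps only its reward, 100.0.
print(y_targets)
```

Note how the `done_vals` mask implements the "y = R if the episode terminates" case without any branching.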

These two functions together achieve the learning. The purpose of the target network is to apply the 'soft update'. You could implement this algorithm with a single Q-network, but it would be unstable; still, thinking in terms of one network helps comprehension. All the "magic" is explained by gradient descent plus the loss function built from the Bellman equation: although the initial weights are random, the loss is optimized via its derivative (the gradient). That is how the "magic" works.
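For reference, the 'soft update' can be sketched in plain Python (the TAU value and the weight lists here are illustrative; in the lab, `utils.update_target_network` does the equivalent over the networks' TensorFlow variables):

```python
# Soft update sketch: the target network's weights drift slowly toward the
# q_network's weights, which is what keeps the training targets stable.
TAU = 1e-3  # illustrative soft-update rate

def soft_update(q_weights, target_weights, tau=TAU):
    """Return tau * q_weights + (1 - tau) * target_weights, element-wise."""
    return [tau * q + (1 - tau) * t
            for q, t in zip(q_weights, target_weights)]

q_w      = [1.0, 2.0, 3.0]
target_w = [0.0, 0.0, 0.0]
target_w = soft_update(q_w, target_w)
# target_w has moved a tiny step toward q_w (approximately [0.001, 0.002, 0.003])
```

Because tau is small, the targets change slowly even while the q_network changes quickly, which is what makes two networks more stable than one.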

The code is from the lab:

```python
import tensorflow as tf
from tensorflow.keras.losses import MSE

def compute_loss(experiences, gamma, q_network, target_q_network):
    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences

    q_target_values = target_q_network(next_states)

    # Compute max_a' Q^(s', a')
    max_qsa = tf.reduce_max(q_target_values, axis=-1)

    # Set y = R if the episode terminates, otherwise set y = R + γ max_a' Q^(s', a')
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

    # Get the Q-values of the actions actually taken, to match y_targets
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))

    # Compute the loss
    loss = MSE(y_targets, q_values)

    return loss

def agent_learn(experiences, gamma):
    """
    Updates the weights of the Q-networks.

    Args:
        experiences: (tuple) tuple of ["state", "action", "reward", "next_state", "done"] namedtuples
        gamma: (float) The discount factor.
    """
    # Calculate the loss, recording the operations so the gradients of the
    # loss with respect to the weights of q_network can be computed.
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)
    gradients = tape.gradient(loss, q_network.trainable_variables)

    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Update the weights of the target_q_network (soft update).
    utils.update_target_network(q_network, target_q_network)
```

I think the questions come from the use and training of the Q network itself.


Thank you


Thank you for the insights. I understand how the neural network works and how it's implemented in code, and I've reviewed the lectures as well. What puzzles me is that we start off with a completely random Q function (which we use when we evaluate the Bellman equation), and we train the neural network using the y values it produces. The neural network comes up with a good Q function estimate, but that estimate is based on the random Q function. I can't wrap my head around how we can use a random Q function to quantify something that directly affects the y value.


Yeah… the neural network comes up with a good Q function estimate, but it's based on the random Q function. I can't wrap my head around how we use a random Q function to quantify something that directly affects the y value.


Yes, in my head I can intuitively understand how it works, starting from the 'Lunar Lander' lab code. The Q-network begins with random weights, but once training starts, it optimizes itself to achieve the lowest error with respect to the randomly obtained average reward value. I say "randomly" because the algorithm starts with completely random actions by the agent, and it also takes 64 random experiences from the buffer of 100,000, which, in statistical terms, is a sample of the total, from which TensorFlow internally computes the mean.

Remember that the 'loss' function is used inside another function, the 'cost' function, whose derivative is taken. Inside that derivative there is a 1/m factor, which produces the mean, where m is the number of elements, in this case 64. Within this random behavior of the agent, the neural network is optimized toward the mean reward generated in the buffer as the two main loops of the algorithm progress, which run 2000 * 1000 times (num_episodes * max_num_steps). The network selects the best possible action, obtained with max_a' Q(s', a'; w⁻) in each training step.
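The sampling and averaging described above can be sketched without TensorFlow (the buffer size, its contents, and the stored values here are made up purely for illustration; the real buffer stores full experience tuples):

```python
import random

random.seed(0)  # reproducible sketch

# Illustrative replay buffer: each entry is a (y_target, q_value) pair as if
# the forward passes had already been done.
buffer = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]

# Uniform sample without replacement, like the lab's 64-experience mini-batch.
minibatch = random.sample(buffer, 64)

# MSE over the mini-batch: the 1/m factor (m = 64) is the mean mentioned above.
mse = sum((y - q) ** 2 for y, q in minibatch) / len(minibatch)
print(mse)
```

Each training step sees a different random sample, so the gradient is a noisy but unbiased estimate computed over the whole batch at once.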

In each training run there are average rewards of all kinds. Some samples of 64 may not contain a successful landing, while others may contain crash landings, correct landings, or cases where the average reward is better because the agent stayed closer to the correct trajectory or because its speed was lower.

In conclusion, the network adjusts itself to reach the mean of the random samples.
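The "start from random and still converge" behaviour can be seen even without a neural network. Below is a sketch of tabular Q-value iteration on a tiny made-up MDP (the states, rewards, and transitions are invented for illustration): two different random initializations end up at the same Q-values, because the Bellman update shrinks the error by a factor of gamma on every sweep, so the random starting point is gradually washed out.

```python
import random

gamma = 0.9

# Tiny invented MDP: 2 states, 2 actions; taking action a moves you to
# state a, and action 1 always pays reward 1 (action 0 pays 0).
def step(state, action):
    reward = 1.0 if action == 1 else 0.0
    return reward, action  # next state = the action taken

def q_iteration(q, sweeps=200):
    # Repeatedly apply the Bellman update y = R + gamma * max_a' Q(s', a').
    for _ in range(sweeps):
        new_q = {}
        for s in (0, 1):
            for a in (0, 1):
                r, s_next = step(s, a)
                new_q[(s, a)] = r + gamma * max(q[(s_next, 0)], q[(s_next, 1)])
        q = new_q
    return q

random.seed(1)
keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
q_random_1 = {k: random.uniform(-5, 5) for k in keys}
q_random_2 = {k: random.uniform(-5, 5) for k in keys}

q1 = q_iteration(q_random_1)
q2 = q_iteration(q_random_2)

# Both random starts converge to the same fixed point:
# Q(s, 1) -> 1 / (1 - gamma) = 10, and Q(s, 0) -> gamma * 10 = 9.
print(q1[(0, 1)], q2[(0, 1)])
```

The DQN does the same thing approximately, with a neural network in place of the table and a target network to keep the right-hand side of the update stable.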
