Unsupervised Learning: Week 3: Learning the state-value function

The “magic” works because optimizing the loss function minimizes the error (MSE). It is pure math plus the algorithm.

The first function is the loss function (compute_loss), which calculates the difference between:

- y_targets from the Bellman equation (written out right after this list):
y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

where ‘max_qsa’ is the max of the ‘q_target_values’ that come from ‘target_q_network’,

and

- q_values from ‘q_network’
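
Written out, these are the formulas the code below implements (Q^ denotes the target_q_network and Q the q_network; this is just the lab’s code in equation form):

y_target = R                              if the episode terminates (done_vals = 1)
y_target = R + γ · max_a' Q^(s', a')      otherwise

loss = MSE(y_targets, q_values) = mean over the mini-batch of (y_target − Q(s, a))²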

In the second function, ‘agent_learn’, gradient descent (the derivative of the loss function) is applied to minimize the cost.
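
Conceptually, each call to agent_learn takes one gradient step on the weights w of the q_network (α is the learning rate; the lab’s optimizer may apply a more elaborate update rule such as Adam, but the idea is the same):

w ← w − α · ∂loss/∂w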

These two functions together achieve the learning. The target network is there to apply the ‘soft update’. You could implement this algorithm with a single Q-network, but it would be unstable; still, thinking in terms of one network helps comprehension. All the “magic” is explained by gradient descent + the loss function built from the Bellman equation: although the initial weights are random, the loss function is optimized through its derivative (the gradient). That is how the “magic” works.

The code is from the lab:

import tensorflow as tf
from tensorflow.keras.losses import MSE  # mean squared error loss used below

# q_network, target_q_network, optimizer and utils are defined elsewhere in the lab notebook.

def compute_loss(experiences, gamma, q_network, target_q_network):

    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences

    # Q-values for the next states from the target network
    q_target_values = target_q_network(next_states)

    # Compute max_a' Q^(s',a') over the actions for each next state
    max_qsa = tf.reduce_max(q_target_values, axis=-1)

    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s',a').

    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))


    # Q-values from the q_network for the current states
    q_values = q_network(states)

    # Select the Q-value corresponding to the action actually taken in each state
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))

    # Compute the loss
    loss = MSE(y_targets, q_values)

    return loss

def agent_learn(experiences, gamma):
    """
    Updates the weights of the Q networks.

    Args:
      experiences: (tuple) tuple of ["state", "action", "reward", "next_state", "done"] namedtuples
      gamma: (float) The discount factor.

    """

    # Calculate the loss
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)

    # Get the gradients of the loss with respect to the weights.
    gradients = tape.gradient(loss, q_network.trainable_variables)

    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Update the weights of the target_q_network (soft update)
    utils.update_target_network(q_network, target_q_network)
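
‘utils.update_target_network’ performs the ‘soft update’ mentioned above. Its implementation is not shown in this snippet, but a minimal sketch of what a soft update typically looks like is below (the function name and the tau value are illustrative assumptions, not the lab’s actual code):

def soft_update(q_network, target_q_network, tau=1e-3):
    # Move each target weight a small step towards the corresponding q_network weight:
    #   target_w <- tau * q_w + (1 - tau) * target_w
    for target_w, q_w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(tau * q_w + (1.0 - tau) * target_w)

Because tau is small, the target network changes slowly, which keeps the y_targets stable while the q_network is being trained.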