Unsupervised Learning: Week 3: Learning the state-value function

The “magic” works because optimizing the loss function minimizes the error (MSE). It is pure math plus the algorithm.

The first function is the loss function (compute_loss), which calculates the difference between:

- y_targets from the Bellman equation (written out right after this list):
y_targets = rewards + (gamma * max_qsa * (1 - done_vals))

where ‘max_qsa’ is the max of the ‘q_target_values’ that come from ‘target_q_network’,

and

- q_values from ‘q_network’
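
Written out, these are the formulas the code below implements (Q^ denotes the target_q_network and Q the q_network; this is just the lab’s code in equation form):

y_target = R                              if the episode terminates (done_vals = 1)
y_target = R + γ · max_a' Q^(s', a')      otherwise

loss = MSE(y_targets, q_values) = mean over the mini-batch of (y_target − Q(s, a))²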

In the second function, ‘agent_learn’, gradient descent (the derivative of the loss function) is applied to minimize the cost.
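
Conceptually, each call to agent_learn takes one gradient step on the weights w of the q_network (α is the learning rate; the lab’s optimizer may apply a more elaborate update rule such as Adam, but the idea is the same):

w ← w − α · ∂loss/∂w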

These two functions together achieve the learning. The target network is there to apply the ‘soft update’. You could implement this algorithm with a single Q-network, but it would be unstable; still, thinking in terms of one network helps comprehension. All the “magic” is explained by gradient descent + the loss function built from the Bellman equation: although the initial weights are random, the loss function is optimized through its derivative (the gradient). That is how the “magic” works.

The code is from the lab:

import tensorflow as tf
from tensorflow.keras.losses import MSE  # mean squared error loss used below

# q_network, target_q_network, optimizer and utils are defined elsewhere in the lab notebook.

def compute_loss(experiences, gamma, q_network, target_q_network):

    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences

    # Q-values for the next states from the target network
    q_target_values = target_q_network(next_states)

    # Compute max_a' Q^(s',a') over the actions for each next state
    max_qsa = tf.reduce_max(q_target_values, axis=-1)

    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s',a').

    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))


    # Q-values from the q_network for the current states
    q_values = q_network(states)

    # Select the Q-value corresponding to the action actually taken in each state
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))

    # Compute the loss
    loss = MSE(y_targets, q_values)

    return loss

def agent_learn(experiences, gamma):
    """
    Updates the weights of the Q networks.

    Args:
      experiences: (tuple) tuple of ["state", "action", "reward", "next_state", "done"] namedtuples
      gamma: (float) The discount factor.

    """

    # Calculate the loss
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)

    # Get the gradients of the loss with respect to the weights.
    gradients = tape.gradient(loss, q_network.trainable_variables)

    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Update the weights of the target_q_network (soft update)
    utils.update_target_network(q_network, target_q_network)
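
‘utils.update_target_network’ performs the ‘soft update’ mentioned above. Its implementation is not shown in this snippet, but a minimal sketch of what a soft update typically looks like is below (the function name and the tau value are illustrative assumptions, not the lab’s actual code):

def soft_update(q_network, target_q_network, tau=1e-3):
    # Move each target weight a small step towards the corresponding q_network weight:
    #   target_w <- tau * q_w + (1 - tau) * target_w
    for target_w, q_w in zip(target_q_network.weights, q_network.weights):
        target_w.assign(tau * q_w + (1.0 - tau) * target_w)

Because tau is small, the target network changes slowly, which keeps the y_targets stable while the q_network is being trained.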