The “magic” works because we are optimizing the loss function to minimize the error (MSE). It is pure math plus the algorithm.

The first function is the loss function (`compute_loss`), where it calculates **the difference between:**

**- y_targets** from the Bellman equation

`y_targets = rewards + (gamma * max_qsa * (1 - done_vals))`

`max_qsa` is the maximum of the `q_target_values` that come from `target_q_network`

and

**- q_values** from `q_network`
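To see the Bellman target computation in isolation, here is a minimal NumPy sketch; the batch values are made up for illustration:

```python
import numpy as np

# Hypothetical mini-batch of 3 transitions
rewards   = np.array([1.0, 0.5, -1.0])
max_qsa   = np.array([2.0, 3.0, 4.0])   # max over a' of Q_target(s', a')
done_vals = np.array([0.0, 0.0, 1.0])   # 1.0 marks a terminal transition
gamma = 0.99

# Bellman targets: y = R + gamma * max Q, except y = R at terminal states
y_targets = rewards + (gamma * max_qsa * (1 - done_vals))
print(y_targets)  # [ 2.98  3.47 -1.  ]
```

Note how the `(1 - done_vals)` factor zeroes out the future-reward term for the terminal transition, so its target is just the reward.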

In the second function, `agent_learn`, gradient descent (the derivative of the loss function) is applied to minimize the cost.
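To illustrate the same principle without any networks, here is a minimal NumPy sketch of gradient descent on an MSE loss, fitting a single weight; all values are invented for the example:

```python
import numpy as np

# Fit a single weight w so that w * x ~ y, minimizing MSE by gradient descent
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x          # "true" targets, generated with w = 2.0
w = 0.0              # initial weight (arbitrary, like random init)
lr = 0.05            # learning rate

for _ in range(100):
    error = w * x - y              # prediction error
    grad = 2 * np.mean(error * x)  # derivative of MSE with respect to w
    w -= lr * grad                 # step down the gradient

print(round(w, 3))  # converges close to 2.0
```

Even though `w` starts at an arbitrary value, repeatedly stepping against the gradient of the loss drives it toward the value that minimizes the MSE, which is exactly what `agent_learn` does with the network weights.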

These two functions together achieve the learning. The target network exists to apply the ‘soft update’. You could implement this algorithm with a single Q-network, but it would be unstable. **Still, you can think of it as one network to make it easier to understand.** All the “magic” is explained by **gradient descent + the loss function with the Bellman equation**: although the initial weights are random, the loss function is optimized through its derivative (the gradient). That is how the “magic” works.
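The ‘soft update’ mentioned above slowly blends the target network toward the Q-network instead of copying it outright. A minimal sketch over plain NumPy weight arrays, where the rate `TAU` is an assumed small constant (not taken from the lab code):

```python
import numpy as np

TAU = 1e-3  # soft-update rate (assumed value for illustration)

def soft_update(q_weights, target_weights, tau=TAU):
    """Move each target weight a small step toward the Q-network weight."""
    return [tau * qw + (1.0 - tau) * tw
            for qw, tw in zip(q_weights, target_weights)]

# Toy weights: one array per layer
q_w = [np.array([1.0, 1.0])]
t_w = [np.array([0.0, 0.0])]
t_w = soft_update(q_w, t_w)
print(t_w[0])  # [0.001 0.001] -- target drifts slowly toward the Q-network
```

Because `tau` is small, the targets in `compute_loss` change slowly between updates, which is what stabilizes the training compared with using a single network.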

The code is from the lab:

```
def compute_loss(experiences, gamma, q_network, target_q_network):
    # Unpack the mini-batch of experience tuples
    states, actions, rewards, next_states, done_vals = experiences
    q_target_values = target_q_network(next_states)
    # Compute max Q^(s,a)
    max_qsa = tf.reduce_max(q_target_values, axis=-1)
    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).
    y_targets = rewards + (gamma * max_qsa * (1 - done_vals))
    # Get the q_values and reshape to match y_targets
    q_values = q_network(states)
    q_values = tf.gather_nd(q_values, tf.stack([tf.range(q_values.shape[0]),
                                                tf.cast(actions, tf.int32)], axis=1))
    # Compute the loss
    ### START CODE HERE ###
    loss = MSE(y_targets, q_values)
    ### END CODE HERE ###
    return loss


def agent_learn(experiences, gamma):
    """
    Updates the weights of the Q networks.

    Args:
        experiences: (tuple) tuple of ["state", "action", "reward", "next_state", "done"] namedtuples
        gamma: (float) The discount factor.
    """
    # Calculate the loss
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)
    # Get the gradients of the loss with respect to the weights.
    gradients = tape.gradient(loss, q_network.trainable_variables)
    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    # Update the weights of the target q_network (soft update).
    utils.update_target_network(q_network, target_q_network)
```