Deep Q-Learning Algorithm with Experience Replay

Hello everyone,
I'm actually a little bit confused about the implementation of the DQN algorithm. I understand all the steps of the algorithm, but there's something that still confuses me, which is how we calculate the targets y:

# Unpack the mini-batch of experience tuples.
states, actions, rewards, next_states, done_vals = experiences

# Compute max Q^(s,a).
max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)

# Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).

How exactly does this work, and how do we manage to compute max_qsa?

Hello @ako,

This line has 2 parts:

max_qsa = tf.reduce_max(
    target_q_network(next_states),
    axis=-1
)

namely a call to target_q_network and tf.reduce_max.

I assume you know what target_q_network is, because you implemented it yourself. When you call such a neural network like a function with the input next_states, it does a forward propagation to compute the output of the network, which is Q(s,a) for all possible actions.

I suggest you run

qsa = target_q_network(next_states)

print(next_states.shape)
print(qsa.shape)
print(next_states)
print(qsa)

to see for yourself what they look like.
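
If you would rather see it with a standalone example first, here is a minimal sketch you can run anywhere. Note that toy_q_network below is a tiny network I am making up (8 state features, 4 actions) purely for illustration; it is not the assignment's network:

import tensorflow as tf

# A toy Q-network: 8 state features in, one Q-value per action out (4 actions here).
toy_q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4),
])

# A fake mini-batch of 3 "next states".
next_states = tf.random.uniform((3, 8))

# Calling the network like a function runs a forward pass.
qsa = toy_q_network(next_states)
print(qsa.shape)  # (3, 4): one row per state in the batch, one column per action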

As for tf.reduce_max, please check out its documentation for examples and explanations.
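
To give one quick illustration of what tf.reduce_max with axis=-1 does, here is a small made-up tensor (the numbers are arbitrary):

import tensorflow as tf

# A made-up batch of Q-values: 2 states, 3 actions each.
q_values = tf.constant([[1.0, 5.0, 3.0],
                        [4.0, 2.0, 6.0]])

# axis=-1 takes the maximum over the last axis (the actions),
# leaving one value per state.
print(tf.reduce_max(q_values, axis=-1))  # tf.Tensor([5. 6.], shape=(2,), dtype=float32)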

If you still have questions, please share with me your understanding so that I know what’s unclear ;).

Cheers,
Raymond

PS: I am removing the part for the exercise since sharing assignment code isn’t allowed.