# Deep Q-Learning Algorithm with Experience Replay

Hello everyone,
I'm actually a little bit confused about the implementation of the DQN algorithm. I understand all the steps of the algorithm, but there's something confusing me, which is how we calculate the targets y:

``````python
# Unpack the mini-batch of experience tuples.
states, actions, rewards, next_states, done_vals = experiences

# Compute max Q^(s,a).
max_qsa = tf.reduce_max(target_q_network(next_states), axis=-1)

# Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).
``````

How exactly does this work, and how do we manage to compute `max_qsa`?

Hello @ako,

This line has 2 parts:

``````python
max_qsa = tf.reduce_max(
    target_q_network(next_states),
    axis=-1
)
``````

namely a call to `target_q_network` and `tf.reduce_max`.

I assume you know what `target_q_network` is, because it is implemented by you. By calling such a neural network like a function with the input `next_states`, it does a forward propagation to compute the output of the neural network, which is the Q(s,a) of all possible actions.
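As a rough illustration of the shapes involved (the `Sequential` model below is a hypothetical stand-in for your `target_q_network`, assuming 4 state features and 2 actions; the real architecture is whatever you built in your implementation):

``````python
import tensorflow as tf

# Hypothetical stand-in for target_q_network: 4 state features -> 2 actions.
target_q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),  # one Q-value per possible action
])

next_states = tf.random.uniform((32, 4))  # batch of 32 next states
qsa = target_q_network(next_states)       # forward pass through the network
print(qsa.shape)                          # (32, 2): a Q(s,a) for every action
``````

So the forward pass returns one row of Q-values per state in the batch, with one column per action.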

I suggest you run

``````python
qsa = target_q_network(next_states)

print(next_states.shape)
print(qsa.shape)
print(next_states)
print(qsa)
``````

to see for yourself what they look like.

As for `tf.reduce_max`, please check out its documentation for examples and explanations.
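To sketch what `tf.reduce_max` does here (the toy Q-value matrix below is made up for illustration):

``````python
import tensorflow as tf

# Toy Q-value matrix: 2 states (rows) x 3 actions (columns).
qsa = tf.constant([[1.0, 3.0, 2.0],
                   [0.5, 0.1, 4.0]])

# axis=-1 takes the max over the last axis (the actions),
# leaving one best Q-value per state.
max_qsa = tf.reduce_max(qsa, axis=-1)
print(max_qsa.numpy())  # [3. 4.]
``````

In other words, `max_qsa` holds, for each next state in the batch, the largest Q-value over all actions, which is exactly the max Q^(s,a) term in the target.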

If you still have questions, please share your understanding with me so that I know what's unclear ;).

Cheers,
Raymond

PS: I am removing the part for the exercise since sharing assignment code isn’t allowed.