I was wondering why we are using GradientTape to find the partial derivatives of the loss with respect to all the weights of the Q-network and then updating the weights. Couldn’t we have instead done the usual model.compile / model.fit process to train the Q-network and then done a soft update of the target Q-network?
We do this because the introductory-level TensorFlow (Keras) training API does not know how to perform the state-action updates or interact with the environment.
There are TensorFlow features for reinforcement learning, but they would hide the details of how the process works. Since this course tries to teach the basic principles, it doesn’t use those built-in methods.
But in the step where we train the Q-network, all we are doing is feeding it states and asking it to learn to predict the Q-values for all 4 actions. Why would TensorFlow need to interact with the environment? We do the interactions ourselves, collect (S, A, R, S') tuples, create the training data, and feed this to the NN to tune the weights to predict decent Q-values. So we could just do model.compile and model.fit, like any other NN with training data. I don’t see how this is different from any other NN training.
You can’t perform the soft update just once after fully training the Q-network (i.e. on a model that has already been fit); it needs to happen every C time steps. That’s why we use GradientTape() at each update to record the gradients and take one optimizer step on the Q-network, and then softly update the target Q-network. This approach is commonly known as a custom training loop.
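To make that concrete, here is a minimal sketch of the gradient-step part of such a custom training loop. It assumes q_network and target_q_network (two Keras models with the same architecture) and an Adam optimizer have already been created elsewhere; the way experiences is unpacked and the exact loss are illustrative, not necessarily identical to the lab’s code:

```python
import tensorflow as tf

@tf.function
def agent_learn(experiences, gamma):
    states, actions, rewards, next_states, done_vals = experiences

    with tf.GradientTape() as tape:
        # TD target: y = R + gamma * max_a' Q_target(S', a') for non-terminal S'
        max_next_q = tf.reduce_max(target_q_network(next_states), axis=-1)
        y_targets = rewards + gamma * max_next_q * (1.0 - done_vals)
        # Q(S, A) for the actions that were actually taken
        q_sa = tf.gather(q_network(states), tf.cast(actions, tf.int32),
                         batch_dims=1)
        loss = tf.reduce_mean(tf.square(y_targets - q_sa))

    # Partial derivatives of the loss w.r.t. every Q-network weight ...
    gradients = tape.gradient(loss, q_network.trainable_variables)
    # ... and one optimizer step on the Q-network only
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    # The soft update of the target Q-network would follow right here,
    # every C time steps (see below).
```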
I’m referring to the agent_learn(experiences, gamma) function. We already have data pertaining to the agent’s experiences. So why can’t we train it like we usually do? Data collection aside, once we have the data to train on, why should things be any different?
The model.fit() function does not include soft updates, and it doesn’t give us control over its training loop. That is why we use a custom agent_learn() function, in which the parameter update and the soft update are defined explicitly.
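For reference, the soft update itself is just a couple of lines that model.fit would never run on its own; a sketch, assuming the same two networks as above and an interpolation rate TAU (the exact value is a design choice):

```python
TAU = 1e-3  # soft-update rate; illustrative value

def update_target_network(q_network, target_q_network, tau=TAU):
    # w_target <- tau * w + (1 - tau) * w_target, applied weight by weight
    for w_target, w in zip(target_q_network.weights, q_network.weights):
        w_target.assign(tau * w + (1.0 - tau) * w_target)
```

Inside the custom loop this is simply called right after the optimizer step, which is exactly the kind of control model.fit does not give you.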
I think I get it now. So, every C time steps we do just one update of the Q-network weights, i.e. a single gradient descent step, followed by a soft update of the target Q-network. I see now why we can’t use model.fit.
As another related question: we have max_num_timesteps = 1000, so we take 1000 actions per episode (possibly fewer if we reach a terminal state before 1000 actions). Instead of updating the weights with one gradient descent step every C (4) time steps, why don’t we just train the Q-network every, say, 200 time steps using model.fit (on around 100 or 1000 data points from experience replay)? Then, using these newly learned weights, perform a soft update on the target Q-network (w⁻ = 0.99·w⁻ + 0.01·w). Is this just a matter of design choice? Was the custom training loop used because it leads to faster convergence and fewer oscillations / less instability than doing it the other way?
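Concretely, I’m imagining something roughly like this (just a sketch: states and y_targets would be built from the sampled tuples so that only the taken action’s Q-value is moved toward R + gamma * max_a' Q_target(S', a'), and the learning rate is arbitrary):

```python
# Hypothetical variant: a full model.fit pass on a replay batch,
# followed by the soft update with the coefficients above.
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
q_network.fit(states, y_targets, epochs=1, verbose=0)

for w_target, w in zip(target_q_network.weights, q_network.weights):
    w_target.assign(0.99 * w_target + 0.01 * w)
```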
Exactly, but it isn’t a plain gradient descent step; the Adam update rule is used instead.
It may depend, but in that scenario you would also be updating the target Q-network less frequently, since the soft update would only happen after a much larger number of time steps, and that could result in slower convergence.
Faster convergence and fewer oscillations come mainly from the optimization algorithm (Adam, in the lab) and the right choice of learning rate, whereas the custom training loop is there to define how the agent interacts with the environment and how the training process is carried out.