I was wondering why we are using GradientTape to find the partial derivatives of the loss with respect to all the weights of the Q-network and then updating the weights. Couldn’t we have instead done the usual model.compile / model.fit process to train the Q-network and then done a soft update of the target Q-network?
We do this because the introductory-level TensorFlow (Keras) training API does not know how to perform the state-action updates or interact with the environment.
There are TensorFlow features for reinforcement learning, but they would hide the details of how the process works. Since this course tries to teach the basic principles, it doesn’t use those built-in methods.
But in the step where we train the Q-network, all we are doing is feeding it states and asking it to learn to predict the Q-values for all 4 actions. Why would TensorFlow need to interact with the environment? We do the interactions ourselves, collect (S, A, R, S') tuples, create the training data, and feed this to the NN to tune the weights to predict decent Q-values. So we could just do model.compile and model.fit, like any other NN with training data. I don’t see how this is different from any other NN training.
You can’t perform the soft update just once after fully training the Q-network (i.e. on a model that has already been fit); it needs to happen every C time steps. That’s why we use GradientTape() at each update to record the gradients and take one optimizer step on the Q-network, and then softly update the target Q-network. This approach is commonly known as a custom training loop.
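To make that concrete, here is a minimal sketch of the gradient-step part of such a custom training loop. It assumes q_network and target_q_network (two Keras models with the same architecture) and an Adam optimizer have already been created elsewhere; the way experiences is unpacked and the exact loss are illustrative, not necessarily identical to the lab’s code:

```python
import tensorflow as tf

@tf.function
def agent_learn(experiences, gamma):
    states, actions, rewards, next_states, done_vals = experiences

    with tf.GradientTape() as tape:
        # TD target: y = R + gamma * max_a' Q_target(S', a') for non-terminal S'
        max_next_q = tf.reduce_max(target_q_network(next_states), axis=-1)
        y_targets = rewards + gamma * max_next_q * (1.0 - done_vals)
        # Q(S, A) for the actions that were actually taken
        q_sa = tf.gather(q_network(states), tf.cast(actions, tf.int32),
                         batch_dims=1)
        loss = tf.reduce_mean(tf.square(y_targets - q_sa))

    # Partial derivatives of the loss w.r.t. every Q-network weight ...
    gradients = tape.gradient(loss, q_network.trainable_variables)
    # ... and one optimizer step on the Q-network only
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))
    # The soft update of the target Q-network would follow right here,
    # every C time steps (see below).
```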
I’m referring to the agent_learn(experiences, gamma) function. We already have data pertaining to the agent’s experiences. So why can’t we train it like we usually do? Data collection aside, once we have the data to train on, why should things be any different?
The model.fit() function does not include soft updates, and it doesn’t give us control over its training loop. That is why we use a custom agent_learn() function, in which the parameter update and the soft update are defined explicitly.
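For reference, the soft update itself is just a couple of lines that model.fit would never run on its own; a sketch, assuming the same two networks as above and an interpolation rate TAU (the exact value is a design choice):

```python
TAU = 1e-3  # soft-update rate; illustrative value

def update_target_network(q_network, target_q_network, tau=TAU):
    # w_target <- tau * w + (1 - tau) * w_target, applied weight by weight
    for w_target, w in zip(target_q_network.weights, q_network.weights):
        w_target.assign(tau * w + (1.0 - tau) * w_target)
```

Inside the custom loop this is simply called right after the optimizer step, which is exactly the kind of control model.fit does not give you.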
I think I get it now. So, every C time steps we do just one update of the Q-network weights, i.e. a single gradient descent step, followed by a soft update of the target Q-network. I see now why we can’t use model.fit.
As another related question: we have max_num_timesteps = 1000, so we take 1000 actions per episode (possibly fewer if we reach a terminal state before 1000 actions). Instead of updating the weights with one gradient descent step every C (4) time steps, why don’t we just train the Q-network every, say, 200 time steps using model.fit (on around 100 or 1000 data points from experience replay)? Then, using these newly learned weights, perform a soft update on the target Q-network (w⁻ = 0.99·w⁻ + 0.01·w). Is this just a matter of design choice? Was the custom training loop used because it leads to faster convergence and fewer oscillations / less instability than doing it the other way?
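Concretely, I’m imagining something roughly like this (just a sketch: states and y_targets would be built from the sampled tuples so that only the taken action’s Q-value is moved toward R + gamma * max_a' Q_target(S', a'), and the learning rate is arbitrary):

```python
# Hypothetical variant: a full model.fit pass on a replay batch,
# followed by the soft update with the coefficients above.
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
q_network.fit(states, y_targets, epochs=1, verbose=0)

for w_target, w in zip(target_q_network.weights, q_network.weights):
    w_target.assign(0.99 * w_target + 0.01 * w)
```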
Exactly, but it isn’t a plain gradient descent step; the Adam update rule is used instead.
It may depend, but in that scenario you would also be updating the target Q-network less frequently, since the soft update would only happen after a much larger number of time steps, and that could result in slower convergence.
Faster convergence and fewer oscillations come mainly from the optimization algorithm (Adam, in the lab) and the right choice of learning rate, whereas the custom training loop is there to define how the agent interacts with the environment and how the training process is carried out.