Confusion on Target Variable in Deep Reinforcement Learning

Hi Michael @mosofsky

First, the quote above states the purpose of having the target network - to avoid instabilities. We can think about it this way: this time, we train our Q-Network with a sample set A covering certain states and actions. Such learning does not affect just those states and actions but all of them, because every weight in the NN gets updated. Next time, we train the NN with a sample set B covering different states and actions, and for the same reason, learning about the states and actions in B can completely change the NN's original behavior on the states and actions in A. If the change is too dramatic, that is a bad thing - this is the kind of instability we want to avoid.

To achieve that, we need to introduce something reluctant to change into our system design - the Target Q-Network. The Q-Network learns, and the Target Q-Network also learns, only the latter learns in an indirect way:

$$w^{\text{target}} \leftarrow \tau \, w^{Q} + (1 - \tau) \, w^{\text{target}}$$

It updates its weights by combining its own weights with the updated weights of the Q-Network, and the ratio of this combination is controlled by a parameter \tau which is usually very small, so that each time it retains most of itself and takes only a tiny bit from the Q-Network. This makes the Target Q-Network reluctant to change.
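
In code, this soft update might look something like the sketch below. This is only a minimal sketch assuming TensorFlow/Keras models - the function and variable names here are illustrative, not necessarily the assignment's exact code:

```python
import tensorflow as tf

TAU = 1e-3  # illustrative soft-update rate; usually very small

def soft_update(q_network: tf.keras.Model,
                target_q_network: tf.keras.Model,
                tau: float = TAU) -> None:
    """Blend a tiny fraction of the Q-Network's weights into the target."""
    for target_w, q_w in zip(target_q_network.weights, q_network.weights):
        # Keep (1 - tau) of the target's own weight, take tau from the Q-Network
        target_w.assign(tau * q_w + (1.0 - tau) * target_w)
```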

With this reluctance, the learning can become more stable and better preserve what’s been learnt.

To see for yourself, or to get some first-hand experience of how useful the Target Q-Network is, you can experiment with different values of \tau. If you set it to 1, then the Target Q-Network is effectively equal to the Q-Network - or you may say we are effectively abandoning having a separate, different Target Q-Network. So you can see how the learning will work out without the presence of a reluctant-to-change Target Q-Network. To do this (a concrete example follows the steps):

1. Open the assignment, but don't run any code yet.
2. Click "File" > "Open" and open "util.py".
3. Check out the update_target_network function - you will see the update formula there.
4. Adjust the \tau value to something very different (between 0 and 1, inclusive), and save your change.
5. Go back to the notebook and run the code.

Each time you change the \tau value, you need to restart the kernel of your notebook once and run the code from the top of the notebook.
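
For instance, if the update rate is defined as a constant like in my earlier sketch (again, the exact name in the file may differ), the whole experiment is just a matter of changing that one value:

```python
# Values of tau to try (edit, save, restart the kernel, rerun from the top):
TAU = 1e-3   # typical: the target changes very slowly
# TAU = 0.5  # the target jumps half-way to the Q-Network each update
# TAU = 1.0  # the target becomes an exact copy - no separate target in effect

# With TAU = 1.0 the soft update collapses to a hard copy:
#   tau * q_w + (1 - tau) * target_w  ==  q_w
```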

I also strongly suggest you try this out yourself: besides comparing the difference you mentioned, also see what the learning looks like at different values of \tau. It is a very good way to learn - form an expectation first, see the effect, and update your understanding when needed.

If you do try, I look forward to your sharing :wink:

Raymond
