Hi, in DQN we first initialize the parameters randomly to guess Q(s, a).
Then we use that guess as the label Y to train Q_new on.
Since we initialize Q randomly at first, it is quite possible that the Y we get is completely wrong, so
aren't we training on a wrong dataset?
Hello @Radouane_BEY_OMAR,
In short, besides the output of the DQN, which is a random guess initially, we also have true input from the environment: the Reward. Note that in this week's assignment we have the following equation for the training target:

$$y = R + \gamma \max_{a'} \hat{Q}(s', a')$$

(on a terminal step, $y$ is just $R$, since there is no next state to bootstrap from).
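To make that concrete, here is a minimal NumPy sketch of how such targets could be computed for a mini-batch of transitions. The names (`compute_targets`, `next_q_values`, `done_flags`) and the toy numbers are my own for illustration, not taken from the assignment:

```python
import numpy as np

def compute_targets(rewards, next_q_values, done_flags, gamma=0.995):
    """Bellman targets: y = R + gamma * max_a' Q_hat(s', a'),
    with the bootstrap term dropped on terminal steps."""
    max_next_q = np.max(next_q_values, axis=1)        # max over actions a'
    return rewards + gamma * (1.0 - done_flags) * max_next_q

# Toy mini-batch: 3 transitions, 4 possible actions
rewards = np.array([1.0, 0.0, -1.0])
next_q_values = np.random.randn(3, 4)                 # Q_hat(s', a') from the target network
done_flags = np.array([0.0, 0.0, 1.0])                # last transition ends the episode

y = compute_targets(rewards, next_q_values, done_flags)
print(y)
```

The key point is visible in the formula: even if `next_q_values` comes from a randomly initialized network, `rewards` is ground truth from the environment, so every target contains some correct information.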
The overall idea is that we start with some random Q Network (QN) and random Target Q Network (TQN), both incorrect at first, but through the continuous injection of correct Reward information from the environment, the hope is that both the QN and the TQN become useful, if not exactly the truth.
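One way this "continuous injection" is kept stable is by moving the TQN toward the QN only gradually, so the targets $y$ drift slowly rather than jumping every step. Here is a minimal sketch of such a soft-update step; the helper name `soft_update`, the blending factor `tau`, and the toy weights are my own assumptions for illustration:

```python
import numpy as np

def soft_update(target_weights, q_weights, tau=1e-3):
    """Blend a small fraction of the Q Network's weights into the
    Target Q Network (Polyak averaging)."""
    return [tau * qw + (1.0 - tau) * tw
            for tw, qw in zip(target_weights, q_weights)]

# Toy example with one weight matrix per "network"
q_net = [np.ones((2, 2))]
target_net = [np.zeros((2, 2))]

for _ in range(5):
    target_net = soft_update(target_net, q_net)
print(target_net[0])  # creeps slowly toward the Q Network's weights
```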
It might be difficult to accept this because we have always been told that supervised learning relies on correct labels, but this is reinforcement learning, and it is different: we do not have any labels in advance, only signals (Rewards) from the environment that we learn from as we go.
This week's assignment is a very good starting point for building up some confidence in this approach. You will see for yourself how a lunar lander that starts from random networks eventually learns to land properly.
Happy New Year, and cheers,
Raymond