Reinforcement learning: How can you be sure the NN calculates the right thing?

Rickard · August 14, 2022, 8:07am

Hello everyone! First of all, big thanks for the course and all the help I’ve received.
I have now finished the videos for the 3rd course and will soon start the practice lab. However, there are some things in the algorithm I don’t understand.

So from what I understand:

You have the machine first take 10,000 random actions to get a training set.

You then train a neural network to calculate Q, and the neural network has to learn which action a from a state s maximizes Q? But does the machine then have to pair the different states with each other and find the optimal path? Is this done with backpropagation or something? How can it figure it out?

Also, in the video about the improved architecture (photo below), Andrew encourages using a model with 8 inputs, then 2 hidden layers of 64 units each, and 4 output units. I might have forgotten some important concepts from the previous courses, but how can you be sure that it actually outputs the actions? And why 64 units, if there are 8 inputs and 4 actions? Doesn’t 32 make more sense in the first hidden layer?

As I said, I might have forgotten important concepts, and it might also be that I need to learn more about the math and statistics behind the concepts, but right now I feel a bit confused. It feels like magic of some sort, even though I know that it’s all really logical.

I would really appreciate it if someone could explain it to me! There are probably more people asking themselves the same questions.
Once again, thanks for a great course!

rmwkwok · August 14, 2022, 8:41am

Hi Rickard, I will make my answer short.

We train a NN that learns/produces Q values for each action given the current state.
Then, we pick the action with max Q.
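
As a minimal sketch of "pick the action with max Q": the snippet below assumes the network has already produced one Q-value per action for the current state. The numbers and the name `greedy_action` are made up for illustration and are not from the course lab.

```python
def greedy_action(q_values):
    """Return the index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical network output for one state: one Q-value per action.
q_values = [0.2, 1.5, -0.3, 0.9]
action = greedy_action(q_values)  # index of the largest entry
```

Here `action` is 1, because 1.5 is the largest of the four values.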

No. The input of each learning sample is only one state. The network doesn’t learn the relationship between 2 states; it calculates the value of an action only by considering the current state.

The 4 neurons in the output layer are responsible for the Q-values of the 4 actions respectively. The 4 neurons carry their meanings because the neural network’s loss can be minimized only when they produce the right Q-values. You can think of this as “the optimization process gives the 4 neurons their meanings”.
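
To make "the loss gives the neurons their meaning" a bit more concrete, here is a rough sketch of the loss for a single experience, assuming the usual Q-learning target y = R + γ · max Q(s′, a′). The discount factor value, the function names, and all the numbers are assumptions for illustration. The error is measured only on the output neuron of the action that was actually taken, so the loss can only get small if that neuron's output tracks the Q-value of that action.

```python
GAMMA = 0.995  # discount factor (an assumed value)

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Target for one experience: the reward plus the discounted best next Q-value."""
    if done:  # terminal state: there is no future return to add
        return reward
    return reward + gamma * max(next_q_values)

def squared_error(q_values, action, target):
    """Error measured only on the output neuron of the action that was taken."""
    return (q_values[action] - target) ** 2

# Hypothetical numbers: network outputs for the current state and the next state.
q_pred = [0.2, 1.5, -0.3, 0.9]
q_next = [0.0, 2.0, 1.0, 0.5]

y = td_target(reward=1.0, next_q_values=q_next, done=False)
loss = squared_error(q_pred, action=1, target=y)
```

With these numbers the target is about 2.99 while the network predicted 1.5 for action 1, so training would push that output neuron upward.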

This requires experiments to verify. We can try many more options than just 32 and 64. We need to run the experiments and compare the performance of the trained neural networks before we can tell which one is better. The 64 there is just an example.
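
As a rough sketch of why the hidden size is a free choice: the code below builds an 8-input, 4-output network with two hidden layers of a chosen width, using only the Python standard library. The ReLU hidden activations, the linear output layer, the random initialization, and all function names are assumptions for illustration, not the lab's code. Either width gives 4 outputs; only the number of trainable parameters changes.

```python
import random

def init_layer(n_in, n_out, rng):
    """One dense layer: n_out weight rows of length n_in, plus n_out biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def build_q_network(hidden_units, n_inputs=8, n_actions=4, seed=0):
    """Two hidden layers of equal width, then one output per action."""
    rng = random.Random(seed)
    sizes = [n_inputs, hidden_units, hidden_units, n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(network, state):
    """Map one state to one Q-value per action."""
    x = list(state)
    for i, (weights, biases) in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(network) - 1:  # ReLU on hidden layers, linear output layer
            x = [max(0.0, v) for v in x]
    return x

def count_parameters(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)

for hidden in (32, 64):
    net = build_q_network(hidden)
    q = forward(net, [0.0] * 8)
    print(hidden, len(q), count_parameters(net))
```

Both networks return 4 values per state. The 32-unit version has 1476 parameters and the 64-unit version has 4996, and which one learns better can only be found by training both and comparing.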

Raymond
