Reinforcement learning: How can you be sure the NN calculates the right thing?

Hello everyone! First of all, big thanks for the course and all the help I’ve received.
I have now finished the videos for the 3rd course and will soon start the practice lab. However, there are some things in the algorithm I don’t understand.

So from what I understand:

You have the machine first do 10,000 random actions to get a training set.

You then train a neural network to calculate Q, and the neural network has to learn which action a from a state s maximizes Q? But does the machine then have to pair the different states with each other and find the optimal path? Is this done with backpropagation or something? How can it figure it out?

Also, in the video about the improved architecture (photo below), Andrew encourages using a model with 8 inputs, then 64 units in each of 2 hidden layers, and 4 output units. I might have forgotten some important concepts from the previous courses, but how can you be sure that it actually outputs the actions? And why 64 units, if there are 8 inputs and 4 actions? Doesn’t 32 make more sense in the first hidden layer?

As I said, I might have forgotten important concepts, and it might also be that I need to learn more about the math and statistics behind the concepts, but right now I feel a bit confused. I feel like it’s magic of some sort, even though I know that it’s all really logical.

I would really appreciate it if someone could explain it to me! There are probably more people asking themselves the same questions.
Once again, thanks for a great course!

Hi Rickard, I will make my answer short.

We train a NN that learns/produces Q values for each action given the current state.
Then, we pick the action with max Q.
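
To make those two steps concrete, here is a minimal sketch in plain Python. It assumes the layer sizes mentioned in the question (8 inputs, two hidden layers of 64 units, 4 outputs) and uses random, untrained weights, so the Q values it prints are meaningless; the helper names (`init_layer`, `dense`, `q_values`, `greedy_action`) are made up for illustration and are not from the course lab. It only shows the mechanics: one state goes in, 4 numbers come out, and we take the index of the largest one.

```python
import math
import random

# Assumed sizes from the question: 8 state inputs, two hidden layers of
# 64 units each, and 4 outputs (one Q value per action).
LAYER_SIZES = [8, 64, 64, 4]


def init_layer(n_in, n_out, rng):
    """Create random weights and zero biases for one dense layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one dense layer to input vector x, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, layers):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    x = state
    for i, layer in enumerate(layers):
        x = dense(x, layer, relu=(i < len(layers) - 1))
    return x


def greedy_action(q):
    """Return the index of the action with the largest Q value."""
    return max(range(len(q)), key=lambda a: q[a])


rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
state = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # a made-up state vector

q = q_values(state, layers)
action = greedy_action(q)
print("Q values:", q)
print("chosen action:", action)
```

Note that the network never outputs an action directly. It outputs 4 Q values, and picking the action is a separate argmax step done outside the network.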

No. Each learning sample has only one state. It doesn’t learn the relationship between 2 states. The neural network calculates the value of an action only by considering the current state.

The 4 neurons in the output layer are responsible for the Q-values of the 4 actions respectively. The 4 neurons carry their meanings because the neural network’s loss can be minimized only when the 4 neurons produce the right Q values. You can think of this as “the optimization process gives the 4 neurons their meanings.”
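
Here is a toy sketch of that idea, under strong simplifying assumptions. It treats the 4 outputs as free numbers instead of the outputs of a real network, and it takes the training target for the chosen action as a given value (how that target is built is not shown here). The names `squared_error` and `gradient_step` are hypothetical. The point is only that the loss compares output number `action` with the target for that action, so gradient descent pushes exactly that output toward the right value.

```python
def squared_error(q, action, target):
    """Loss for one sample: only the output of the action taken is compared."""
    return (q[action] - target) ** 2


def gradient_step(q, action, target, lr):
    """One gradient-descent step, treating the outputs as free parameters.

    This is a simplification: in a real network the outputs share weights,
    so the other outputs would also move a little.
    """
    grad = 2.0 * (q[action] - target)  # derivative of the squared error
    updated = list(q)
    updated[action] -= lr * grad
    return updated


q = [0.0, 0.0, 0.0, 0.0]   # start with uninformative outputs
action, target = 2, 5.0    # made-up sample: action 2 was taken, target is 5.0

initial_loss = squared_error(q, action, target)
for _ in range(50):
    q = gradient_step(q, action, target, lr=0.1)
final_loss = squared_error(q, action, target)

print("outputs after training:", q)
print("loss before:", initial_loss, "loss after:", final_loss)
```

After the loop, output 2 sits very close to 5.0 while the other outputs are untouched. Nothing labels output 2 as "the Q value of action 2" in advance; it ends up with that meaning because that is the only way to make the loss small.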

This requires experiments to verify. We could try many more sizes than just 32 and 64, and we need to run experiments and compare the performance of the trained neural networks before we can tell which one is better. The 64 there is just an example.