Hello everyone! First of all, big thanks for the course and all the help I’ve received.

I have now finished the videos for the 3rd course and will soon start the practice lab. However, there are some things in the algorithm I don’t understand.

So from what I understand:

You first have the machine perform 10,000 random actions to collect a training set.

You then train a neural network to estimate Q, and the network has to learn which action **a** in a state **s** maximizes Q? But then the machine has to connect the different states with each other and find the optimal path? Is this done with backpropagation or something? How can it figure that out?
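To make my question concrete, here is how I currently picture one training step, as a toy NumPy sketch (the random "environment" data, the network sizes, and the learning rate are all my own assumptions, not the lab's actual code). The network is pushed toward the Bellman targets y = r + γ·max Q(s′, a′), and backpropagation is what adjusts the weights:

```python
import numpy as np

rng = np.random.default_rng(42)
STATE_DIM, N_ACTIONS, HIDDEN, GAMMA, LR = 8, 4, 64, 0.995, 1e-2

# 1) Replay buffer filled by random actions (toy random data here).
N = 256
states = rng.normal(size=(N, STATE_DIM))
actions = rng.integers(0, N_ACTIONS, size=N)
rewards = rng.normal(size=N)
next_states = rng.normal(size=(N, STATE_DIM))

# 2) A tiny Q-network: 8 inputs -> 64 ReLU units -> 4 outputs (one per action).
W1 = rng.normal(size=(STATE_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, N_ACTIONS)) * 0.1
b2 = np.zeros(N_ACTIONS)

def forward(s):
    h = np.maximum(0.0, s @ W1 + b1)   # hidden ReLU activations
    return h, h @ W2 + b2              # Q(s, a) for all 4 actions

# 3) Bellman targets y = r + gamma * max_a' Q(s', a'), held fixed here
#    (like the frozen "target network" from the lectures).
_, q_next = forward(next_states)
y = rewards + GAMMA * q_next.max(axis=1)

# 4) Gradient descent on the squared error -- this is where
#    backpropagation does the "figuring out".
losses = []
for _ in range(300):
    h, q = forward(states)
    q_a = q[np.arange(N), actions]               # Q of the action actually taken
    losses.append(float(np.mean((q_a - y) ** 2)))
    dq = np.zeros_like(q)
    dq[np.arange(N), actions] = 2.0 * (q_a - y) / N
    dW2, db2 = h.T @ dq, dq.sum(axis=0)
    dh = (dq @ W2.T) * (h > 0)                   # backprop through ReLU
    dW1, db1 = states.T @ dh, dh.sum(axis=0)
    W1 -= LR * dW1; b1 -= LR * db1
    W2 -= LR * dW2; b2 -= LR * db2

print(losses[0], losses[-1])  # loss shrinks as Q moves toward the targets
```

If I understand correctly, the "pairing of states" happens only through these targets: the value of a state is pulled toward the reward plus the value of the state it leads to, so information about good outcomes propagates backwards through connected states over many iterations.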

Also, in the video about the improved architecture (photo below), Andrew suggests a model with 8 inputs, two hidden layers of 64 units each, and 4 output units. I might have forgotten some important concepts from the previous courses, but how can you be sure that it actually outputs the actions? And why 64 units, if there are 8 inputs and 4 actions? Wouldn’t 32 make more sense in the first hidden layer?

As I said, I might have forgotten important concepts, and it might also be that I need to learn more about the math and statistics behind them, but right now I feel a bit confused. It feels like magic of some sort, even though I know it’s all really logical.

I would really appreciate it if someone could explain this to me! There are probably more people asking themselves the same questions.

Once again, thanks for a great course!