Ok, that helps me a lot. I do not really care about the consecutiveness; I just needed to understand whether it is important or not.

Let me explain my understanding of the process and you tell me if I am mistaken at some point, ok?

So we have a model Q, which we initialized randomly. I assume there are some restrictions, given that we know something about the lunar lander environment, but basically it's random.

Now we create a bunch of training examples x, each consisting of a random state and a random action.

We then calculate Q(s,a) based on that model and a given training example.
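
To make these first steps concrete, here is a minimal sketch of how I picture them. It assumes a hypothetical linear model instead of the real neural network, and it assumes 8 state values and 4 actions for the lunar lander; the names `init_q`, `q_values` and `q_value` are made up for illustration.

```python
import random

STATE_DIM = 8    # assumption: size of the lunar lander state vector
NUM_ACTIONS = 4  # assumption: number of discrete actions

def init_q(seed=0):
    """Randomly initialize a hypothetical linear Q model: one weight row per action."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(STATE_DIM)]
            for _ in range(NUM_ACTIONS)]

def q_values(weights, state):
    """Q(s, a) for every action a: dot product of each weight row with the state."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def q_value(weights, state, action):
    """Q(s, a) for one specific action."""
    return q_values(weights, state)[action]

weights = init_q()
state = [0.0, 1.4, 0.1, -0.3, 0.0, 0.0, 0.0, 0.0]  # made-up example state s
action = 2                                          # made-up example action a
print(q_value(weights, state, action))              # some arbitrary value, since Q is random
```

At this point the printed number carries no information about the environment; it only reflects the random weights.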

Now we have learned some actual information from the “real world”, which is R(s), an actual reward that we got from being in state s. This real-world information is then combined with gamma * max(a’) Q(s’, a’) to calculate y.
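
As a sketch of how I understand the calculation of y: the function name `td_target`, the default `gamma=0.995` and the `done` flag are my own assumptions. The terminal case is not part of my description above; as far as I know, it is the usual convention to use only the reward when the episode has ended.

```python
def td_target(reward, next_q_values, gamma=0.995, done=False):
    """Compute y = R(s) + gamma * max over a' of Q(s', a').

    reward        -- R(s), the reward actually observed in the environment
    next_q_values -- the model's outputs Q(s', a') for every action a'
    gamma         -- discount factor (assumed value)
    done          -- assumption: if the episode ended, there is no next state to value
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Made-up numbers: observed reward 1.0, and Q(s', .) as returned by the current model.
y = td_target(1.0, [0.0, 2.0, -1.0, 0.5], gamma=0.5)
print(y)
```

Only `reward` comes from the environment here; `next_q_values` is still whatever the current model predicts for the real next state s’.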

As far as I understand, the second part of that sum is completely made up because it depends on the random model Q, except for the fact that we use the actual state s’ as input, which we reached by taking action a in state s.

Now that we have calculated all this, we have something to learn from, because there is a difference between Q(s,a) and R(s) + gamma * max(a’) Q(s’, a’).

The reason for that difference is that if we had simply put x, i.e. (s,a), into Q, Q(s,a) would have assumed some value for R(s) that is totally random, based on its initialization. But if we actually take action a, we get a value R(s) from the real world that is most likely different from the value R(s) that Q would have assumed on its own.

This difference is our information gain, on which the whole learning process builds, because we can now compare y, the output of the model including the true value R(s), with yhat, the output of Q without knowledge of the true R(s).
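
A minimal sketch of that comparison, assuming the difference is measured as a mean squared error over a batch of examples (the function name `mse_loss` and the numbers are made up for illustration):

```python
def mse_loss(targets, predictions):
    """Mean squared difference between y (targets) and yhat (predictions)."""
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(targets, predictions)) / len(targets)

# Made-up batch of two examples.
targets = [2.0, 0.0]      # y: built from the real reward plus the discounted estimate
predictions = [1.0, 0.0]  # yhat: Q(s, a) from the current model
print(mse_loss(targets, predictions))
```

Training would then adjust the model so that this number gets smaller.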

Am I mistaken somewhere?