Algorithm refinement: ϵ-greedy policy

I don’t understand this part: if Q(s, a) is randomly initialized and the neural network parameters happen to be initialized so that Q(s, main) is always low, how can firing the main thruster still sometimes be a good idea?
Can someone explain this more concretely?

Hello @yusufnzm,

I assume you are asking about the part around 2:00 of the video. (Next time, please share the timestamp.)

Let’s read through the transcript of that part again:

First, the excerpt below tells us that we will MAKE UP an example to explain why we sometimes want to explore instead of exploit:

Most of the time we try to pick a good action using our current guess of Q(s,a). But the small fraction of the time, let’s say, five percent of the time, we’ll pick an action a randomly. Why do we want to occasionally pick an action randomly?
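The selection rule described above can be sketched in a few lines of Python. This is just my own illustration (the function name and the default of 5% are assumptions, not from the course code):

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.05):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated Q (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))               # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

So with `epsilon=0.05`, roughly 95% of steps are greedy and the other 5% are random exploration steps.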

Second, Andrew told us that it is a made-up neural network that behaves very strangely. I think we want such behavior as an example so that we can continue the discussion without considering too many other possibilities. In this way, we can focus on what benefit the ϵ-greedy algorithm can bring us.

Well, here’s why. Suppose there’s some strange reason that Q(s,a) was initialized randomly so that the learning algorithm thinks that firing the main thruster is never a good idea. Maybe the neural network parameters were initialized so that Q(s, main) is always very low.

Third, the downside of always being greedy (always exploiting):

If that’s the case, then the neural network, because it’s trying to pick the action a that maximizes Q(s,a), it will never ever try firing the main thruster. Because it never ever tries firing the main thruster, it will never learn that firing the main thruster is actually sometimes a good idea. Because of the random initialization, if the neural network somehow initially gets stuck in this mind that some things are bad idea, just by chance, then option 1, it means that it will never try out those actions and discover that maybe is actually a good idea to take that action, like fire the main thrusters sometimes.
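A toy bandit-style simulation (my own made-up numbers, not the lunar lander) shows how a purely greedy policy gets stuck. Here action 0 plays the role of "main": its true reward is actually the best, but its Q estimate starts very low, so greedy selection never revisits it:

```python
import random

random.seed(0)

# Made-up setup: action 0 ("main") actually pays the most,
# but its Q estimate is initialized to a very low value.
true_reward = [1.0, 0.1, 0.1, 0.1]
q = [-9999999.0, 0.0, 0.0, 0.0]
alpha = 0.1  # learning rate

def step(epsilon):
    if random.random() < epsilon:
        a = random.randrange(len(q))                    # explore
    else:
        a = max(range(len(q)), key=lambda i: q[i])      # exploit
    q[a] += alpha * (true_reward[a] - q[a])             # simple Q update
    return a

counts = [0] * len(q)
for _ in range(1000):
    counts[step(epsilon=0.0)] += 1                      # always greedy

print(counts[0], q[0])  # action 0 is never tried; its Q never recovers
```

With `epsilon=0.0` the policy only ever picks among actions 1–3, so it never collects the evidence that would correct the bad estimate for action 0.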

Last, the upside of exploring sometimes: you get a chance to try something that the neural network had thought was always bad but turns out to be good:

Under option 2 on every step, we have some small probability of trying out different actions so that the neural network can learn to overcome its own possible preconceptions about what might be a bad idea that turns out not to be the case. This idea of picking actions randomly is sometimes called an exploration step. Because we’re going to try out something that may not be the best idea, but we’re going to just try out some action in some circumstances, explore and learn more about an action in the circumstance where we may not have had as much experience before.
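Running the same made-up bandit with a small exploration probability shows the recovery. The occasional random picks eventually pull the low Q estimate for action 0 back up, and from then on the greedy steps start choosing it on their own (again, the numbers here are my own illustration):

```python
import random

random.seed(0)

# Same made-up setup: action 0 ("main") is actually best,
# but its Q estimate starts extremely low.
true_reward = [1.0, 0.1, 0.1, 0.1]
q = [-9999999.0, 0.0, 0.0, 0.0]
alpha = 0.1

counts = [0] * len(q)
for _ in range(50000):
    if random.random() < 0.05:                      # exploration step
        a = random.randrange(len(q))
    else:                                           # greedy step
        a = max(range(len(q)), key=lambda i: q[i])
    q[a] += alpha * (true_reward[a] - q[a])
    counts[a] += 1

# Exploration lets action 0 be tried at all; once its estimate
# recovers, greedy selection takes over and picks it repeatedly.
print(counts[0], round(q[0], 3))
```

Each exploration step picks action 0 with probability 0.05 × 1/4, which is rare but enough, over many steps, to overcome the network's "preconception" about that action.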


It is just an example that keeps the discussion focused. If you randomly initialize the neural network, of course you can’t guarantee Q(s, main) is always low, but if you really wanted to make that happen, you could set a bias in the output layer to -9999999 so that Q(s, main) is always very low. However, we don’t mean to do this in a real model; it is just for the sake of the example.
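For concreteness, here is one way that "-9999999 bias" trick could look with a tiny linear "Q-network" in NumPy. The architecture, sizes, and numbers are all my own assumptions, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(8, 4))   # 8 state features -> Q values for 4 actions
b = np.zeros(4)
b[0] = -9999999.0             # force Q(s, main) (action 0) to be very low

def q_values(state):
    # Linear "network": a single output layer, no hidden units.
    return state @ W + b

s = rng.normal(size=8)        # a random state
print(np.argmax(q_values(s))) # the greedy action is never action 0
```

No matter what state you feed in, the huge negative bias dominates, so a greedy policy over these Q values can never select action 0.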

Whether firing the main thruster is a good idea depends on the environment, but I don’t see why it would always be a bad idea, and if it is not always bad, then it can sometimes be good.

Therefore, it is just an example to help us understand what the ϵ-greedy algorithm can bring us, so please don’t take the example itself too seriously.