On the topic of reinforcement learning, when talking about the need to **avoid changing** our target output y on **every iteration** or too **aggressively**, Andrew said this is necessary to avoid ‘**instability**’ when training our Q neural network.

I haven’t read about what instability means. If someone could give me some insight or reading material, I would appreciate it.

I will assume that instability means our neural network will be slow to converge because we are ‘moving the goal posts’ too often.

If we imagine our neural network represented as a point in a state space where every weight and bias is a different dimension, gradient descent updates would move our random neural network towards the point that represents the neural network that best minimizes the cost function calculated with a set of labels y.

When we use an estimated, imperfect y, we get as a result an estimated, imperfect cost function.
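To write down what I mean, assuming the usual Q-learning setup (the symbols $\gamma$, $w$, and $w^-$ are my own notation and may not match the lecture's), the estimated target and the cost built from it would be roughly:

$$
y^{(i)} = R^{(i)} + \gamma \max_{a'} Q\left(s'^{(i)}, a'; w^-\right),
\qquad
J(w) = \frac{1}{m} \sum_{i=1}^{m} \left( Q\left(s^{(i)}, a^{(i)}; w\right) - y^{(i)} \right)^2
$$

Here $w$ are the weights being trained and $w^-$ are the weights used to compute y. If y is refreshed on every iteration, then $w^- = w$, so the target changes with every gradient step.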

Applying gradient descent to this estimated, imperfect cost function moves our random neural network towards another neural network in the state space that would best reduce the estimated cost function. But this is not the best neural network we are looking for.

The best neural network we are looking for is the one that would minimize the cost function calculated from the real labels y that we don’t have.

But even though the neural network is not moving directly towards the best neural network, it’s moving towards one that is close to the best neural network. So we would be making progress.

Improving our estimate of y on every iteration, and improving it more aggressively, would just make our estimated cost function closer to the real one (the one defined by the correct y labels that we don’t have).

Having a better estimate of the cost function would make gradient descent move our neural network towards a better neural network than before (one that minimizes our better estimated cost function), one that’s closer to the best neural network (the best neural network is one that minimizes the cost function defined by the correct y labels).

So updating y fully and often would mean better progress, more frequently.

So why wouldn’t we want that? Why would it generate instability?

It would be like walking toward a location and getting a better and better compass each time.

Yes, we would wiggle a little more, but the direction we would be readjusting to would always be better.

If we delay a correction to our trajectory we would only spend more time walking in a worse direction. If we choose a softer correction, we would be improving a little when we could be improving a lot.
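To make the ‘full’ versus ‘softer’ correction concrete, here is a minimal sketch of how I picture the two options, assuming the weights used to compute y are kept as a separate copy. The function name `update_target`, the parameter `tau`, and the numbers are my own illustration, not something taken from the lecture.

```python
def update_target(target_w, online_w, tau):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_w, online_w)]


online_w = [1.0, -2.0]  # weights being trained (made-up numbers)
target_w = [0.0, 0.0]   # weights used to compute y (made-up numbers)

# Aggressive correction: tau = 1 copies the trained weights entirely.
full = update_target(target_w, online_w, tau=1.0)

# Softer correction: tau = 0.01 moves only 1% of the way.
soft = update_target(target_w, online_w, tau=0.01)

print(full)
print(soft)
```

With `tau=1.0` the y estimate jumps all the way to the newest weights, while with a small `tau` it only creeps toward them, which is the ‘improving a little when we could be improving a lot’ case.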

Every improved y is guaranteed to generate a cost function → gradient that will point our current neural network in the direction of a neural network that is closer to the best one.

So by updating y on every step the neural network that we would be moving to would be changing a lot, and more frequently, but it would be always located inside a ‘hypersphere’ around the best neural network, with its radius getting smaller and smaller after each improvement.

Maybe a possible reason for instability could be that improving the direction that the neural network is moving in too aggressively could lead to an overstep, where we end up missing the local minima.

Because a much better target neural network would be located on a point that is closer to the best neural network, but it would also be much further away from our current neural network.

The much better target neural network could be located ‘behind’ the best neural network in state space, in such a way that when we try to approach it with too big a step, we would end up missing the best neural network and having to come back.

If this happened many times, we would be missing the best neural network many times, and would converge slowly.
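As a toy picture of the overstepping I have in mind, here is a sketch of plain gradient descent on the one-dimensional cost f(w) = w², whose minimum is at w = 0. This is only an analogy under my own assumptions (a fixed cost and made-up learning rates), not the actual Q-network training.

```python
def gradient_descent(w, learning_rate, steps):
    """Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0."""
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w  # one gradient step
        path.append(w)
    return path


small = gradient_descent(1.0, learning_rate=0.25, steps=4)
big = gradient_descent(1.0, learning_rate=0.875, steps=4)

print(small)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(big)    # [1.0, -0.75, 0.5625, -0.421875, 0.31640625]
```

With the bigger step, w jumps to the other side of the minimum on every update and ends up further from 0 after the same number of steps, which is the ‘missing it many times and converging slowly’ behaviour I am describing.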

But I don’t see how that could happen if we would estimate a new y on every iteration.

By estimating a new improved y on every iteration we would not have the ‘time’ to miss the best neural network, because after every step we would get a better direction.

So probably before we could miss the best neural network by walking across it, we would have already got a better direction to move to, and would correct our path.

Also, if we end up overstepping when taking only a single step, we would surely overstep when taking more than one step. So how would taking more steps before improving y help?

If this question is too abstract, please ask clarifying questions. Initially I was going to write a lot more and include some pictures, but I gave up because I didn’t think it was necessary.

Thanks.

Douglas