How to interpret the quiver plot in the optional lab?

Hi @himl,

The size here refers to the length of the arrows.

The color is related to the length of the arrows as well, so the length and the color tell the same thing: the magnitude of the gradient. We can certainly remove the coloring without losing information.

  1. Each arrow represents \frac{\partial{J}}{\partial{w}} and \frac{\partial{J}}{\partial{b}} at that location of (w, b).
  2. The “width” (the horizontal projection) of the arrow is the magnitude of \frac{\partial{J}}{\partial{w}}, whereas the “height” (the vertical projection) is the magnitude of \frac{\partial{J}}{\partial{b}}. The larger the magnitude, the steeper the cost surface at that point.
  3. Together, the “width” and the “height” define the arrow’s direction, e.g. a long “width” plus a short “height” gives a rather horizontal arrow.
  4. Dividing the “height” by the “width” gives the slope of the arrow, i.e. the ratio of the two gradients. (A minimal plotting sketch follows this list.)
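To make the picture concrete, here is a minimal sketch of how such a quiver plot can be produced. The single-feature model f(x) = wx + b, the mean-squared-error cost, and the tiny data set (x = [1, 2], y = [300, 500]) are my assumptions standing in for the lab’s setup, not the lab’s actual code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the lab's training data.
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def gradients(w, b):
    """dJ/dw and dJ/db for the MSE cost J(w, b) = (1/2m) * sum((w*x + b - y)^2)."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

# One arrow will be drawn at each (w, b) grid point.
W, B = np.meshgrid(np.linspace(0, 400, 20), np.linspace(-200, 200, 20))
dJdW = np.zeros_like(W)
dJdB = np.zeros_like(B)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dJdW[i, j], dJdB[i, j] = gradients(W[i, j], B[i, j])

# Arrow "width" = dJ/dw, arrow "height" = dJ/db; the color encodes the
# magnitude, which is why length and color carry the same information.
magnitude = np.hypot(dJdW, dJdB)
plt.quiver(W, B, dJdW, dJdB, magnitude)
plt.xlabel("w")
plt.ylabel("b")
plt.title("Gradient of J(w, b) at each (w, b)")
plt.show()
```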

Why do we do it, and what does it tell us?

To understand how the gradient goes downward, i.e. how we move from a high gradient (long arrows) to a low gradient (short arrows), which is what gradient descent does. You will see this if we add a few things to the plot, like this:

  1. I used rectangular boxes to represent the region of the smallest gradient values (shortest arrows); this region is also the goal of the gradient descent process.
  2. Red circles mark different initial parameter values for w and b (we usually initialize them randomly, so any pair of values is possible).
  3. In gradient descent, we update w by w := w - \alpha\frac{\partial{J}}{\partial{w}} and b by b := b - \alpha\frac{\partial{J}}{\partial{b}}. So, when \alpha=1, each update decreases w by \frac{\partial{J}}{\partial{w}}, which is the “width” of the arrow at that location, and similarly for b. In other words, each update step moves one arrow closer to the smallest region (see the sketch after this list).
  4. Note that w and b move in the direction opposite to the arrows, because of the minus sign in the update formulas (w := w - \alpha\frac{\partial{J}}{\partial{w}}, b := b - \alpha\frac{\partial{J}}{\partial{b}}): when the gradient is negative, the parameter increases!
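Here is a minimal sketch of that update rule, reusing the same made-up data and MSE cost as in the earlier quiver sketch (the data and the `gradients` helper are assumptions, not the lab’s code):

```python
import numpy as np

# Same hypothetical data and cost as in the quiver sketch above.
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def gradients(w, b):
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)   # dJ/dw, dJ/db

def gradient_descent_step(w, b, alpha):
    dj_dw, dj_db = gradients(w, b)
    # Minus sign: the parameters move against the arrow's direction.
    return w - alpha * dj_dw, b - alpha * dj_db

# A few updates from one hypothetical "red circle" starting point.
# (alpha = 1 above is only the "one arrow per step" analogy; a small
# value is used here so the iterates do not overshoot.)
w, b = 0.0, 0.0
for _ in range(5):
    w, b = gradient_descent_step(w, b, alpha=1e-2)
print(w, b)
```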

One last note:

In my example, will gradient descent end up at two very different final values of w and b because of the two different initial values?

Not really. If you look at my following graph, which intentionally sets all the arrow magnitudes to the same value in order to show only the direction, you can see that the arrows start to turn around (I have circled two of them) inside the smallest region:

Those arrows, reversed, finally point to the green dot, which is the optimal w and b; this is also what you will see in the lab after training the model: (w \approx 200, b \approx 100).

Therefore, in story mode: no matter where we start w and b in the example of the lab, gradient descent moves one arrow at a time (assuming \alpha=1) closer to the smallest region, until the arrows start to turn around towards the optimal solution, and finally reaches it!
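If you want to see this numerically, here is a quick check of the “same destination” behaviour, again with my made-up data x = [1, 2], y = [300, 500] (for which the exact optimum happens to be w = 200, b = 100): two very different starting points end up at essentially the same place.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def gradients(w, b):
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)   # dJ/dw, dJ/db

def run_gradient_descent(w, b, alpha=1e-2, steps=10_000):
    for _ in range(steps):
        dj_dw, dj_db = gradients(w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

# Two very different hypothetical initializations (the "red circles").
for w0, b0 in [(0.0, 0.0), (400.0, -200.0)]:
    print((w0, b0), "->", run_gradient_descent(w0, b0))
# Both runs land near (200, 100), i.e. the green dot.
```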

Remember that this story is only for the example in the lab; most models will not bring us to the same optimal solution regardless of the initial parameters. However, the analogy of gradient descent moving one arrow (when \alpha=1) at each step is always valid.

Cheers!
