Could I get your help in interpreting the quiver plot in the last optional lab for week 1?
For the plot, it says:
"The ‘quiver plot’ on the right provides a means of viewing the gradient of both parameters. The arrow sizes reflect the magnitude of the gradient at that point. The direction and slope of the arrow reflects the ratio of \frac{\partial J(w,b)}{\partial w} and \frac{\partial J(w,b)}{\partial b} at that point. Note that the gradient points away from the minimum. Review equation (3) above. The scaled gradient is subtracted from the current value of w or b. This moves the parameter in a direction that will reduce cost."
I understand every word but I don’t think I understand the plot.
“The arrow sizes reflect the magnitude of the gradient at that point.” - the thicker the arrows, the bigger the gradient. The arrows seem to be of the same thickness?
What does the color gradient of the arrows represent?
“The direction and slope of the arrow reflects the ratio of \frac{\partial J(w,b)}{\partial w} and \frac{\partial J(w,b)}{\partial b} at that point.” - dividing the partial derivative for w by the partial derivative for b, we get the direction and the slope of the arrows. Why do we do it, and what does it tell us?
The color is related to the length of the arrows. Both the length and the color tell the same thing: the magnitude of the gradient. We could certainly remove the coloring.
Each arrow tells us about \frac{\partial{J}}{\partial{w}} and \frac{\partial{J}}{\partial{b}} at that location of w and b.
The “width” (size of the horizontal projection) of the arrow is the magnitude of \frac{\partial{J}}{\partial{w}}, whereas the “height” is the magnitude of \frac{\partial{J}}{\partial{b}}. The larger the magnitude, the steeper the cost surface along that parameter.
Together, the “width” and the “height” define the direction. E.g. long “width” + short “height” gives a rather horizontal arrow.
Dividing the “height” by the “width” gives the slope, or the ratio of the two gradients.
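To make the “width” and “height” concrete, here is a minimal sketch that computes both components on a small grid of (w, b) values. The two-point dataset, the squared-error cost, and the names `gradient` and `arrow_grid` are assumptions for illustration, not something taken from the lab's plotting code; the data is chosen so that the minimum sits at w = 200, b = 100.

```python
import math

# Assumed toy data (not taken from this thread): y = 200*x + 100 exactly.
X_TRAIN = [1.0, 2.0]
Y_TRAIN = [300.0, 500.0]


def gradient(w, b, xs=X_TRAIN, ys=Y_TRAIN):
    """Return (dJ/dw, dJ/db) for the assumed cost J = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db


def arrow_grid(w_values, b_values):
    """One arrow per (w, b): horizontal part U = dJ/dw, vertical part V = dJ/db."""
    arrows = []
    for b in b_values:
        for w in w_values:
            u, v = gradient(w, b)
            arrows.append({"w": w, "b": b, "U": u, "V": v,
                           "length": math.hypot(u, v)})
    return arrows


arrows = arrow_grid(w_values=[0.0, 200.0, 400.0], b_values=[-100.0, 100.0, 300.0])
for arrow in arrows:
    print(arrow)
```

If matplotlib is available, the `w`, `b`, `U`, `V` columns are the four positional arguments that `quiver` expects, and `length` can be passed as the optional color array, which is why length and color carry the same information.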
Why do we do it, and what does it tell us?
To understand how gradient descent moves downhill, i.e. from regions of large gradient (long arrows) to regions of small gradient (short arrows). You will see this if we add a few things to the plot like this:
I used rectangular boxes to mark the region of smallest gradient values (shortest arrows); reaching it is also the goal of the gradient descent process.
Red circles mark different initial parameter values for w and b (we usually initialize them randomly, so any pair of values is possible).
In gradient descent, we update w by w := w - \alpha\frac{\partial{J}}{\partial{w}}, and b by b := b - \alpha\frac{\partial{J}}{\partial{b}}. So in each update, for example, the value of w is decreased by \alpha\frac{\partial{J}}{\partial{w}}, where \frac{\partial{J}}{\partial{w}} is the “width” of the arrow at that location, and similarly for b. So if we take \alpha=1, each update step moves one arrow closer to the smallest region.
Note that w and b move in the opposite direction of the arrows, because of the minus sign in the update formula (w := w - \alpha\frac{\partial{J}}{\partial{w}}, b := b - \alpha\frac{\partial{J}}{\partial{b}}). When the gradient is negative, the parameter increases!
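A single update can be sketched as follows. The dataset, the squared-error cost, the step size \alpha = 0.1 and the helper names `gradient` and `step` are all assumptions made for this illustration. At the chosen starting point both partial derivatives are negative, so both parameters increase after the step.

```python
def gradient(w, b, xs, ys):
    """(dJ/dw, dJ/db) for the assumed cost J = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    return sum(e * x for e, x in zip(errors, xs)) / m, sum(errors) / m


def step(w, b, alpha, xs, ys):
    """One gradient descent update: move against the arrow, scaled by alpha."""
    dj_dw, dj_db = gradient(w, b, xs, ys)
    return w - alpha * dj_dw, b - alpha * dj_db


# Assumed toy data and starting point (not taken from this thread).
xs, ys = [1.0, 2.0], [300.0, 500.0]
w, b = 0.0, 0.0

dj_dw, dj_db = gradient(w, b, xs, ys)
print(dj_dw, dj_db)  # -650.0 -400.0: the arrow points towards the lower left

w_new, b_new = step(w, b, alpha=0.1, xs=xs, ys=ys)
print(w_new, b_new)  # both larger than before: the move is towards the upper right
```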
One last note:
In my example, will gradient descent end up in two very different final w and b because of the different initial w and b?
Not really. If you look at my following graph, where I intentionally set all arrows to the same length in order to show just the direction, you see that the arrows start to turn around (I have circled two of them) inside the smallest region:
Followed backwards, those arrows finally point to the green dot, which is the optimal w and b, and which is also what you will see in the lab after training the model: (w \approx 200, b \approx 100).
Therefore, in story mode: no matter where we start w and b in the lab's example, gradient descent moves one arrow (assuming \alpha=1) at a time closer to the smallest region, until the arrows start to turn around towards the optimal solution, and finally it reaches there!
Remember this story is only for the example in the lab. Most models do not bring us to the same optimal solution regardless of the initial parameters. However, the analogy of gradient descent moving one arrow (when \alpha=1) at each step is always valid.
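For a linear regression example like the lab's, the "different starts, same destination" story can be checked with a short sketch. The dataset, the step size \alpha = 0.1, the number of steps, the two starting points and the function names are assumptions chosen for illustration; only the destination (w \approx 200, b \approx 100) comes from the discussion above.

```python
def gradient(w, b, xs, ys):
    """(dJ/dw, dJ/db) for the assumed cost J = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    return sum(e * x for e, x in zip(errors, xs)) / m, sum(errors) / m


def run_gradient_descent(w, b, alpha, steps, xs, ys):
    """Repeat the update w := w - alpha*dJ/dw, b := b - alpha*dJ/db."""
    for _ in range(steps):
        dj_dw, dj_db = gradient(w, b, xs, ys)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Assumed toy data: y = 200*x + 100 exactly.
xs, ys = [1.0, 2.0], [300.0, 500.0]

# Two hypothetical "red circles" far apart on the (w, b) plane.
results = []
for start in [(0.0, 0.0), (500.0, -300.0)]:
    w, b = run_gradient_descent(*start, alpha=0.1, steps=5000, xs=xs, ys=ys)
    results.append((w, b))
    print(start, "->", round(w, 3), round(b, 3))
```

Both runs end at essentially the same (w, b) because this cost has a single minimum; with a cost that has several minima, the same loop could stop at different places.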
Hi @rmwkwok , thank you so much for answering my questions. Your answer is very informative and thorough, and it is truly helpful for me to understand the quiver plot. Now I know what this graph means and why we use it. I think it is a handy plot to show a lot of info for gradient descent, in addition to the contour plot and others, and I am so happy to have learned it from the optional lab. Thank you again~~
Is this a correct interpretation of the quiver plot?
w and b are scalars. At each point (x, y) in the plot, the vector v represents the gradient of J(w, b) with respect to w and b respectively, i.e. [dJ_dw, dJ_db]. For example, if the current values of w, b were to be considered a set of positively charged particles, and the derivatives a magnet whose tip is positively charged, then these particles move to the other end of the magnet, with the strength of the magnet being proportional to the derivative, and the size of the jump related to alpha. The next location of the (higher dimensional) point can be thought of as the set of locations where each single dimensional particle ended up jumping to. So in a sense, going in the opposite direction of the arrow (gradient) is the second step, with the first step being going in the opposite direction of the strongest magnet (derivative). The quiver plot just combines these two steps and allows us to think about the movement or flow of a particle, whereas a contour plot is more focused on the surface itself.
A contour plot is us looking from above at J(w, b) and visualizing J(w, b) as color and as contours along which J(w, b) takes on the same value, whereas a quiver plot (also looking at J(w, b) from above) physically shows us these gradients as vectors and allows us to think in a higher dimension about what’s going on, by letting us interpret these arrows as the second step in the computation.
Does the Region of the smallest gradient magnitude mean the global minimum point of the cost function?
I think these arrows are trying to give a view of a 3D plot of the cost function with respect to w and b. In the lecture, we learned the 2D plot of the cost function with respect to w. The arrow in the negative slope direction means the points are changing from left to right, and the positive slope direction means the points are changing from right to left in order to reach the global minimum point, and the ratio is the value of the slope.
Did I say wrong?
A region has many points, so the region cannot mean a point. I would prefer to say “the region contains a minimum point of the cost function”. I intentionally dropped “global”, because whether it is global cannot be told by the graph.
From the courses, you certainly know that only in some special cases is there just one minimum, so that we can be sure it is global.
“the negative slope direction”, “the positive slope direction”, “the ratio”.
If these three terms come from what I had written in the replies above, would you please quote the relevant sentences?
If you created these terms, could you explain them together with any relevant sentences/graphs from the replies above that you may be referring to?
You can quote a sentence or a graph by highlighting it with mouse cursor, then a menu will pop up, and you can click “Quote”. You can repeat these steps to quote multiple items, just like what I did in my last reply to you.
Example:
We need your help so that we can follow what you are saying.
I think the “direction” in the quote denotes whether the arrow is downward or upward, and the “ratio” in the quote is the magnitude of the slope of an arrow. We don’t have the scope to know if the downward or upward arrow’s slope is a positive or negative one from the quiver plot.
We had the scope to know if the arrow’s slope is positive or negative in the “2D cost function vs. W” graph plot.
Now I think the following part of my previous replies is relevant to your questions:
I suppose you wanted to summarize things in your own words, right? Let’s take a look:
I can get what you are trying to deliver, but just a little comment on your choice of wording.
Usually we say a vector (arrow) has two components: direction and magnitude. In our case, both of them are affected by \partial_wJ :=\frac{\partial{J}}{\partial{w}} and \partial_bJ :=\frac{\partial{J}}{\partial{b}}.
The magnitude, by the Pythagorean theorem, is equal to \sqrt{(\partial_bJ)^2 + (\partial_wJ)^2}, whereas
the direction, by geometry, is obtained from \tan\theta = \frac{\partial_bJ}{\partial_wJ}.
Therefore, both magnitude and direction reflect \partial_bJ and \partial_wJ, but in different mathematical ways, and not just as a ratio.
Your “slope” is more commonly known as direction. Therefore, although it is right for you to say both direction and slope reflect the ratio, it is just strange for you to mention the same thing twice. Instead, if you want to mention two different things, then those will be direction and magnitude. However, mentioning those two things means that we can no longer just say “ratio” because, as we already see from the maths, they are not just a ratio.
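The two formulas can be tried out with a short sketch. The function name and the sample numbers are hypothetical; `math.atan2` is used instead of a plain arctangent of the ratio (a swapped-in choice) because it also tells apart arrows that have the same slope but point in opposite directions.

```python
import math


def arrow_magnitude_and_angle(dj_dw, dj_db):
    """Magnitude by the Pythagorean theorem, angle (in degrees) from tan(theta) = dj_db / dj_dw."""
    magnitude = math.hypot(dj_dw, dj_db)
    angle_deg = math.degrees(math.atan2(dj_db, dj_dw))
    return magnitude, angle_deg


# Two hypothetical gradients with the same ratio dj_db / dj_dw = 4/3.
small = arrow_magnitude_and_angle(3.0, 4.0)
large = arrow_magnitude_and_angle(6.0, 8.0)
print(small)  # magnitude 5.0, angle of roughly 53 degrees
print(large)  # magnitude 10.0, same angle
```

The two arrows share a direction because the ratio is the same, yet their magnitudes differ, which is the sense in which magnitude is "not just ratio".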
We can. Slope = \frac{\partial_bJ}{\partial_wJ}, and if it is negative, then exactly one of {\partial_bJ} and {\partial_wJ} is negative. Consequently, the arrow points towards either the upper left or the lower right.
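As a small sanity check of that reasoning, the sketch below maps the signs of the two partial derivatives to where the arrow points. The function name `arrow_heading` and the sample values are hypothetical.

```python
def arrow_heading(dj_dw, dj_db):
    """Describe where an arrow with components (dJ/dw, dJ/db) points on the (w, b) plane."""
    if dj_dw == 0 or dj_db == 0:
        return "axis-aligned or zero"
    horizontal = "right" if dj_dw > 0 else "left"
    vertical = "upper" if dj_db > 0 else "lower"
    return f"{vertical} {horizontal}"


# Negative slope: exactly one component is negative.
print(arrow_heading(-2.0, 1.0))   # upper left
print(arrow_heading(2.0, -1.0))   # lower right

# Positive slope: both components share a sign.
print(arrow_heading(2.0, 1.0))    # upper right
print(arrow_heading(-2.0, -1.0))  # lower left
```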