Hi,

I understand that we need to smooth out the oscillation, but I am confused about why b is on the y-axis and w is on the x-axis. I am also not sure why dW needs to be small and db needs to be large, as described in the lecture.

w and b are on the two axes because it’s a plot of cost vs w and b.

Thank you. Could you please explain a bit more? I am still confused.

I guess I do not understand your question fully.

If you examine the formulas, note that W and b are both handled in the same way. There is no significance to which one is depicted on the x or y axis in the picture. The bigger point here is that it’s essentially impossible to realistically show what is going on in just 2 or 3 dimensions. Prof Ng is just doing the best he can to give some geometric intuition by drawing the picture in 2 dimensions. The dimensionality of real networks is typically in the thousands or even higher; the claim is that GPT-4 has on the order of 1 trillion parameters. How do you draw a graph in thousand-dimensional space, let alone trillion-dimensional space?
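For reference, here are the standard RMSProp update rules as presented in the lecture, written out so you can see that W and b get exactly the same treatment (same beta, alpha, and epsilon):

```
S_dW = beta * S_dW + (1 - beta) * dW^2    # moving average of squared gradients
S_db = beta * S_db + (1 - beta) * db^2
W = W - alpha * dW / (sqrt(S_dW) + epsilon)
b = b - alpha * db / (sqrt(S_db) + epsilon)
```

Swap the symbols W and b everywhere and the equations are unchanged, which is why the choice of axes in the picture carries no meaning.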

Hello Anson @ansonchantf,

Let me expand a bit more on Paul’s answer. If we look for any reason in this one lecture that supports the asymmetry between W and b, **we won’t be able to find one**, because everything written down there, except the graph, shows only the symmetric treatment of W and b.

As Paul pointed out, the purpose of the graph is to show the effect, which I think you already understood, because you said so in the first post. In other words, for the purpose of this lecture, there is no need to show any underlying reason for the asymmetric behavior displayed in the graph.

However, if we have to imagine a reason, for curiosity’s sake, that could produce such asymmetry, then perhaps it is a problem with 1 feature and 1 label modeled by just one output layer, where the range of the feature is larger than the range of the label, so that the cost surface looks like a wide ellipse. Because the learning rate is then too large for the b-dimension (but not for the w-dimension), the path oscillates in the b-dimension. However, everything in this paragraph is just a guess, and the cause (as opposed to the effect) shown in the graph is not the key to explaining RMSProp.
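To make that guess concrete, here is a small toy sketch (my own construction, not from the lecture): a quadratic cost whose gradient in the b-direction is 25× steeper than in the w-direction, minimized with plain gradient descent versus RMSProp.

```python
import math

# Toy "wide ellipse" cost (my own construction, not the lecture's data):
# J(w, b) = 0.5 * w**2 + 12.5 * b**2, so dJ/dw = w and dJ/db = 25 * b.
# The gradient in the b-direction is 25x steeper than in the w-direction.
def grads(w, b):
    return w, 25.0 * b

def run(use_rmsprop, alpha=0.1, beta=0.9, eps=1e-8, steps=100):
    w, b = 5.0, 1.0
    s_w = s_b = 0.0
    for _ in range(steps):
        dw, db = grads(w, b)
        if use_rmsprop:
            # Exponentially weighted average of the squared gradients,
            # then divide each raw gradient by the root of its average.
            s_w = beta * s_w + (1 - beta) * dw ** 2
            s_b = beta * s_b + (1 - beta) * db ** 2
            w -= alpha * dw / (math.sqrt(s_w) + eps)
            b -= alpha * db / (math.sqrt(s_b) + eps)
        else:
            w -= alpha * dw
            b -= alpha * db
    return w, b

w_gd, b_gd = run(use_rmsprop=False)   # plain GD: b oscillates with growing amplitude
w_rms, b_rms = run(use_rmsprop=True)  # RMSProp: both directions stay bounded
```

With alpha = 0.1, plain gradient descent multiplies b by (1 - 0.1 × 25) = -1.5 each step, so it oscillates and blows up, while RMSProp divides by sqrt(S_db) and keeps each b-step bounded by roughly alpha. That is the "slow down the steep direction" effect the graph is illustrating.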

Cheers,

Raymond

Thank you all!