C1_W3_Lab06: meandering gradient descent

Greetings!

Do you have an idea why the gradient-descent trajectory in the screenshot below bends the way it does?
In theory the trajectory should always be perpendicular to the contour lines (since we are computing the gradient), right?

Thank you.

What do you mean exactly by “do you have an idea…”?

Why is this happening?

Sorry, I’m not sure what “this” you are referring to.

It would help if you annotated the screen capture to highlight the part you’re asking about.

It does look like the gradients are not perpendicular to the contour lines. My guess would be that the rendering just isn’t very accurate. I’m not a mentor for this course, so I’m not sure exactly what he says in the lectures there, but when he covers this in DLS, one of the points he makes is that this is a graphical argument for why normalization helps: when the scales of the different features are significantly different, you get convergence problems, because the perpendicular direction can lead you along a suboptimal trajectory.
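For what it’s worth, here is a rough sketch of that effect (a toy quadratic cost and a made-up learning rate, not the lab’s actual code): the update overshoots back and forth along the steep w direction while it only crawls along the shallow b direction.

```python
# Toy illustration (assumed cost, not the lab's): gradient descent on
# J(w, b) = (a * w**2 + b**2) / 2 with a >> 1, mimicking badly scaled features.
import numpy as np

a = 25.0                                  # hypothetical scale mismatch between w and b
def grad(w, b):
    return np.array([a * w, b])           # (dJ/dw, dJ/db)

theta = np.array([1.0, 1.0])              # start at (w, b) = (1, 1)
alpha = 0.07                              # learning rate chosen just to make the zig-zag visible
path = [theta.copy()]
for _ in range(30):
    theta = theta - alpha * grad(*theta)  # plain gradient-descent update
    path.append(theta.copy())

print(np.array(path)[:5])                 # w flips sign each step while b shrinks smoothly
```

Normalizing the features makes the contours rounder, so the perpendicular (gradient) direction points much more directly at the minimum and the zig-zag largely disappears.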

Hello @AKazak,

Here is a similar discussion that shows that if we skew the axes, the path will not be perpendicular.


If we look at the axes’ scales, it is obvious that (1) the w axis is drawn longer than the b axis, but (2) the range of w is smaller than that of b. Therefore, before we say that it is not perpendicular, we would need to correct for that effect first.
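For what it’s worth, here is a quick numerical check with a made-up quadratic cost (not the lab’s cost). Measured in the actual (w, b) coordinates, the gradient is perpendicular to the contour through any point, no matter how stretched the plot looks:

```python
# Toy check: for J(w, b) = (a*w**2 + b**2) / 2, the gradient is perpendicular
# to the contour J = c at every point of the contour (the values below are arbitrary).
import numpy as np

a, c, t = 25.0, 1.0, 0.7                                     # scale factor, contour level, parameter
w, b = np.sqrt(2*c/a) * np.cos(t), np.sqrt(2*c) * np.sin(t)  # a point on the contour J = c
tangent  = np.array([-np.sqrt(2*c/a) * np.sin(t), np.sqrt(2*c) * np.cos(t)])  # contour direction
gradient = np.array([a * w, b])                                               # (dJ/dw, dJ/db)

print(np.dot(tangent, gradient))                             # ~0, i.e. perpendicular
```

The dot product is (numerically) zero, so any apparent slant in the plot has to come from the axes’ scales, not from the math.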

Cheers,
Raymond

Got it.
What does DLS stand for?

I see your point.
However, to my understanding, a linear transformation of the axes should never change the angles between vectors and contour lines; that is, if a vector is perpendicular to a contour line, it will stay perpendicular no matter how you linearly scale the axes. Right?

But doesn’t the GIF show you the opposite? I am copying the GIF here:

[GIF: Project001 (3)]

I mean, the GIF should establish some fact, but if you have a different hypothesis, you might present your reasoning for why the angle should be invariant under a linear transformation.

Below is, perhaps, a simpler example that shows how the angle changes when we squeeze the x-axis.

If we squeeze it infinitesimally, the vector would look parallel to the y-axis, wouldn’t it? I mean, the angle between the y-axis and the vector keeps changing; why is that? Why wouldn’t they re-orient at the same rate to keep the angle invariant?
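To put numbers on the squeeze (the vector and the scale factors below are arbitrary), scaling the x-axis by a factor s changes the angle you would draw between the vector and the y-axis:

```python
# Toy illustration: how a vector's drawn angle to the y-axis changes as the x-axis is squeezed.
import numpy as np

def angle_to_y_axis(v):
    y = np.array([0.0, 1.0])
    return np.degrees(np.arccos(np.dot(v, y) / np.linalg.norm(v)))

v = np.array([1.0, 1.0])                    # original vector, 45 degrees from the y-axis
for s in [1.0, 0.5, 0.1, 0.01]:             # squeeze the x-axis more and more
    drawn = np.array([s * v[0], v[1]])      # how the vector appears after the squeeze
    print(s, angle_to_y_axis(drawn))        # 45.0, 26.6, 5.7, 0.6 -> approaches parallel
```

The vector’s components never change; only the way it is drawn does, and the drawn angle shrinks towards zero.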

Cheers!

Yes, the vector-to-axis angles will surely change, but the vector-to-contour angles will not.

See the figure below.
I do understand the trajectory part marked by the green oval.
However, I do not understand the trajectory part marked by the red oval.
In my understanding, the optimal gradient-descent trajectory should be the green arrow.

The Deep Learning Specialization, which is the recommended next step once you finish MLS.


@AKazak, I squeezed the w-axis a bit, now the red one wins!

There are “two angles” we are talking about:

  • the visual angle on the plot
  • the theoretical angle between the contour and the gradient’s direction

The first angle is affected by how you scale the graph. To make the two consistent, we need a 1:1 scale.
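A minimal matplotlib sketch of this (with an assumed toy cost, not the lab’s actual plot) is to draw the same contours and gradient arrow twice and force a 1:1 scale on only one panel:

```python
# Toy illustration: the same gradient arrow only *looks* perpendicular on a 1:1 scale.
import numpy as np
import matplotlib.pyplot as plt

a = 25.0                                         # hypothetical scale mismatch
W, B = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-5, 5, 200))
J = (a * W**2 + B**2) / 2

point = np.array([0.5, 2.0])                     # an arbitrary (w, b)
J0 = (a * point[0]**2 + point[1]**2) / 2         # contour level through that point
g = np.array([a * point[0], point[1]])           # gradient at that point
g = g / np.linalg.norm(g)                        # unit length, just for drawing

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, title in [(ax1, "auto aspect"), (ax2, "1:1 aspect")]:
    ax.contour(W, B, J, levels=[J0 * f for f in (0.25, 0.5, 1.0, 2.0, 3.0)])
    ax.annotate("", xy=point + g, xytext=point, arrowprops=dict(arrowstyle="->"))
    ax.set_xlabel("w"); ax.set_ylabel("b"); ax.set_title(title)
ax2.set_aspect("equal")                          # only this panel is drawn to true scale
plt.show()
```

On the equal-aspect panel the arrow crosses its contour at a right angle; on the auto-scaled panel the very same arrow looks slanted.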


Thank you for clarifying this.
I totally agree with you and meant the “true” theoretical angle between the vector and the contour, in radians.

How about the green trajectory below, which seems to be shorter than the original two-segment trajectory?
In my understanding, if you update all components of the vector w independently, then it should follow the green trajectory. Right?

That’s a good question! The thing is, distance is not the decisive factor: the trajectory does not have to follow the shortest path. On the contrary, as you mentioned at the very beginning, the trajectory should be “perpendicular” to the contour lines, and that is the decisive factor.

Allow me to refer to the following graph instead of the one in your last post because this one is closer to 1:1 and still shows that the green arrow is the shortest.

When will the trajectory be the shortest path? One example is when all contours are perfect circles; then the normal always points towards the center of the circle. Here, we have something like ellipses.

In fact, if gradient descent “knew” the shortest path, model training would be much easier! For the lab, for the lecture, and for linear models, we can draw out the contours and tell what the shortest path is. But in real non-linear cases, we don’t know the contours beforehand, let alone what the shortest path should be.

Before the model takes its next step towards a hoped-for minimum, gradient descent decides the direction using only information around the current location; it possesses local information, not global. In contrast, knowing and following the shortest path requires global information that gradient descent does not have.

As its name suggests, it descends based on the (local) gradient, not the shortest path.
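To put a number on it, here is a tiny sketch (toy elliptical cost, made-up starting point and learning rate): the descent path, built from purely local gradient steps, comes out longer than the straight line to the minimum.

```python
# Toy comparison: length of the gradient-descent path vs. the straight-line distance.
import numpy as np

a = 25.0                                       # elongated (elliptical) contours
grad = lambda p: np.array([a * p[0], p[1]])    # gradient of J(w, b) = (a*w**2 + b**2) / 2

p = np.array([1.0, 4.0])                       # starting point; the minimum is at (0, 0)
straight = np.linalg.norm(p)                   # length of the shortest possible path
path_len, steps = 0.0, 0
while np.linalg.norm(p) > 1e-3 and steps < 10_000:
    step = 0.03 * grad(p)                      # each step uses only the local gradient
    path_len += np.linalg.norm(step)
    p = p - step
    steps += 1

print(straight, path_len)                      # the descent path is noticeably longer
```

With circular contours (a = 1 in this toy) the gradient would point straight at the minimum and the two lengths would essentially match.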