Hi community,
In the lecture, Andrew says:
If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So, in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero.
Question 1) What is the vertical direction? And what are these positive and negative numbers that tend to average out?
Whereas, on the horizontal direction, all the derivatives are pointing to the right of the horizontal direction, so the average in the horizontal direction will still be pretty big.
Question 2) What is the horizontal direction? Which derivatives are pointing to the right along the horizontal direction, and why will their average still be pretty big?
The diagram shows the “contour lines” of the cost surface projected down onto a 2D plane. The point here is that the surface is not symmetric: if you visualize it in 3 dimensions, one axis is “squashed” relative to the other.
Of course this is ridiculously simplified compared to what happens in real networks. The fundamental issue is that the number of dimensions here is the number of parameters, and even the relatively small sample networks we deal with in this course have hundreds or thousands of parameters. Unfortunately it’s impossible to draw diagrams in 100-dimensional space. And “real” models typically have millions or even billions of parameters; GPT-4 reportedly has about 1.76 trillion. Good luck trying to visualize the solution surface for that cost function. There’s a paper from Yann LeCun’s group about visualizing solution surfaces that’s worth a look.
On the point about vertical versus horizontal movement, notice that the gradient arrows point in the wrong direction (more vertical than horizontal) precisely because of the “squashed” shape of the contours. The gradient is perpendicular to the tangent of the contour line at any point: along the contour line the cost doesn’t change at all, so the direction of maximal change (either increase or decrease) is perpendicular to it. Of course, another possible way to fix that asymmetry is to normalize the inputs.
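To make the “averaging out” concrete, here is a small sketch on a hypothetical squashed bowl (the cost function, starting point, and learning rate are made-up illustrations, not values from the lecture). With a largish learning rate, plain gradient descent overshoots in the steep “vertical” (w2) direction, so those gradient components alternate in sign and nearly cancel in an average, while the shallow “horizontal” (w1) components all share the same sign:

```python
import numpy as np

# Hypothetical squashed bowl (not the lecture's actual cost function):
# f(w1, w2) = w1**2 + 25 * w2**2, so grad f = (2*w1, 50*w2).
# Its contours are elongated ellipses: steep "vertically" (w2),
# shallow "horizontally" (w1).
def grad(w):
    return np.array([2.0 * w[0], 50.0 * w[1]])

w = np.array([-10.0, 1.0])   # start far to the left, a bit above the axis
lr = 0.035                   # large enough that plain GD overshoots in w2
beta = 0.9
v = np.zeros(2)              # momentum's exponentially weighted average

grads = []
for _ in range(20):
    g = grad(w)
    grads.append(g)
    v = beta * v + (1 - beta) * g   # the running average momentum maintains
    w = w - lr * g                  # plain gradient descent step

grads = np.array(grads)
print("mean vertical gradient:  ", grads[:, 1].mean())  # small: +50, -37.5, ... cancel
print("mean horizontal gradient:", grads[:, 0].mean())  # large, all the same sign
print("momentum average v:      ", v)                   # points mostly horizontally
```

The momentum average `v` ends up much larger in the horizontal component than in the vertical one, which is exactly why a momentum step damps the oscillation while still making good progress toward the minimum. Normalizing the inputs, as mentioned above, would instead make the contours rounder so the asymmetry never arises in the first place.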