Always exciting to revisit these fundamental concepts as we look at things differently.

I was curious about the starting point for Gradient Descent when you are on top of the hill as explained in the video.
How do you begin taking that “baby step” with the fastest drop when you are starting on top of the hill and all immediate directions are equal?

I think that if all the immediate directions look equal, we are at a local maximum (or a saddle point), where the derivative is 0 and plain gradient descent gets stuck. In practice, though, when we initialise the weights randomly there is a vanishingly small chance of landing exactly on such a point; most of the time we are not actually on the very top of the hill but somewhere on the slope.
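A tiny sketch of this in Python (using a made-up 1-D loss, not anything from the video): if you start exactly where the derivative is zero, gradient descent never moves, but a random start almost never lands on that point.

```python
import math
import random

# Hypothetical 1-D loss surface: cos(w) has a "hilltop" at w = 0
# and valleys (minima) at w = +/- pi.
def loss(w):
    return math.cos(w)

def grad(w):
    return -math.sin(w)

def gradient_descent(w0, lr=0.1, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # the "baby step" downhill
    return w

# Starting exactly on the hilltop: the gradient is 0, so we never move.
print(gradient_descent(0.0))  # stays at 0.0

# Starting from a random weight: almost surely not on the hilltop,
# so we slide down into one of the valleys near +/- pi.
random.seed(42)
print(gradient_descent(random.uniform(-3, 3)))
```

The second run converges to a minimum near ±π, which is why random initialisation makes the "stuck on the exact top" case a non-issue in practice.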

Even if training does get stuck in a local minimum, we can rerun it several times with different random initialisations, hoping that at least one of those runs reaches the global optimum.
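That restart strategy can be sketched like this, again with a made-up 1-D loss that has one shallow valley and one deeper (global) one: run gradient descent from several random starting points and keep the result with the lowest loss.

```python
import random

# Hypothetical loss with two valleys; the one near w = -1 is deeper,
# so it is the global minimum.
def loss(w):
    return (w * w - 1) ** 2 + 0.3 * w

def grad(w):
    return 4 * w * (w * w - 1) + 0.3

def descend(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Several runs from random initial weights; keep the best outcome.
random.seed(0)
candidates = [descend(random.uniform(-2, 2)) for _ in range(10)]
best = min(candidates, key=loss)
print(best)  # lands near w = -1, the global minimum
```

Some restarts end up in the shallow valley near w = +1, but taking the minimum-loss result over all runs recovers the deeper one, which is exactly the "run it many times" idea.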