This is my first post, but I hope this observation will be interesting to you all.
While watching the videos on the Gradient Descent algorithm, and specifically its end state, an observation came to mind:
If we accidentally choose our initial w(i) and b values such that we land on a local maximum of the cost function, the algorithm would fail - every partial derivative would be zero, so it would get stuck in place, right?
Naturally, I'm not talking about linear regression - this would need a more complex model (for linear regression the cost function is bowl-shaped and doesn't have any maximum).
I guess there's a trick to avoiding this problem?
I was thinking of taking a small random step in some direction, just to get the algorithm rolling?
If w and b sit exactly at a local maximum, gradient descent gets stuck at that point because the gradient is zero.
That's why we use techniques like random initialization of w, which makes landing exactly on a local maximum highly unlikely.
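To make this concrete, here is a tiny illustration (my own toy example, not from the course): the 1-D cost J(w) = w^4 - 2w^2 has a local maximum at w = 0 and minima at w = ±1. Started exactly at the maximum, gradient descent never moves; started with even a tiny random offset, it rolls down to a minimum.

```python
# Toy cost J(w) = w**4 - 2*w**2: local maximum at w = 0, minima at w = +/-1.
def grad(w):
    # dJ/dw = 4*w**3 - 4*w; exactly zero at the local maximum w = 0
    return 4 * w**3 - 4 * w

def gradient_descent(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(gradient_descent(0.0))   # stuck forever: the update is w - lr*0
print(gradient_descent(0.01))  # a tiny offset escapes and converges near w = 1
```

This is exactly the "random step to get rolling" idea: any perturbation away from w = 0 gives a nonzero gradient, and the usual updates take over from there.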
And another point is that in most machine learning algorithms, we have lots of features.
Suppose a model has 50 features, so its w is 50-dimensional; for a point to be a local maximum, the cost has to curve downward along every one of those 50 directions at once, which is highly unlikely.
In practice, a machine learning model is unlikely to encounter a true local maximum or minimum; mostly it gets stuck at saddle points or plateaus, where the derivatives stay close to 0 for a long time and training becomes slow.
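A minimal sketch of that saddle-point behavior, using the classic hypothetical surface J(x, y) = x^2 - y^2 (curving up along x, down along y, with zero gradient at the origin): near the saddle the gradient is tiny, so the updates barely move for many steps before y slowly escapes along the downhill direction.

```python
import numpy as np

# Hypothetical 2-D cost with a saddle at the origin: J(x, y) = x**2 - y**2
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])  # zero at (0, 0), but (0, 0) is not a minimum

p = np.array([1e-6, 1e-6])  # start very close to the saddle point
for _ in range(50):
    p = p - 0.1 * grad(p)   # gradient magnitude ~1e-6 here, so progress is slow

# x shrinks toward 0 while y only slowly grows along the escape direction
print(p)
```

After 50 steps the y coordinate has moved less than 0.01 - that crawl near zero gradient is why saddle points and plateaus slow training rather than stop it outright.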
Also, we prefer to use cost functions that are convex, so there are no local maxima at all: every gradient points "downhill" toward the single global minimum.
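For example, the squared-error cost for linear regression is convex in (w, b), so gradient descent reaches the same global minimum from any starting point. A quick sketch (toy data I made up, with true parameters w = 2, b = 1):

```python
import numpy as np

# Toy data generated from y = 2*x + 1; the squared-error cost is convex in (w, b)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

def step(w, b, lr=0.1):
    pred = w * x + b
    dw = np.mean(2 * (pred - y) * x)  # d(cost)/dw
    db = np.mean(2 * (pred - y))      # d(cost)/db
    return w - lr * dw, b - lr * db

# Two wildly different initializations both converge to the same minimum
for w0, b0 in [(-10.0, 10.0), (5.0, -5.0)]:
    w, b = w0, b0
    for _ in range(2000):
        w, b = step(w, b)
    print(round(w, 3), round(b, 3))  # both runs end near w = 2, b = 1
```

With a bowl-shaped cost there is nothing to get stuck on, which is exactly why the initialization question only really bites for more complex, non-convex models.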