Cost function shape in neural network

Hi @KiraDiShira ,

Let me attempt to answer this quoted question, from the more general concept of Gradient Descent:

The goal when training a model is to optimize its parameters so that its predictions come very close to the ground truth. For this optimization we use gradient descent, which assumes the training process can converge.

In simple systems with very few dimensions, you would use full (batch) gradient descent, which computes the gradient over the whole dataset at every step. In such low-dimensional models (1-2 parameters), the optimizer can indeed land in a local minimum that acts as a trap; that does happen (see the toy example below).
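Here is a minimal, purely illustrative Python sketch (the quartic loss and its numbers are made up for this example, not anything from the course): plain gradient descent started on the wrong side of the landscape settles in the shallower local minimum and never reaches the global one.

```python
# Toy 1-D "loss" with a global minimum near x ≈ -1.30 and a
# shallower local minimum near x ≈ 1.13 (made-up example).
def loss(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

x = 2.0      # start on the side of the landscape nearer the local minimum
lr = 0.01    # learning rate
for _ in range(1000):
    x -= lr * grad(x)     # full (batch) gradient descent step

print(f"converged to x = {x:.2f}, loss = {loss(x):.2f}")
# Ends up at the local minimum near 1.13, even though the
# global minimum near -1.30 has a lower loss.
```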

Then we have more complex models that involve perhaps millions of parameters. For these we would use Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD), which is mainly used in complex NNs, is unlikely to get stuck in local minima because it is noisy by nature: each update is computed from a small mini-batch rather than the full dataset. That noise can sometimes kick the optimizer out of a shallow local minimum. So you might say, “hmm, is it just a matter of luck?”
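To see that effect concretely, here is the same toy quartic with explicit Gaussian noise added to the gradient as a crude stand-in for mini-batch noise (the noise level and step counts are arbitrary assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(x):                 # gradient of the same toy loss as above
    return 4 * x**3 - 6 * x + 1

escapes = 0
for trial in range(100):
    x, lr = 2.0, 0.01
    for _ in range(5000):
        # Gaussian noise is a crude stand-in for the noise of a
        # mini-batch gradient estimate; real SGD gets it from
        # sampling a different mini-batch at every step.
        noisy_grad = grad(x) + rng.normal(0.0, 10.0)
        x -= lr * noisy_grad
    if x < 0:                # finished in the deeper basin near -1.30
        escapes += 1

print(f"{escapes}/100 noisy runs ended near the global minimum")
```

With these (arbitrary) settings many of the noisy runs hop over the barrier into the deeper basin, while the noise-free version from the same start never does.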

Well, the real reason why NNs can be optimized is that there aren’t that many local minima that act as ‘traps’. Complex NNs have parameter spaces of such high dimension that poor local minima are rare.

When we humans picture a function on a graph, we usually think in 2D or maybe 3D, and in those low-dimensional pictures it is easy to find local-minimum traps, as discussed above for the simpler models that use full gradient descent.

Already in 3D, however, we start to gain the intuition that trap-like local minima are rare: what looks like a local minimum often turns out to be a saddle, where descent can continue along one of the other directions. Now extrapolate that intuition to a complex neural network. Its loss surface lives in a space of perhaps millions of dimensions (far more than we can visualize), and a point would have to curve upward in every single one of those directions to be a true local minimum, which makes local-minimum traps rare. A toy saddle is sketched below.
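For the saddle intuition, here is a tiny sketch with the classic saddle f(x, y) = x² − y² (again just an illustrative toy): gradient descent started slightly off the x-axis does not get stuck at the origin; it slides off along the descending y direction.

```python
import numpy as np

# Classic saddle: f(x, y) = x**2 - y**2. The origin looks like a
# minimum along x but is a maximum along y, so descent can always
# continue along the y direction.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 1e-3])    # start almost exactly on the x-axis
lr = 0.1
for _ in range(40):
    p = p - lr * grad(p)

print(p)
# The x coordinate has shrunk toward 0 while y has grown:
# the iterate slides off the saddle instead of getting stuck.
```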

This is my understanding of this situation :slight_smile:
