Cost function shape in neural networks

I understand that each model has a cost function that works well with gradient descent, i.e. a function with a single global minimum.

For example:

  • Cost function for linear regression is the MSE, which has one global minimum.

  • Cost function for logistic regression is the log loss (binary cross-entropy)

    J(w, b) = −(1/m) Σ [ y log(ŷ) + (1 − y) log(1 − ŷ) ]

    (summed over the m training examples), which has one global minimum.

For a neural network it looks different:

When we talk about a neural network model with multiple hidden layers and a linear output layer with the MSE cost function, I understand that the MSE cost function can have multiple local minima. Can I have an intuition about why this happens?
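(For concreteness, a minimal numpy sketch with made-up toy weights and data, not from any course material: hidden units in a layer are interchangeable, so permuting them gives a different point in parameter space with exactly the same MSE. Any minimum therefore exists in many copies, and a function with several distinct minimizers cannot be convex.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden units (ReLU) -> 1 linear output.
W1 = rng.normal(size=(3, 2)); b1 = rng.normal(size=3)
W2 = rng.normal(size=(1, 3)); b2 = rng.normal(size=1)

X = rng.normal(size=(10, 2))          # toy inputs
y = rng.normal(size=(10, 1))          # toy targets

def mse(W1, b1, W2, b2):
    h = np.maximum(0, X @ W1.T + b1)  # hidden activations
    pred = h @ W2.T + b2
    return np.mean((pred - y) ** 2)

# Swap hidden units 0 and 1 (rows of W1/b1 and matching columns of W2):
# a different weight vector, but exactly the same function and loss.
perm = [1, 0, 2]
loss_a = mse(W1, b1, W2, b2)
loss_b = mse(W1[perm], b1[perm], W2[:, perm], b2)

print(np.isclose(loss_a, loss_b))     # True
```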

And if it looks like we don’t have an idea about the shape of the cost function, how can we choose a cost function? In other words, if I have a linear output layer but I know that the MSE is no longer guaranteed to have a single global minimum, why do I choose MSE instead of another cost function? If I have an output layer with a sigmoid activation function, why do I choose the log-loss (binary cross-entropy) cost function, knowing that in a deep neural network the cost-function shape is different from simple logistic regression and can have multiple minima?

Thank you as always

Hi @KiraDiShira ,

Let me attempt to answer the quoted question, starting from the more general concept of gradient descent:

The goal of training a model is to optimize it so that it predicts very close to the ground truth. For this optimization we use gradient descent, which assumes that the model can converge.

In simple systems with very few dimensions, you would use full (batch) gradient descent, a simpler procedure to reach the optimum. In these simple models with 1-2 dimensions, you can get into local minima that act as traps. It may happen.
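A tiny illustration of such a trap (a hedged sketch on a made-up 1D "loss", not any real model): plain gradient descent started near the shallow well converges to the local minimum and never reaches the deeper one on the other side.

```python
# Toy 1D "loss": a tilted double well with a local minimum near w ≈ 0.96
# and a deeper global minimum near w ≈ -1.04.
def f(w):
    return (w**2 - 1) ** 2 + 0.3 * w

def df(w):
    return 4 * w * (w**2 - 1) + 0.3

w = 0.9                      # start inside the shallow well
for _ in range(500):         # plain (full) gradient descent
    w -= 0.01 * df(w)

print(round(w, 2))           # 0.96: stuck in the local minimum
print(f(w) > f(-1.04))       # True: the global minimum is lower
```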

Then we have the more complex models, which involve perhaps millions of parameters. For these we would use Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD), which is mainly used in complex NNs, is unlikely to get stuck in local minima because it is by nature very noisy. This noisiness may sometimes allow it to skip over local minima. So you might say, “Hmm, is it a matter of luck?”
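A small numpy sketch of where that noise comes from (toy linear-regression data, made up for illustration): each minibatch gradient is a noisy estimate of the full gradient, equal to it only on average.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.zeros(3)

def grad(Xb, yb, w):
    # Gradient of the batch MSE (1/n)*||Xb w - yb||^2.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)

# Ten disjoint minibatches of size 10: each gradient differs from `full`,
# but their average matches it exactly -- noisy yet unbiased.
batch_grads = [grad(X[i:i + 10], y[i:i + 10], w) for i in range(0, 100, 10)]
print(np.allclose(np.mean(batch_grads, axis=0), full))  # True
print(np.std([g[0] for g in batch_grads]))              # > 0: each one is noisy
```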

Well, the real reason why NNs can be optimized is that there aren’t that many local minima that are ‘traps’. Complex NNs are built in such a way that the parameter space has so many dimensions that there are hardly any local minima.

When we humans imagine functions in a graph, we usually think in 2D or maybe 3D, and we can arguably say that there is a high chance of local-minima traps, as discussed for the simpler models that use full gradient descent.

In 3D, however, we may start gaining the intuition that trap local minima are rare. You’ll usually find the form of a saddle, where the apparent local minimum can actually continue descending along one of the sides. From this intuition, try to imagine a complex neural network. These NNs create such complex multi-dimensional spaces (perhaps millions of dimensions or more, which we cannot visualize) that local-minima traps are rare.
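The saddle intuition can be checked in two dimensions (a minimal numpy sketch, not from the course): f(x, y) = x² − y² has zero gradient at the origin, yet its Hessian has a negative eigenvalue, so the loss keeps decreasing along the y-axis.

```python
import numpy as np

# Saddle: f(x, y) = x^2 - y^2. The origin looks like a candidate
# minimum to gradient descent, because the gradient vanishes there...
def grad(x, y):
    return np.array([2 * x, -2 * y])

print(np.allclose(grad(0.0, 0.0), 0))  # True: a critical point

# ...but the (constant) Hessian has one negative eigenvalue: a
# direction (the y-axis) along which f keeps decreasing.
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvalsh(H))           # [-2.  2.]: a saddle, not a minimum
```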

This is my understanding of this situation :slight_smile:


@KiraDiShira ,

There is a previous post that discusses this very same topic in more depth. I recommend that you read it, as it will shed a lot of light on your question:

