Why does gradient descent work for neural networks?


This week, we learned that gradient descent works for neural networks when using “ReLU” or “linear” activation functions. I didn’t understand why that is the case.

For example, when we were learning about logistic regression, the first loss function we tried was mean squared error, but we saw that it has multiple local minima, so we changed to the logistic loss function. Similarly, why does using mean squared error in a neural network (NN) work with a complicated architecture, and why does the cost have a single minimum? Why does gradient descent work for it?
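To make the logistic-regression comparison concrete, here is a minimal sketch, using made-up 1-D data and a single weight with no bias, that numerically checks convexity of the two costs via their second differences (a convex curve has non-negative second differences everywhere):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 1-D data: two negative and two positive examples.
x = np.array([-2.0, -1.0, 1.5, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.linspace(-10, 10, 2001)           # sweep a single weight
p = sigmoid(np.outer(w, x))              # predictions for every w

# Squared-error cost vs. logistic loss, averaged over the examples.
mse = np.mean((p - y) ** 2, axis=1)
log_loss = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p), axis=1)

# A convex curve never has a negative second difference.
print("MSE convex?     ", np.all(np.diff(mse, 2) >= -1e-12))       # expect False
print("Log loss convex?", np.all(np.diff(log_loss, 2) >= -1e-12))  # expect True
```

The squared-error curve bends the wrong way in the saturated regions of the sigmoid, which is exactly why the course switched to the logistic loss for logistic regression.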

NNs can also use sigmoid() or tanh() activations.

It’s not just ReLU or linear.

It doesn’t. NNs always have a non-convex cost function (because of the non-linear activations in the hidden layers), so they don’t have a single minimum.

The trick with an NN is to find a minimum that is “good enough”. The way to judge “good enough” is to verify the results on a test set after training.
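One way to see the “no single minimum” point concretely: the hidden units of a layer can be permuted without changing the function the network computes, so every set of weights has symmetric twins with exactly the same cost. A minimal NumPy sketch with made-up data and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data and a tiny 2 -> 3 (ReLU) -> 1 network.
X = rng.normal(size=(50, 2))
y = rng.normal(size=(50, 1))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

def cost(W1, b1, W2, b2):
    h = np.maximum(0, X @ W1 + b1)       # ReLU hidden layer
    return np.mean((h @ W2 + b2 - y) ** 2)

# Swap hidden units 0 and 1: permute the columns of W1/b1 and the rows of W2.
perm = [1, 0, 2]
print(cost(W1, b1, W2, b2))
print(cost(W1[:, perm], b1[:, perm], W2[perm, :], b2))  # identical cost
```

The two calls print the same number even though the weights are different points in parameter space, so if one of them were the minimum, the other would be too; the cost surface cannot be convex with one unique minimum.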


Does this mean that I can get a better model, one that does well on the test set, if I re-train the model with different random starting weights?

How does this play out in practice? Is there a general practice of re-training the model, say, 5 times and picking the one with the best test set performance?

Yes, if you retrain with a different random initialization, you will probably get a different solution, which might be better or might be worse. But if you take your current training and run another 10 iterations, you’ll also get a different solution, right? Maybe it’s only a better approximation of the same one, but it’s still different. The point is that the solution surfaces here are incredibly complex, well beyond the bounds of anything we can visualize with brains trained to think and see in only 3 dimensions. It turns out that the math actually says there are lots of reasonable solutions. And there are more layers to this question anyway: if you could actually achieve the global minimum, that would probably represent extreme overfitting on the training set. Maybe that would be ok if your training set were beyond huge, but it’s still probably not what you want.

If the cost of running your training is not too extreme, then the approach you describe of doing several training runs and then using the solution that gives the best performance on the test set is probably a good strategy.
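As a concrete illustration of that strategy, here is a minimal sketch using scikit-learn’s MLPClassifier on synthetic data; the dataset, network size, and number of restarts are arbitrary choices for the example, not something prescribed by the course:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

best_model, best_score = None, -np.inf
for seed in range(5):  # 5 restarts, each with different random initial weights
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=seed).fit(X_tr, y_tr)
    score = model.score(X_te, y_te)      # accuracy on held-out data
    print(f"seed {seed}: held-out accuracy = {score:.3f}")
    if score > best_score:
        best_model, best_score = model, score
```

One caveat: if you select among restarts using the test set, the winner’s test score is no longer an unbiased estimate, so in practice you would usually pick on a separate validation (dev) set and keep the test set for a final check.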

Here’s a thread that discusses these issues in a bit more detail.

Thank you. It is starting to make sense to me now.