Why does gradient descent work for neural networks?

santosh.pothula · February 20, 2024, 9:20pm

This week we learnt that gradient descent works for neural networks when using “ReLU” or “linear” activation functions. I didn’t understand why that is the case.

For example, when we were learning about logistic regression, the first loss function we tried was mean squared error, but we later saw that it has multiple local minima and switched to the logistic loss function. Similarly, why does using mean squared error in a neural network (NN) work with a complicated architecture and give a single minimum for the NN? Why does gradient descent work for it?
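A quick numeric check of the logistic regression point, as a minimal sketch rather than the course's code: it assumes a single made-up training example (x = 1, y = 1) and one weight w. A convex function never rises above the chord between two of its points, so comparing the loss at a midpoint with the average of the two endpoint losses is enough to expose non-convexity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(w, x=1.0, y=1.0):
    # Squared error on top of a sigmoid output (hypothetical single example).
    return (sigmoid(w * x) - y) ** 2

def logistic_loss(w, x=1.0, y=1.0):
    # Standard logistic (cross-entropy) loss for the same example.
    p = sigmoid(w * x)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def violates_convexity(loss, w_left, w_right):
    # Convexity requires loss(midpoint) <= average of the endpoint losses.
    mid = loss(0.5 * (w_left + w_right))
    chord = 0.5 * (loss(w_left) + loss(w_right))
    return bool(mid > chord)

mse_nonconvex = violates_convexity(squared_error, -4.0, 0.0)
logistic_nonconvex = violates_convexity(logistic_loss, -4.0, 0.0)
print(mse_nonconvex, logistic_nonconvex)  # prints: True False
```

This only shows that squared error with a sigmoid is non-convex in w while the logistic loss passes the same check; it does not by itself exhibit several distinct local minima.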

TMosh · February 20, 2024, 10:37pm

NNs can also use sigmoid() or tanh() activations.

It’s not just ReLU or linear.

TMosh · February 20, 2024, 10:39pm

It doesn’t. NNs always have a non-convex cost function (because of the non-linear activation in the hidden layer). So they don’t have a single minimum.

The trick with an NN is to get a minimum that is “good enough”. The way to judge “good enough” is to verify the results on a test set after training.
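One way to see the non-convexity concretely is the symmetry between hidden units. The sketch below is only an illustration under assumptions that are not from this thread: a tiny network with two tanh hidden units and no biases, and made-up weights. Swapping the two hidden units gives a different point in weight space with exactly the same cost, while the point halfway between the two has a much higher cost, which a convex function would not allow.

```python
import numpy as np

def predict(w, v, x):
    # One hidden layer: y_hat = sum_j v[j] * tanh(w[j] * x)
    return np.tanh(np.outer(x, w)) @ v

def mse(w, v, x, y):
    return float(np.mean((predict(w, v, x) - y) ** 2))

x = np.linspace(-2.0, 2.0, 50)

# Hypothetical weights "A"; the targets are generated by A, so its cost is 0.
w_a, v_a = np.array([1.0, -2.0]), np.array([1.0, -1.0])
y = predict(w_a, v_a, x)

# "B" is the same network with the two hidden units swapped.
w_b, v_b = w_a[::-1], v_a[::-1]

# Midpoint between A and B in weight space.
w_mid, v_mid = 0.5 * (w_a + w_b), 0.5 * (v_a + v_b)

cost_a = mse(w_a, v_a, x, y)
cost_b = mse(w_b, v_b, x, y)
cost_mid = mse(w_mid, v_mid, x, y)
print(cost_a, cost_b, cost_mid)
```

Both A and B fit the data perfectly, but the midpoint has output weights of zero and so predicts 0 everywhere. With two separated minima and a ridge between them, the cost surface cannot be convex.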

santosh.pothula · February 26, 2024, 5:23pm

Does this mean that I could get a model that does better on the test set if I re-train with different random starting weights?

How does this play out in practice? Is there a general practice of re-training the model, say, 5 times and picking the model with the best test set performance?

paulinpaloalto · February 26, 2024, 5:29pm

Yes, if you retrain with a different random initialization, you will probably get a different solution, which might be better or might be worse. But if you take your current training and run another 10 iterations, you’ll also get a different solution, right? Maybe it’s only a better approximation of the same one, but it’s still different. The point is that the solution surfaces here are incredibly complex, well beyond anything we can visualize with brains trained only to think and see in 3 dimensions. It turns out that the math actually says there are lots of reasonable solutions. There are also more layers to this question: if you could actually reach the global minimum, that would probably represent extreme overfitting on the training set. Maybe that would be OK if your training set were beyond huge, but it’s still probably not what you want.

If the cost of running your training is not too extreme, then the approach you describe of doing several training runs and then using the solution that gives the best performance on the test set is probably a good strategy.
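The several-runs strategy can be sketched as follows. This is a minimal NumPy illustration, not code from the course: the toy sine data, the network size, the learning rate, and the helper names (`init_params`, `forward`, `train`) are all assumptions made for the example. It follows the thread in selecting on the test set, although a separate validation set is commonly used for this kind of selection so that the test set stays untouched.

```python
import numpy as np

def init_params(rng, hidden=8):
    # Random starting weights; this is the only thing that differs per run.
    return {"W1": rng.normal(0.0, 1.0, (1, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0.0, 1.0, (hidden, 1)), "b2": np.zeros(1)}

def forward(p, X):
    H = np.tanh(X @ p["W1"] + p["b1"])
    return H, H @ p["W2"] + p["b2"]

def mse(p, X, y):
    return float(np.mean((forward(p, X)[1] - y) ** 2))

def train(p, X, y, lr=0.02, steps=3000):
    # Plain batch gradient descent on the MSE cost.
    n = len(X)
    for _ in range(steps):
        H, out = forward(p, X)
        d_out = 2.0 * (out - y) / n                   # dCost/d_out
        d_H = (d_out @ p["W2"].T) * (1.0 - H ** 2)    # back through tanh
        grads = {"W2": H.T @ d_out, "b2": d_out.sum(axis=0),
                 "W1": X.T @ d_H, "b1": d_H.sum(axis=0)}
        for name in p:
            p[name] -= lr * grads[name]
    return p

# Toy data (assumed for illustration): noisy sine, 150 train / 50 test.
data_rng = np.random.default_rng(0)
X = data_rng.uniform(-3.0, 3.0, (200, 1))
y = np.sin(X) + 0.1 * data_rng.normal(size=X.shape)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

results = []
for seed in range(5):
    params = train(init_params(np.random.default_rng(seed)), X_train, y_train)
    results.append((mse(params, X_test, y_test), seed))
    print(f"seed {seed}: test MSE = {results[-1][0]:.4f}")

best_loss, best_seed = min(results)
print(f"best run: seed {best_seed}")
```

Each seed starts gradient descent from a different point on the cost surface, so the runs generally end at different solutions with different test errors, and the loop simply keeps the best one.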

Here’s a thread which discusses the issues here in a bit more detail.

santosh.pothula · February 26, 2024, 6:37pm

Thank you. It is starting to make sense to me now.
