Why does gradient descent work for neural networks?

paulinpaloalto · February 26, 2024, 5:29pm

Yes, if you retrain with a different random initialization, you will probably get a different solution, which might be better or might be worse. But if you take your current training and run another 10 iterations, you’ll also get a different solution, right? Maybe it’s only a better approximation of the same one, but it’s still different. The point is that the solution surfaces here are incredibly complex. Well beyond the bounds of anything we an visualize with our brains trained only to think and see in 3 dimensions. It turns out that the math actually says that there are lots of reasonable solutions and there are more layers to this question anyway: if you could actually achieve the global minimum, that would probably represent extreme overfitting on the training set. Maybe that would be ok if your training set was beyond huge, but it’s still probably not what you want.

If the cost of running your training is not too extreme, then the approach you describe of doing several training runs and then using the solution that gives the best performance on the test set is probably a good strategy.

Here’s a thread which discusses the issues here in a bit more detail.

Topic		Replies	Views
Week 2 loss function Neural Networks and Deep Learning	1	577	May 23, 2021
Gradient descent C1_W1 Supervised ML: Regression and Classification week-1	2	513	July 31, 2022
Cost function shape in neural network Improving Deep Neural Networks: Hyperparameter tun	2	792	November 15, 2022
Use of squared error with sigmoid and applying gradient descent Neural Networks and Deep Learning week-2 , ai-discussions	6	23	September 30, 2024
Computing from scratch gradient descent for neural networks Advanced Learning Algorithms week-2	2	487	January 21, 2023

Why does gradient descent work for neural networks?

Related topics