Why does gradient descent work for neural networks?

Yes, if you retrain with a different random initialization, you will probably get a different solution, which might be better or might be worse. But if you take your current model and run another 10 training iterations, you'll also get a different solution, right? Maybe it's only a better approximation of the same one, but it's still different. The point is that the loss surfaces here are incredibly complex, well beyond anything we can visualize with brains trained only to think and see in three dimensions. It turns out that the math actually says there are lots of reasonable solutions, and there are more layers to this question anyway: even if you could achieve the global minimum on the training loss, that would probably represent extreme overfitting on the training set. Maybe that would be okay if your training set were beyond huge, but it's still probably not what you want.
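You can see the basic phenomenon without any neural network at all. Here's a toy sketch (my own made-up function, not from the discussion above): plain gradient descent on the nonconvex function f(x) = x^4 - 3x^2 + x, which has two distinct local minima. Different starting points, i.e. different "initializations," converge to different solutions.

```python
# Gradient descent on f(x) = x**4 - 3*x**2 + x, a simple nonconvex
# function with two local minima. Which one you land in depends
# entirely on where you start.
def grad(x):
    # derivative of x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # converges to the minimum near x ≈ -1.30
right = descend(2.0)   # converges to the minimum near x ≈ 1.13
print(round(left, 3), round(right, 3))
```

A real network's loss surface lives in millions of dimensions instead of one, but the same logic applies: gradient descent finds *a* minimum in the basin it starts in, not *the* minimum.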

If the cost of running your training is not too extreme, then the approach you describe, doing several training runs and keeping the solution that gives the best performance on a held-out set, is probably a good strategy. (Strictly speaking, you should select on a validation set rather than the final test set, or your selection itself starts to overfit the test data.)
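The multi-restart strategy can be sketched in a few lines. This is a minimal illustration with a made-up toy task and architecture (one tanh hidden layer, plain NumPy, MSE loss), not anyone's production recipe: train the same network from several random initializations and keep the run that scores best on held-out data.

```python
# Multi-restart training: same model, several random initializations,
# keep the best validation score. Toy regression task: fit y = sin(x).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def init_params(rng, hidden=16):
    # Small random weights; this is the "random initialization".
    return {"W1": rng.normal(scale=0.5, size=(1, hidden)),
            "b1": np.zeros(hidden),
            "W2": rng.normal(scale=0.5, size=(hidden, 1)),
            "b2": np.zeros(1)}

def forward(p, X):
    h = np.tanh(X @ p["W1"] + p["b1"])
    return h, h @ p["W2"] + p["b2"]

def train(p, X, y, lr=0.05, steps=500):
    # Full-batch gradient descent on mean squared error.
    for _ in range(steps):
        h, out = forward(p, X)
        err = out - y
        gW2 = h.T @ err / len(X)
        gb2 = err.mean(axis=0)
        dh = (err @ p["W2"].T) * (1 - h**2)   # backprop through tanh
        gW1 = X.T @ dh / len(X)
        gb1 = dh.mean(axis=0)
        p["W1"] -= lr * gW1; p["b1"] -= lr * gb1
        p["W2"] -= lr * gW2; p["b2"] -= lr * gb2
    return p

def val_loss(p):
    _, out = forward(p, X_val)
    return float(np.mean((out - y_val) ** 2))

# Several restarts, each with its own random initialization:
losses = []
for seed in range(5):
    p = train(init_params(np.random.default_rng(seed)), X_train, y_train)
    losses.append(val_loss(p))
best = min(losses)

print("val losses per restart:", [round(l, 4) for l in losses])
print("best:", round(best, 4))
```

Each restart typically lands in a different region of the loss surface and ends with a slightly different validation loss; you simply keep the winner.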

Here’s a thread which discusses the issues here in a bit more detail.