Deep Neural Networks Have Many Global Optima

[Note: this material was authored by mentor Gordon Robinson and is the contents of a thread he created on the Coursera Forums of the previous version of the course. I’m bringing it over to the new Discourse platform with Gordon’s permission. Thanks, @GordonRobinson!]

A deep neural network has many, many global optima. This is true for any loss function: no choice of loss function makes the cost a convex function of the weights of a deep network.

The reason is simple: the internal structure of the network is so symmetric that we can change the weights (systematically, using permutations) to produce another set of weights that gives exactly the same output for any input. What we are doing is interchanging the roles that some neurons play.

Starting simple, we can consider an interchange of the roles that neurons 1 and 2 in a layer play. By swapping rows 1 and 2 (using this course’s conventions) in the weight matrix (and bias vector) that calculates values for that layer, the values of neurons 1 and 2 are interchanged. If we also interchange columns 1 and 2 of the weight matrix for the next layer, we effectively switch them back.
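To see that concretely, here is a minimal NumPy sketch (my own illustration, not course code) with a made-up network of 3 inputs, 4 hidden ReLU neurons and 1 linear output; both forward passes print the same value:

import numpy as np

rng = np.random.default_rng(0)

# A tiny network: 3 inputs -> 4 hidden neurons (ReLU) -> 1 linear output
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

def forward(x, W1, b1, W2, b2):
    a1 = np.maximum(0, W1 @ x + b1)   # hidden layer
    return W2 @ a1 + b2               # output layer

# Swap the roles of hidden neurons 0 and 1 (neurons "1 and 2" in the
# post's 1-based numbering): rows 0 and 1 of W1 and b1, columns 0 and 1 of W2.
perm = [1, 0, 2, 3]
W1s, b1s, W2s = W1[perm, :], b1[perm, :], W2[:, perm]

x = rng.standard_normal((3, 1))
print(forward(x, W1, b1, W2, b2))     # original weights
print(forward(x, W1s, b1s, W2s, b2))  # permuted weights: identical output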

An interchange is a very simple example of a permutation. Any permutation of the neurons in a layer produces an equivalent set of weights. And when we have many layers we can do these permutations independently on several layers. A formal description of this process uses permutation matrices and their transposes/inverses in the equations.
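Here is a sketch of that formal description, assuming this course's forward-prop notation a^{[l]} = g(W^{[l]} a^{[l-1]} + b^{[l]}), with P a permutation matrix (so P^{\top} P = I) and g applied element-wise (so g(Pz) = P g(z)). Define the permuted weights as

\tilde{W}^{[l]} = P W^{[l]}, \qquad \tilde{b}^{[l]} = P b^{[l]}, \qquad \tilde{W}^{[l+1]} = W^{[l+1]} P^{\top}

then

\tilde{W}^{[l+1]} g(\tilde{W}^{[l]} a^{[l-1]} + \tilde{b}^{[l]}) = W^{[l+1]} P^{\top} P g(W^{[l]} a^{[l-1]} + b^{[l]}) = W^{[l+1]} a^{[l]}

So the permuted weights feed exactly the same values into layer l+1 as the original ones did.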

When our gradient descent search finds a global minimum of the cost, there are therefore many other sets of weights that produce exactly that same minimal cost. Many global optima.

How many? Take the factorial of the number of neurons in one hidden layer, multiply it by the factorial of the number of neurons in the next hidden layer, and so on across the layers. This is an astronomical number of equivalent sets of weights.
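As a rough illustration of that count (the layer sizes below are made up, not from the course), in Python:

from math import factorial, prod

hidden_layer_sizes = [20, 7, 5]   # hypothetical hidden layer widths

# Each hidden layer's neurons can be permuted independently of the others,
# so the counts multiply: one factorial per hidden layer.
equivalent_optima = prod(factorial(n) for n in hidden_layer_sizes)
print(equivalent_optima)          # 20! * 7! * 5! is already about 1.5 * 10^{24}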

Even stranger things can happen: consider a path through weight space between two such optima that differ only by swapping two neurons. Somewhere along that path there is a point where the two neurons play completely balanced roles (that point is probably not itself an optimum). The cost surface around such a point can be very flat, and again, there are lots of such points.

These symmetries have been known for a long time. Some papers that may be of interest have titles including terms like weight-space symmetry and visualizing loss landscapes.


It might be worth making Gordon’s point about the “astronomical number of symmetries” a little more concrete. The combinatorics of permutations get out of hand pretty quickly. One “real world” example that might help bring this point into sharper focus is this question: how many possible unique shuffles are there of a standard deck of playing cards? Well, a deck is 52 cards, of course. So I have 52 choices for the first card, 51 for the second and so forth. So the total number of unique shuffles is:

P = 52 * 51 * 50 * ... * 3 * 2 * 1

That is what mathematicians call “52 factorial”, which is written 52!.

As you can guess, 52! is a big number, but you might be surprised just how big. Here’s what MATLAB has to say:

>> p = factorial(52)
p =
   8.0658e+67
>>

So it’s roughly 8.1 * 10^{67}. That is a very big number. Just as a comparison, consider the number of seconds in the life of the known Universe. If we call the time since the Big Bang 13 billion years give or take, that gives:

Usec = 13 * 10^{9} * 365 * 24 * 60 * 60 = 4.0997 * 10^{17}

So that’s a big number, but about 50 orders of magnitude smaller than 52!. To make that even more clear, let’s do a little thought experiment:

Suppose we have a fleet of ultra-fast computers, each of which can perform one shuffle of the deck per picosecond (10^{-12} seconds), and that we have a trillion of those computers (10^{12}). So that's a trillion computers each doing a trillion shuffles per second, or 10^{24} shuffles per second in total. Now we let that whole fleet of computers run for the life of the Universe so far, and what does that give us:

Total = 4.1 * 10^{17} * 10^{24} = 4.1 * 10^{41}

That’s a really big number, but still smaller than 52! by a factor of roughly 2 * 10^{26}. That means that, even with our incredibly fast fleet of shuffling computers, it would still take more than 10^{26} times 13 billion years to compute all possible shuffles of a mere 52 cards.
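If you'd like to check those figures without MATLAB, the same arithmetic in Python (using the numbers quoted above) looks like this:

from math import factorial

shuffles = factorial(52)                      # 52! ~ 8.0658 * 10^{67}
universe_seconds = 13e9 * 365 * 24 * 60 * 60  # ~ 4.0997 * 10^{17} seconds since the Big Bang
fleet_rate = 1e12 * 1e12                      # a trillion computers, a trillion shuffles/sec each
fleet_total = universe_seconds * fleet_rate   # ~ 4.1 * 10^{41} shuffles in the Universe's lifetime

print(f"{shuffles:.4e}")                      # 8.0658e+67
print(f"{shuffles / fleet_total:.1e}")        # ~ 2.0e+26 lifetimes of the Universe needed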

So 52! is a mind-bogglingly large number. And it's pretty rare to see a deep neural network with fewer than thousands of parameters.

So the short summary is: Gordon was absolutely not kidding when he said the number of weight symmetries is astronomical. The other thing you can conclude from this is that the actual structure of the cost surfaces in high dimensions is also mind-bogglingly complex. Our human brains just haven't evolved to be very good at visualizing things in more than 3 dimensions.