[Note: this material was authored by mentor Gordon Robinson and is the contents of a thread he created on the Coursera Forums of the previous version of the course. I’m bringing it over to the new Discourse platform with Gordon’s permission. Thanks, @GordonRobinson!]
A deep neural network has many, many global optima. This is true no matter which loss function you choose: no loss is convex as a function of the weights of a deep network.
The reason is simple: the internal structure of the network is symmetric, so we can change the weights (systematically, using permutations) to produce another set of weights that gives exactly the same output for any input. What we are doing is interchanging the roles that some neurons play.
Starting simple, consider interchanging the roles that neurons 1 and 2 in a hidden layer play. Swapping rows 1 and 2 (using this course's conventions) of the weight matrix (and bias vector) that computes that layer interchanges the values of neurons 1 and 2. If we also swap columns 1 and 2 of the weight matrix for the next layer, that layer reads the values back in their original order, so the network computes exactly the same output.
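Here is a minimal NumPy sketch of that interchange (the layer sizes, weights, and function name below are made up for illustration, not taken from the course assignments): it swaps two hidden neurons and checks that the network still computes the same output.

```python
import numpy as np

# A tiny 2-layer network in this course's conventions:
# a1 = relu(W1 @ x + b1),  y = W2 @ a1 + b2,
# with weights of shape (n_out, n_in) and column-vector activations.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

def forward(W1, b1, W2, b2, x):
    a1 = np.maximum(0, W1 @ x + b1)   # hidden layer with ReLU
    return W2 @ a1 + b2               # output layer (linear)

# Swap the roles of hidden neurons 0 and 1:
# rows 0 and 1 of W1 and b1, then columns 0 and 1 of W2.
swap = [1, 0, 2, 3]
W1s, b1s = W1[swap, :], b1[swap, :]
W2s = W2[:, swap]

x = rng.normal(size=(3, 1))
print(np.allclose(forward(W1, b1, W2, b2, x),
                  forward(W1s, b1s, W2s, b2, x)))
# True: the same function, up to floating-point round-off
```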
An interchange is a very simple example of a permutation. Any permutation of the neurons in a layer produces an equivalent set of weights. And when we have many layers we can do these permutations independently on several layers. A formal description of this process uses permutation matrices and their transposes/inverses in the equations.
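As a sketch in this course's notation (where $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ and the activation $g$ is applied element-wise), applying a permutation matrix $P$ to layer $l$ means replacing

$$\tilde{W}^{[l]} = P\,W^{[l]}, \qquad \tilde{b}^{[l]} = P\,b^{[l]}, \qquad \tilde{W}^{[l+1]} = W^{[l+1]}\,P^{\top},$$

and because $P^{\top} = P^{-1}$ and $g(Pz) = P\,g(z)$ for an element-wise activation,

$$\tilde{z}^{[l+1]} = W^{[l+1]} P^{\top}\, g\!\left(P\, z^{[l]}\right) + b^{[l+1]} = W^{[l+1]}\, g\!\left(z^{[l]}\right) + b^{[l+1]} = z^{[l+1]},$$

so every later layer, and therefore the cost, is unchanged.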
When our gradient descent search finds a global minimum for cost, there are therefore many other sets of weights that produce exactly that same minimal cost. Many global optima.
How many? Multiply together the factorials of the number of neurons in each hidden layer (the input and output layers cannot be permuted, because the data fixes their order): with hidden layers of sizes $n^{[1]}, n^{[2]}, \dots$ there are $n^{[1]}! \times n^{[2]}! \times \cdots$ weight configurations that all compute exactly the same function. This is an astronomical number of equivalents.
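To get a feel for the size (the hidden-layer sizes here are made up), a quick calculation:

```python
import math

# Hypothetical hidden-layer sizes; the number of equivalent weight
# configurations is the product of the factorials of these sizes.
hidden_layer_sizes = [20, 7, 5]
equivalent_optima = math.prod(math.factorial(n) for n in hidden_layer_sizes)
print(f"{equivalent_optima:.2e}")   # roughly 1.5e+24 equivalent global optima
```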
Even stranger things can happen: if you trace a path through weight space between two such optima that correspond to swapping two neurons, there is a point along the way where the two neurons play exactly balanced roles (this point is most likely not another optimum). The cost surface around such a point can be very flat, and again, there are lots of such points.
These symmetries have been known for a long time. Some papers that may be of interest have titles including terms like *weight-space symmetry* and *visualizing loss landscapes*.