Uniqueness of solutions in shallow 2-layer NN

This is about a neural network with 1 hidden layer and 1 output layer, so a 2-layer NN in Andrew’s terminology. I used the noisy_moons data from sklearn.datasets:
noisy_moons = sklearn.datasets.make_moons(n_samples=5000, noise=.2)

I used a tanh activation function for the hidden layer and a sigmoid for the output layer, and ran it with 4 hidden units.
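
For reference, here is a minimal sketch of the kind of model described above, assuming plain NumPy, batch gradient descent on the cross-entropy cost, and made-up hyperparameters (learning rate, iteration count, and the 0.01 initialization scale are guesses, not from the original post):

```python
import numpy as np
import sklearn.datasets

X, y = sklearn.datasets.make_moons(n_samples=5000, noise=.2)
X, y = X.T, y.reshape(1, -1)              # shapes (2, m) and (1, m)
m = X.shape[1]

n_x, n_h, n_y = 2, 4, 1                   # 2 inputs, 4 hidden units, 1 output
rng = np.random.default_rng()             # no fixed seed, so every run starts differently
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.2                                  # learning rate: an assumption for illustration
for i in range(10000):
    # forward pass: tanh hidden layer, sigmoid output
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    A2 = sigmoid(W2 @ A1 + b2)
    # cross-entropy cost
    cost = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))
    # backward pass
    dZ2 = A2 - y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh derivative
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.mean(axis=1, keepdims=True)
    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = np.mean((A2 > 0.5) == y)
print("final cost:", cost, "accuracy:", accuracy)
```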

Every time I run the model, I converge to around the same cost and the same accuracy of 97.1%, but the weights for the hidden layer are different.

e.g.
Run1 (4 hidden units, 2-D X input as above)
Final cost: 0.08163652789504161
Weights for the hidden layer (row = hidden unit index, columns = the two input weights):
1   -3.099451    0.973812
0   -2.667409    1.146674
2    1.932482    1.273789
3    2.582570   -0.299258

Run2 (4 hidden units, 2-D X input as above)
Final cost: 0.08117809268985378
Weights for the hidden layer (row = hidden unit index, columns = the two input weights):
3   -2.838430    1.277868
1   -1.981524   -1.347578
2   -0.517203   -0.098092
0    2.973090   -0.781550

The final costs are very similar but the weights vary.
Is this OK?

These are great questions! You always learn something by trying to apply what we’ve learned. The cost function for a neural network is no longer convex, so there are lots of local optima. If you are not specifying a fixed random seed for your initialization, then it’s perfectly possible that you’ll find a different solution every time. The number of possible distinct solutions is combinatorially huge: just permuting your 4 hidden units (and, since tanh is an odd function, flipping their signs) produces weight matrices that look different but compute exactly the same function, which is essentially what your two runs show. Fortunately most of these solutions have pretty similar performance. There’s an important paper from Yann LeCun’s group about this.
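
Here is a small self-contained sketch of that permutation symmetry (the weights, inputs, and the particular permutation are arbitrary, just for illustration): reordering the hidden units, i.e. the rows of W1/b1 together with the matching columns of W2, changes the printed weights but not the function the network computes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal((4, 1))   # 4 hidden units, 2 inputs
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))   # 1 output unit
X = rng.standard_normal((2, 10))                                    # 10 arbitrary 2-D inputs

out = sigmoid(W2 @ np.tanh(W1 @ X + b1) + b2)

# Shuffle the hidden units: permute rows of W1/b1 and the matching columns of W2.
perm = np.array([2, 0, 3, 1])
out_perm = sigmoid(W2[:, perm] @ np.tanh(W1[perm] @ X + b1[perm]) + b2)

print(np.allclose(out, out_perm))   # True: different-looking weights, identical predictions
```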

The other point to make is that a lower cost is not really the end goal, right? It’s the prediction accuracy that we actually care about, but the cost is an easy proxy for whether convergence is working or not. Do you also get similar performance between your various solutions when you evaluate them using prediction accuracy? (Oh, sorry, you already said that: you get the same 97% accuracy. So it’s all good!)
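
If you want to compare runs on the metric that matters, a pair of hypothetical helpers like these (the names and signatures are mine, not from the course code) does the job; `probs` here means the sigmoid outputs of a trained model on a held-out set:

```python
import numpy as np

def accuracy(probs, labels):
    # Fraction of examples where the thresholded prediction matches the true label.
    return np.mean((probs > 0.5).astype(int) == labels)

def agreement(probs_a, probs_b):
    # How often two independently trained runs predict the same class.
    return np.mean((probs_a > 0.5) == (probs_b > 0.5))
```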

Thanks! The paper was very helpful.
