When we implemented Logistic Regression in Week 2 of Course 1, Prof Ng tells us that it suffices to use zero initialization for the w and b values that will be learned through Gradient Descent and we will still get a valid solution. But when we get to real Neural Networks in Week 3 with the 2 layer net, we can no longer get away with using zero initializations and need to use random initializations of all the W^{[l]} values in order to “break symmetry”. But Prof Ng only mentions this briefly and doesn’t really get into much detail. Frequently people ask why Logistic Regression is different in this respect. As always, it comes back to the math.
Let’s start by looking at the formula for the gradient of w in Logistic Regression:
dw = \displaystyle \frac{1}{m} X \cdot (A - Y)^T
If we start with w and b all zeros, then A will be 0.5 for all samples, since sigmoid(0) = 0.5 . The Y values are all either 0 or 1 (the labels), so you end up with a non-zero values for dw. That means that Gradient Descent can learn a different value for w even starting with w and b as zeros. It works similarly for b.
Now look at the formulas for the shallow two layer NN case in DLS C1 W3 and things are quite a bit more complicated:
dZ^{[2]} = A^{[2]} - Y
dW^{[2]} = \displaystyle \frac{1}{m} dZ^{[2]} \cdot A^{[1]T}
dZ^{[1]} = W^{[2]T} \cdot dZ^{[2]} * g^{[1]'}(Z^{[1]})
dW^{[1]} = \displaystyle \frac{1}{m} dZ^{[1]} \cdot X^T
Starting with both W and both b values zero, you can see that A^{[1]} will be all zeros because g^{[1]} is tanh
and tanh(0) = 0
. So that will result in dW^{[2]} being zero. Then W^{[2]} is all zeros, so dZ^{[1]} will be zero and that means dW^{[1]} will also end up being zero. So Gradient Descent is stuck: it can’t change either of the W matrices, meaning that no learning can take place.
Note that there are some further subtleties to explore here: e.g. what if we changed the layer one activation to something like sigmoid that doesn’t give 0 output for 0 input, would that change the result? It turns out it does not. See below for more on that.
It actually turns out that you have a choice of the method to “break symmetry” here: you can do as Prof Ng recommends and randomly initialize all the W matrices, but set the biases to zero initially. Or you can do the reverse: set the W values all to zero, but randomly initialize the bias values. In either case symmetry will be broken and Gradient Descent can learn.
What about using non-zero initializations but making them all the same?
The next question is what happens if we use an activation function in the hidden layer for which g(0) \neq 0? Or we could initialize the weights and biases with the same small values as in:
W1 = np.ones((n1, n0)) * 0.01
It turns out that what happens if you use that style of initialization is that you can learn new values of all the W^{[l]} and b^{[l]} parameters, but the values that you get are themselves symmetric in the sense that all the neurons output the same values at every level. That means that it’s equivalent to a network with just a single neuron. So it’s not just that there’s something bad about starting with zero values: what is bad is starting with all the same values, either zero or non-zero. It really is the case that you need to “break symmetry” by starting with values that are all different at least for the W^{[l]} values. We could dig deeper and prove this mathematically, but it’s easy enough and kind of fun just to try it yourself and watch what happens. I used the Week 4 Application assignment code and then added my own “init” routine using the technique above and then ran 200 iterations and printed out all the W^{[l]} and b^{[l]} values and it’s easy to see what happens!