If W and b are initialized to zeros, I understand that this neural network may reduce to logistic regression, but why can't W and b be learned as the loss function is minimized?
Hello @Yuan4,
Let's use these two slides from Andrew's lecture titled "Forward and Backward Propagation" in C1 W4:
The first one; note that I have added a^{[l]} above the arrows of the forward propagation.
The second one; note that I have added some equation numbers to aid the walkthrough below.
OK. Let's walk through the forward and backward propagation to see why initializing all weights and biases to zero won't work in this 3-layer neural network.
Forward phase
- a^{[0]} is nonzero. That's fine.
- a^{[1]} = 0 because all weights and biases are zero. BAD.
- a^{[2]} = 0 because all weights and biases are zero.
- a^{[3]} = 0 because all weights and biases are zero.
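To make the forward phase concrete, here is a minimal sketch (the layer sizes are my own choice, and I use tanh for every layer so that g(0) = 0, matching the walkthrough; a sigmoid output would give 0.5 instead, which is just as uninformative because it is identical for every input):

```python
import numpy as np

rng = np.random.default_rng(0)
a0 = rng.normal(size=(4, 1))   # a^{[0]}: a nonzero input example

layer_sizes = [4, 3, 3, 1]     # illustrative widths, not from the lecture
a = a0
for l in (1, 2, 3):
    W = np.zeros((layer_sizes[l], layer_sizes[l - 1]))  # W^{[l]} = 0
    b = np.zeros((layer_sizes[l], 1))                   # b^{[l]} = 0
    z = W @ a + b                                       # z^{[l]} = 0
    a = np.tanh(z)                                      # a^{[l]} = tanh(0) = 0
    print(f"a[{l}] =", a.ravel())
```

Every activation after the input collapses to zero, no matter what the input was.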
Backward phase

- da^{[3]} is nonzero because the errors are nonzero.
- dz^{[3]} is nonzero by equation (1), since da^{[3]} \ne 0.
- dw^{[3]} = 0 by equation (2), since a^{[2]} = 0.
- db^{[3]} is nonzero by equation (3), since dz^{[3]} \ne 0.

- da^{[2]} = 0 by equation (4), since w^{[3]} = 0.
- dz^{[2]} = 0 by equation (1), since da^{[2]} = 0.
- dw^{[2]} = 0 by equation (2), since dz^{[2]} = 0.
- db^{[2]} = 0 by equation (3), since dz^{[2]} = 0.

- da^{[1]} = 0 by equation (4), since w^{[2]} = 0.
- dz^{[1]} = 0 by equation (1), since da^{[1]} = 0.
- dw^{[1]} = 0 by equation (2), since dz^{[1]} = 0.
- db^{[1]} = 0 by equation (3), since dz^{[1]} = 0.
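The whole backward walkthrough can be checked numerically. This is a hedged sketch, not the course's official code: shapes and activations (tanh hidden layers, sigmoid output with log loss) are my assumptions, and the lines mirror equations (1)-(4), i.e. dz^{[l]} = da^{[l]} g'(z^{[l]}), dW^{[l]} = dz^{[l]} a^{[l-1]T}/m, db^{[l]} = mean of dz^{[l]}, da^{[l-1]} = W^{[l]T} dz^{[l]}:

```python
import numpy as np

m = 5
rng = np.random.default_rng(1)
a0 = rng.normal(size=(4, m))              # nonzero inputs
y = rng.integers(0, 2, size=(1, m))       # binary labels

sizes = [4, 3, 3, 1]
W = {l: np.zeros((sizes[l], sizes[l - 1])) for l in (1, 2, 3)}
b = {l: np.zeros((sizes[l], 1)) for l in (1, 2, 3)}

# Forward: tanh hidden layers, sigmoid output
a, z = {0: a0}, {}
for l in (1, 2):
    z[l] = W[l] @ a[l - 1] + b[l]         # zero
    a[l] = np.tanh(z[l])                  # zero
z[3] = W[3] @ a[2] + b[3]
a[3] = 1.0 / (1.0 + np.exp(-z[3]))        # 0.5 for every example

# Backward, one step
dz3 = a[3] - y                            # nonzero: the errors are nonzero
dW3 = dz3 @ a[2].T / m                    # eq (2): zero, because a^{[2]} = 0
db3 = dz3.mean(axis=1, keepdims=True)     # eq (3): nonzero
da2 = W[3].T @ dz3                        # eq (4): zero, because W^{[3]} = 0
dz2 = da2 * (1.0 - a[2] ** 2)             # eq (1): zero
dW2 = dz2 @ a[1].T / m                    # eq (2): zero
db2 = dz2.mean(axis=1, keepdims=True)     # eq (3): zero
print("dW3 zero?", np.allclose(dW3, 0), "| db3 zero?", np.allclose(db3, 0))
```

Only db3 survives; every other gradient is exactly zero, so gradient descent cannot move the weights.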
Therefore, only the bias in the 3rd layer can be updated; the rest can't, because their gradients are always zero. In other words, our neural network can't learn any further. This is why we don't want to initialize everything to zeros in a multi-layer network.
However, zero initialization doesn't always fail. If we had only the output layer (i.e., plain logistic regression), the following would happen instead:
Forward phase
- a^{[0]} is nonzero. That's fine.
- a^{[1]} = 0 because all weights and biases are zero.
Backward phase

- da^{[1]} is nonzero because the errors are nonzero.
- dz^{[1]} is nonzero by equation (1), since da^{[1]} \ne 0.
- dw^{[1]} is nonzero by equation (2), since a^{[0]} \ne 0.
- db^{[1]} is nonzero by equation (3), since dz^{[1]} \ne 0.
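Here is a small sketch of that single-layer case (the dataset, learning rate, and iteration count are my own illustrative choices): logistic regression starting from all zeros still learns, because dW^{[1]} = dz^{[1]} a^{[0]T}/m depends on the nonzero input a^{[0]}.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200
a0 = rng.normal(size=(2, m))
y = (a0[0:1] + a0[1:2] > 0).astype(float)   # a linearly separable target

W = np.zeros((1, 2))                        # zero-initialized, on purpose
b = np.zeros((1, 1))
lr = 0.5
for _ in range(200):
    z = W @ a0 + b
    a1 = 1.0 / (1.0 + np.exp(-z))           # sigmoid output
    dz = a1 - y
    W -= lr * (dz @ a0.T / m)               # nonzero gradient from step one
    b -= lr * dz.mean(axis=1, keepdims=True)

acc = ((a1 > 0.5) == y).mean()
print("accuracy:", acc)
```

Despite starting from zeros, the weights move immediately and the model fits the data.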
There is a very relevant topic called "symmetry breaking" and I am going to refer you to this excellent post by our mentor Paul.
It is a common practice to initialize the weights randomly and the biases to zero.
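A minimal sketch of that common practice (the helper name and the 0.01 scale are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_out, n_in):
    """Small random weights break symmetry; zero biases are fine."""
    W = rng.normal(size=(n_out, n_in)) * 0.01  # random: each unit starts different
    b = np.zeros((n_out, 1))                   # zeros are harmless here
    return W, b

W1, b1 = init_layer(3, 4)
```

Because the weights are random, different hidden units receive different gradients from the very first step, so they can learn different features.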
Cheers,
Raymond
PS: I hope I didn't make any typos. Please let me know if you spot one. Thank you!
Thanks a lot! I understand now.