# Why does zero initialization fail?

If W and b are initialized to zeros, I understand that this neural network may reduce to a logistic regression, but why can't W and b still be learnt as the loss function is minimized?

Hello @Yuan4,

Let's use these two slides from Andrew's lecture titled "Forward and Backward Propagation" in C1 W4:

The first one, and note that I have added a^{[l]} above the arrows of the forward propagation.

The second one, and note that I have added some equation numbers to aid the walk-through below.

OK. Let's walk through the forward and backward propagation to see why initializing all weights and biases to zero won't work in this 3-layer neural network.

Forward phase

1. a^{[0]} is non-zero. That's fine.
2. a^{[1]} = g(0) = 0 because all weights and biases are zero (assuming a ReLU or tanh hidden activation, for which g(0) = 0). BAD.
3. a^{[2]} = 0 for the same reason.
4. a^{[3]} = g(0) is a constant for the same reason (0.5 for a sigmoid output), so it carries no information about the input.
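
The forward phase above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course code: the layer sizes and the input vector are made up, and I am assuming ReLU hidden activations with a sigmoid output (the lecture's convention). Note that sigmoid(0) = 0.5, so a^{[3]} comes out as a constant 0.5; either way, the output is independent of the input.

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Hypothetical layer sizes: 2 inputs, two hidden layers of 3 units, 1 output.
sizes = [2, 3, 3, 1]
W = [np.zeros((sizes[l + 1], sizes[l])) for l in range(3)]  # all-zero weights
b = [np.zeros((sizes[l + 1], 1)) for l in range(3)]         # all-zero biases

a0 = np.array([[0.5], [-1.2]])   # a^[0] is non-zero -- that's fine
a1 = relu(W[0] @ a0 + b[0])      # all zeros: ReLU(0) = 0
a2 = relu(W[1] @ a1 + b[1])      # all zeros again
a3 = sigmoid(W[2] @ a2 + b[2])   # sigmoid(0) = 0.5, a constant

print(a1.ravel(), a2.ravel(), a3.ravel())
```

Whatever input you feed in, a^{[1]} and a^{[2]} stay at exactly zero and a^{[3]} stays at 0.5.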

Backward phase

1. da^{[3]} is non-zero because the errors are non-zero.
• dz^{[3]} is non-zero by equation (1) and da^{[3]} \ne 0.
• dw^{[3]} = 0 by equation (2) and a^{[2]} = 0.
• db^{[3]} is non-zero by equation (3) and dz^{[3]} \ne 0.
2. da^{[2]}=0 by equation (4) and w^{[3]}=0.
• dz^{[2]}=0 by equation (1) and da^{[2]}=0.
• dw^{[2]}=0 by equation (2) and dz^{[2]}=0.
• db^{[2]}=0 by equation (3) and dz^{[2]}=0.
3. da^{[1]}=0 by equation (4) and w^{[2]}=0.
• dz^{[1]}=0 by equation (1) and da^{[1]}=0.
• dw^{[1]}=0 by equation (2) and dz^{[1]}=0.
• db^{[1]}=0 by equation (3) and dz^{[1]}=0.
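
The backward phase can be checked numerically too. Below is a minimal sketch of one gradient computation under the same assumptions (cross-entropy loss with a sigmoid output, so dz^{[3]} = a^{[3]} - Y; the sizes and data are made up). Only layers 3 and 2 are shown since the pattern just repeats for layer 1.

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

np.random.seed(0)
n = [2, 3, 3, 1]                                  # made-up layer sizes
W = [np.zeros((n[i + 1], n[i])) for i in range(3)]
b = [np.zeros((n[i + 1], 1)) for i in range(3)]
X = np.random.randn(2, 5)                         # 5 toy training examples
Y = np.array([[1., 0., 1., 1., 0.]])              # toy labels

# Forward pass (ReLU hidden layers, sigmoid output)
a0 = X
z1 = W[0] @ a0 + b[0]; a1 = relu(z1)
z2 = W[1] @ a1 + b[1]; a2 = relu(z2)
z3 = W[2] @ a2 + b[2]; a3 = sigmoid(z3)

m = X.shape[1]
dz3 = a3 - Y                                      # non-zero: the errors are non-zero
dW3 = dz3 @ a2.T / m                              # zero, because a2 == 0
db3 = dz3.sum(axis=1, keepdims=True) / m          # non-zero!
da2 = W[2].T @ dz3                                # zero, because W3 == 0
dz2 = da2 * (z2 > 0)                              # zero
dW2 = dz2 @ a1.T / m                              # zero
db2 = dz2.sum(axis=1, keepdims=True) / m          # zero
```

Printing these confirms the walk-through: every gradient is exactly zero except db^{[3]}.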

Therefore, only the bias in the 3rd layer can be updated; the rest can't, because their gradients are always zero. In other words, our neural network can't learn any further. This is why we don't want to initialize them to zeros for a multi-layer network.

However, zero initialization doesn't always fail. If we had only the output layer (i.e. plain logistic regression), then the following would happen instead:

Forward phase

1. a^{[0]} is non-zero. That's fine.
2. a^{[1]} = g(0) is a constant (0.5 for a sigmoid output) because all weights and biases are zero.

Backward phase

1. da^{[1]} is non-zero because the errors are non-zero.
• dz^{[1]} is non-zero by equation (1) and da^{[1]} \ne 0.
• dw^{[1]} is non-zero by equation (2) and a^{[0]} \ne 0.
• db^{[1]} is non-zero by equation (3).
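
Here is a quick demonstration that the single-layer case really does learn from zero initialization. This is a toy sketch with made-up, linearly separable data, not the course assignment:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

np.random.seed(1)
X = np.random.randn(2, 100)
Y = (X[0:1] + X[1:2] > 0).astype(float)   # toy target: is x1 + x2 positive?

w = np.zeros((1, 2))                       # zero init is fine with a single layer
b = 0.0
m = X.shape[1]

for _ in range(500):
    a = sigmoid(w @ X + b)
    dz = a - Y
    w -= 0.5 * (dz @ X.T) / m             # dw != 0 because the input X is non-zero
    b -= 0.5 * dz.sum() / m

acc = ((sigmoid(w @ X + b) > 0.5) == Y).mean()
print(acc)   # high accuracy: the model learned despite w starting at zero
```

Because dw^{[1]} depends on a^{[0]} (the raw, non-zero input) rather than on a zeroed-out previous activation, gradient descent moves the weights away from zero on the very first step.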

There is a very relevant topic called "symmetry breaking" and I am going to refer you to this excellent post by our mentor Paul.

It is a common practice to initialize the weights randomly and the biases to zero.
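
That practice might look like the following sketch (the function name, the 0.01 scale, and the layer sizes are my own illustrative choices; deeper networks usually prefer He or Xavier scaling):

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=42):
    """Small random weights to break symmetry; zero biases are fine."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

p = initialize_parameters([2, 3, 3, 1])
print(p["W1"].shape, p["b1"].shape)
```

Random weights give every unit a different starting point, so their gradients differ and they can learn different features; the biases can stay at zero because the random weights already break the symmetry.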

Cheers,
Raymond

PS: I hope I didn't make any typo. Please let me know if there is any. Thank you!


Thanks a lot! I understand now.

You are welcome @Yuan4!

Raymond