If W and b are initialized to zeros, I understand that this neural network may reduce to logistic regression, but why can't W and b be learned as the loss function is minimized?
Hello @Yuan4,
Let's use these two slides from Andrew's lecture titled "Forward and Backward Propagation" in C1 W4:
The first one; note that I have added a^{[l]} above the arrows of the forward propagation.
The second one; note that I have added some equation numbers to aid the walkthrough below.
OK. Let's walk through the forward and backward propagation to see why initializing all weights and biases to zero won't work in this 3-layer neural network.
Forward phase
- a^{[0]} is nonzero. That's fine.
- a^{[1]} = 0 because all weights and biases are zero. BAD.
- a^{[2]} = 0 because all weights and biases are zero.
- a^{[3]} = 0 because all weights and biases are zero.
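To make the forward phase concrete, here is a minimal sketch (the layer sizes are my own choice, and I use tanh for every layer so that g(0) = 0, matching the walkthrough; a sigmoid output would give 0.5 instead, which is just as uninformative because it is identical for every input):

```python
import numpy as np

rng = np.random.default_rng(0)
a0 = rng.normal(size=(4, 1))   # a^{[0]}: a nonzero input example

layer_sizes = [4, 3, 3, 1]     # illustrative widths, not from the lecture
a = a0
for l in (1, 2, 3):
    W = np.zeros((layer_sizes[l], layer_sizes[l - 1]))  # W^{[l]} = 0
    b = np.zeros((layer_sizes[l], 1))                   # b^{[l]} = 0
    z = W @ a + b                                       # z^{[l]} = 0
    a = np.tanh(z)                                      # a^{[l]} = tanh(0) = 0
    print(f"a[{l}] =", a.ravel())
```

Every activation after the input collapses to zero, no matter what the input was.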
Backward phase

- da^{[3]} is nonzero because the errors are nonzero.
- dz^{[3]} is nonzero by equation (1), since da^{[3]} \ne 0.
- dw^{[3]} = 0 by equation (2), since a^{[2]} = 0.
- db^{[3]} is nonzero by equation (3), since dz^{[3]} \ne 0.

- da^{[2]} = 0 by equation (4), since w^{[3]} = 0.
- dz^{[2]} = 0 by equation (1), since da^{[2]} = 0.
- dw^{[2]} = 0 by equation (2), since dz^{[2]} = 0.
- db^{[2]} = 0 by equation (3), since dz^{[2]} = 0.

- da^{[1]} = 0 by equation (4), since w^{[2]} = 0.
- dz^{[1]} = 0 by equation (1), since da^{[1]} = 0.
- dw^{[1]} = 0 by equation (2), since dz^{[1]} = 0.
- db^{[1]} = 0 by equation (3), since dz^{[1]} = 0.
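The whole backward walkthrough can be checked numerically. This is a hedged sketch, not the course's official code: shapes and activations (tanh hidden layers, sigmoid output with log loss) are my assumptions, and the lines mirror equations (1)-(4), i.e. dz^{[l]} = da^{[l]} g'(z^{[l]}), dW^{[l]} = dz^{[l]} a^{[l-1]T}/m, db^{[l]} = mean of dz^{[l]}, da^{[l-1]} = W^{[l]T} dz^{[l]}:

```python
import numpy as np

m = 5
rng = np.random.default_rng(1)
a0 = rng.normal(size=(4, m))              # nonzero inputs
y = rng.integers(0, 2, size=(1, m))       # binary labels

sizes = [4, 3, 3, 1]
W = {l: np.zeros((sizes[l], sizes[l - 1])) for l in (1, 2, 3)}
b = {l: np.zeros((sizes[l], 1)) for l in (1, 2, 3)}

# Forward: tanh hidden layers, sigmoid output
a, z = {0: a0}, {}
for l in (1, 2):
    z[l] = W[l] @ a[l - 1] + b[l]         # zero
    a[l] = np.tanh(z[l])                  # zero
z[3] = W[3] @ a[2] + b[3]
a[3] = 1.0 / (1.0 + np.exp(-z[3]))        # 0.5 for every example

# Backward, one step
dz3 = a[3] - y                            # nonzero: the errors are nonzero
dW3 = dz3 @ a[2].T / m                    # eq (2): zero, because a^{[2]} = 0
db3 = dz3.mean(axis=1, keepdims=True)     # eq (3): nonzero
da2 = W[3].T @ dz3                        # eq (4): zero, because W^{[3]} = 0
dz2 = da2 * (1.0 - a[2] ** 2)             # eq (1): zero
dW2 = dz2 @ a[1].T / m                    # eq (2): zero
db2 = dz2.mean(axis=1, keepdims=True)     # eq (3): zero
print("dW3 zero?", np.allclose(dW3, 0), "| db3 zero?", np.allclose(db3, 0))
```

Only db3 survives; every other gradient is exactly zero, so gradient descent cannot move the weights.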
Therefore, only the bias in the 3rd layer can be updated; the rest can't, because their gradients are always zero. In other words, our neural network can't learn any further. This is why we don't want to initialize everything to zeros in a multi-layer network.
However, zero initialization doesn't always fail. If we had only the output layer (i.e., plain logistic regression), the following would happen instead:
Forward phase
- a^{[0]} is nonzero. That's fine.
- a^{[1]} = 0 because all weights and biases are zero.
Backward phase

- da^{[1]} is nonzero because the errors are nonzero.
- dz^{[1]} is nonzero by equation (1), since da^{[1]} \ne 0.
- dw^{[1]} is nonzero by equation (2), since a^{[0]} \ne 0.
- db^{[1]} is nonzero by equation (3), since dz^{[1]} \ne 0.
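Here is a small sketch of that single-layer case (the dataset, learning rate, and iteration count are my own illustrative choices): logistic regression starting from all zeros still learns, because dW^{[1]} = dz^{[1]} a^{[0]T}/m depends on the nonzero input a^{[0]}.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200
a0 = rng.normal(size=(2, m))
y = (a0[0:1] + a0[1:2] > 0).astype(float)   # a linearly separable target

W = np.zeros((1, 2))                        # zero-initialized, on purpose
b = np.zeros((1, 1))
lr = 0.5
for _ in range(200):
    z = W @ a0 + b
    a1 = 1.0 / (1.0 + np.exp(-z))           # sigmoid output
    dz = a1 - y
    W -= lr * (dz @ a0.T / m)               # nonzero gradient from step one
    b -= lr * dz.mean(axis=1, keepdims=True)

acc = ((a1 > 0.5) == y).mean()
print("accuracy:", acc)
```

Despite starting from zeros, the weights move immediately and the model fits the data.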
There is a very relevant topic called "symmetry breaking" and I am going to refer you to this excellent post by our mentor Paul.
It is a common practice to initialize the weights randomly and the biases to zero.
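A minimal sketch of that common practice (the helper name and the 0.01 scale are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_out, n_in):
    """Small random weights break symmetry; zero biases are fine."""
    W = rng.normal(size=(n_out, n_in)) * 0.01  # random: each unit starts different
    b = np.zeros((n_out, 1))                   # zeros are harmless here
    return W, b

W1, b1 = init_layer(3, 4)
```

Because the weights are random, different hidden units receive different gradients from the very first step, so they can learn different features.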
Cheers,
Raymond
PS: I hope I didn't make any typos. Please let me know if you spot one. Thank you!
Thanks a lot! I understand now.