I’m doing the weight initialization lab, where we see that initializing the weights to zero causes the network to fail to break symmetry. For context, here’s the text I’m looking at:
The statement is a bit of a simplification. In training we don’t really care about the cost or the predictions directly - what we care about is the errors (y - y_hat), which lead to the gradients.
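For concreteness, here’s roughly what that looks like for the output layer of a sigmoid + cross-entropy network. This is a minimal sketch - the names `A1`, `A2`, `Y` and the helper function are mine, not the lab’s code, and the lab may use the opposite sign convention (y - y_hat rather than y_hat - y):

```python
import numpy as np

def output_layer_grads(A1, A2, Y):
    """Output-layer gradients for sigmoid + cross-entropy.

    Note the cost itself never appears here: the gradients are driven
    entirely by the errors (A2 - Y, i.e. y_hat - y; sign convention varies).
    """
    m = Y.shape[1]
    dZ2 = A2 - Y                                  # the errors
    dW2 = dZ2 @ A1.T / m                          # gradient w.r.t. weights
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m  # gradient w.r.t. bias
    return dW2, db2
```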
And initializing to zero isn’t the real issue - initializing all the weights to any constant value would cause a similar problem.
The issue they’re trying to demonstrate is that if the errors flowing into every hidden unit are identical, then the gradients will all be identical, so every hidden layer unit gets exactly the same update and they all learn exactly the same thing - the symmetry never breaks.
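Here’s a minimal sketch of that symmetry, assuming a one-hidden-layer sigmoid network - the architecture, sizes, and names are illustrative, not the lab’s actual code:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_step(W1, W2, X, Y, lr=0.1):
    """One gradient step for a 1-hidden-layer sigmoid network
    (biases omitted for brevity; they don't affect the symmetry)."""
    m = Y.shape[1]
    A1 = sigmoid(W1 @ X)                # hidden activations
    A2 = sigmoid(W2 @ A1)               # predictions
    dZ2 = A2 - Y                        # the errors drive everything below
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)  # identical row for every hidden unit
    dW1 = dZ1 @ X.T / m
    return W1 - lr * dW1, W2 - lr * dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                   # 3 features, 5 examples
Y = rng.integers(0, 2, size=(1, 5)).astype(float)

# Every weight starts at the same constant - 0.5 here, not just zero.
W1 = np.full((4, 3), 0.5)                     # 4 hidden units
W2 = np.full((1, 4), 0.5)

for _ in range(1000):
    W1, W2 = train_step(W1, W2, X, Y)

print(W1)  # all 4 rows are still identical after 1000 steps
```

Running this prints four identical rows in `W1`: the hidden units received identical gradients at every step, so they never differentiate. The same thing happens starting from zeros or any other constant - the particular value doesn’t matter.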
Again, this is a little bit of a simplification. But I think the key point is that the notebook’s explanation should be framed in terms of the gradients, not the loss.