As an explanation for why the model does not learn anything when the weights are all initialized with zeros, the Programming Assignment (Initialization - Week 1 - Exercise 1 - initialize_parameters_zeros) gives the following:
“As you can see with the prediction being 0.5 whether the actual ( y ) value is 1 or 0 you get the same loss value for both, so none of the weights get adjusted and you are stuck with the same old value of the weights”.
I am not sure I completely follow this train of thought: the loss value is the same for y=1 and y=0, and hence none of the weights get adjusted.
Please correct me if I am wrong, but I think the following explanation for why the weights won’t get adjusted makes more sense: the derivative of the loss with respect to the weight w^l_{j,k} is \delta^l_j a^{l-1}_k, where \delta^l_j is the “error” of the j-th neuron in layer l and a^{l-1}_k is the k-th activation in layer l-1. Since all the weights (and biases) are zero and we’re using the ReLU activation function in the hidden layers, z^{l-1}_k = a^{l-1}_k = 0, so \delta^l_j a^{l-1}_k = 0 as well and w^l_{j,k} won’t get updated during gradient descent.
So I don’t understand how the mere values of the cost function for y=0 and y=1 affect the weight gradients. As far as I understand, the only reason the weights won’t get adjusted is that a^{l-1}_k = 0, since all weights are initialized as zeros and we’re using the ReLU activation function in the hidden layers. Could you help me clarify this?
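To make what I mean concrete, here is a minimal NumPy sketch of a 3 → 4 → 1 network with all-zero weights and biases, ReLU in the hidden layer and sigmoid at the output. The layer sizes and the random data are made up, just to illustrate that every weight gradient comes out zero because the hidden activations are zero:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3, 5)          # 3 input features, 5 examples (made up)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

# Zero initialization for a 3 -> 4 -> 1 network (ReLU hidden, sigmoid output)
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward pass
Z1 = W1 @ X + b1                   # all zeros
A1 = np.maximum(0, Z1)             # ReLU(0) = 0, so A1 is all zeros
Z2 = W2 @ A1 + b2                  # all zeros
A2 = 1 / (1 + np.exp(-Z2))         # sigmoid(0) = 0.5 for every example

# Backward pass (binary cross-entropy)
m = X.shape[1]
dZ2 = A2 - Y                       # non-zero in general
dW2 = dZ2 @ A1.T / m               # zero, because A1 is all zeros
dZ1 = (W2.T @ dZ2) * (Z1 > 0)      # zero, because W2 is zero (and ReLU'(0) = 0)
dW1 = dZ1 @ X.T / m                # zero

print(np.allclose(dW2, 0), np.allclose(dW1, 0))   # True True
```

So the output-layer error dZ2 is non-zero, but it gets multiplied by the zero hidden activations, and nothing propagates back to the first layer either.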
Here is a thread which discusses the zero initialization versus “symmetry breaking” in more detail. It turns out that in the case of Logistic Regression, zero initialization works, but any kind of symmetric initialization (zero or some other constant) does not work in the Neural Network case.
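For contrast, here is a quick toy sketch (made-up data and variable names) of why zero initialization is fine for plain Logistic Regression: the “previous activations” that multiply the error are the inputs X themselves, which are generally not zero.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3, 5)             # 3 features, 5 examples (made up)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

w, b = np.zeros((1, 3)), 0.0          # zero initialization
A = 1 / (1 + np.exp(-(w @ X + b)))    # all predictions are 0.5 at the start
dw = (A - Y) @ X.T / X.shape[1]       # NOT zero: the inputs X play the role of a_prev

print(dw)                             # generally non-zero, so gradient descent moves w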
Cool, thanks a lot for posting this thread! So for each layer l: if we initialize the weights in layer l with some constant value, then even if we used the sigmoid activation function, the activations in layer l would all be equal to each other, and there is no benefit to having more than one neuron in layer l, because each neuron would learn the same thing, in the sense that every row of the weight gradient dW^l would be identical. Correct?
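Here is a rough sketch of that scenario under my own assumptions (constant value 0.5, a 3 → 4 → 1 network, sigmoid everywhere, made-up data): every hidden neuron computes the same activation, and every row of dW1 comes out identical, so every neuron receives the same update.

```python
import numpy as np

np.random.seed(2)
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Constant (non-zero) initialization for a 3 -> 4 -> 1 network, sigmoid everywhere
W1, b1 = np.full((4, 3), 0.5), np.zeros((4, 1))
W2, b2 = np.full((1, 4), 0.5), np.zeros((1, 1))

# Forward pass: every hidden neuron computes exactly the same thing
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)                      # all 4 rows of A1 are identical
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Backward pass
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)    # all 4 rows identical
dW1 = dZ1 @ X.T / m                   # all 4 rows identical -> every neuron gets the same update

print(np.allclose(A1, A1[0]))         # True
print(np.allclose(dW1, dW1[0]))       # True
```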
Your explanation addresses why there is symmetry and why the network only learns a simple function. But it didn’t answer the original poster’s question, which I’m also confused about. I agree that zero initialization won’t work here. But I don’t understand how the reason it doesn’t work is the “same loss value for y=1 and y=0, so none of the weights get adjusted and you are stuck with the same old value of the weights”.
The conclusion is correct, but the reasoning may not be.
Yes, I agree with you that the reasoning part is weak. dA is non-zero at a = 1/2, and so is dZ at z = 0, but dW = 0 since the activation from the previous layer is 0.
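In the course’s notation (as I understand it), for the output layer that is:

dW^{[2]} = \frac{1}{m} dZ^{[2]} (A^{[1]})^T = \frac{1}{m} (A^{[2]} - Y)(A^{[1]})^T = 0, because A^{[1]} = ReLU(0) = 0 when all the weights and biases are zero.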
YES!
I’m saying that I agree and understand that zero initialization doesn’t work. But the reason it doesn’t work is not the one stated in the assignment. The assignment argues that because the loss has the same value whether the label y is 0 or 1, the weights won’t get updated. In fact, all the weights will remain 0, not just stay equal to each other. They are not merely symmetric, they are more doomed than that: they are stuck at the value 0. The explanation in the assignment is too weak.
Sorry, I did not realize that the verbiage being quoted there was from the assignment. I’ll file a bug.
But to your comment, the symmetry is the key point. As I pointed out on that other thread, it’s a special case that tanh (which, like ReLU, is zero at zero) is the activation in this example. If it had been sigmoid, the gradients would not have been zero, but the point is that they would be the same for each neuron (symmetric).
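To illustrate that point with a toy sketch (made-up data and layer sizes, not the assignment’s actual model): with zero initialization, a tanh (or ReLU) hidden layer makes dW2 exactly zero on the first step, while a sigmoid hidden layer makes dW2 non-zero but with every entry equal, i.e. symmetric.

```python
import numpy as np

np.random.seed(3)
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5) * 1.0
m = X.shape[1]

def first_step_gradients(hidden_activation):
    """One forward/backward pass of a 3 -> 4 -> 1 net with all-zero weights."""
    W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
    W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))
    Z1 = W1 @ X + b1
    A1 = hidden_activation(Z1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))   # sigmoid output
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    return dW2

dW2_tanh = first_step_gradients(np.tanh)                      # tanh(0) = 0  -> gradient is 0
dW2_sig  = first_step_gradients(lambda z: 1/(1+np.exp(-z)))   # sigmoid(0) = 0.5 -> non-zero

print(np.allclose(dW2_tanh, 0))   # True: stuck at zero
print(dW2_sig)                    # non-zero, but every entry is the same (symmetric)
```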