As an explanation for why the model does not learn anything when the weights are all initialized with zeros, the Programming Assignment (Initialization - Week 1 - Exercise 1 - initialize_parameters_zeros) gives the following:
“As you can see with the prediction being 0.5 whether the actual ( y ) value is 1 or 0 you get the same loss value for both, so none of the weights get adjusted and you are stuck with the same old value of the weights”.
I am not sure I completely follow this train of thought: the loss value is the same for y=1 and y=0, and hence none of the weights get adjusted.
Please correct me if I am wrong, but I think the following explanation for why the weights won’t get adjusted makes more sense: the derivative of the loss with respect to the weight w^l_{j,k} is \delta^l_j a^{l-1}_k, where \delta^l_j is the “error” of the j-th neuron in layer l and a^{l-1}_k is the k-th activation in layer l-1. Since all the weights (and biases) are zero and we’re using the ReLU activation function in the hidden layers, z^{l-1}_k = a^{l-1}_k = 0, so \delta^l_j a^{l-1}_k = 0 as well and w^l_{j,k} won’t get updated during gradient descent.
So I don’t understand how the mere values of the cost function for y=0 and y=1 affect the weight gradients. As far as I understand, the only reason the weights won’t get adjusted is that a^{l-1}_k = 0, since all weights are initialized as zeros and we’re using the ReLU activation function in the hidden layers. Could you help me clarify this?
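To make what I mean concrete, here is a minimal NumPy sketch of a 3 → 4 → 1 network with all-zero weights and biases, ReLU in the hidden layer and sigmoid at the output. The layer sizes and the random data are made up, just to illustrate that every weight gradient comes out zero because the hidden activations are zero:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3, 5)          # 3 input features, 5 examples (made up)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

# Zero initialization for a 3 -> 4 -> 1 network (ReLU hidden, sigmoid output)
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward pass
Z1 = W1 @ X + b1                   # all zeros
A1 = np.maximum(0, Z1)             # ReLU(0) = 0, so A1 is all zeros
Z2 = W2 @ A1 + b2                  # all zeros
A2 = 1 / (1 + np.exp(-Z2))         # sigmoid(0) = 0.5 for every example

# Backward pass (binary cross-entropy)
m = X.shape[1]
dZ2 = A2 - Y                       # non-zero in general
dW2 = dZ2 @ A1.T / m               # zero, because A1 is all zeros
dZ1 = (W2.T @ dZ2) * (Z1 > 0)      # zero, because W2 is zero (and ReLU'(0) = 0)
dW1 = dZ1 @ X.T / m                # zero

print(np.allclose(dW2, 0), np.allclose(dW1, 0))   # True True
```

So the output-layer error dZ2 is non-zero, but it gets multiplied by the zero hidden activations, and nothing propagates back to the first layer either.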
Here is a thread which discusses the zero initialization versus “symmetry breaking” in more detail. It turns out that in the case of Logistic Regression, zero initialization works, but any kind of symmetric initialization (zero or some other constant) does not work in the Neural Network case.
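For contrast, here is a quick toy sketch (made-up data and variable names) of why zero initialization is fine for plain Logistic Regression: the “previous activations” that multiply the error are the inputs X themselves, which are generally not zero.

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3, 5)             # 3 features, 5 examples (made up)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

w, b = np.zeros((1, 3)), 0.0          # zero initialization
A = 1 / (1 + np.exp(-(w @ X + b)))    # all predictions are 0.5 at the start
dw = (A - Y) @ X.T / X.shape[1]       # NOT zero: the inputs X play the role of a_prev

print(dw)                             # generally non-zero, so gradient descent moves w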
Cool, thanks a lot for posting this thread! So for each layer l: if we initialize the weights in layer l with some constant value, then even if we used the sigmoid activation function, the activations in layer l would all be equal to each other, and there is no benefit to having more than one neuron in layer l, because each neuron would learn the same thing, in the sense that every row of the weight gradient dW^l would be identical. Correct?
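Here is a rough sketch of that scenario under my own assumptions (constant value 0.5, a 3 → 4 → 1 network, sigmoid everywhere, made-up data): every hidden neuron computes the same activation, and every row of dW1 comes out identical, so every neuron receives the same update.

```python
import numpy as np

np.random.seed(2)
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5) * 1.0

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Constant (non-zero) initialization for a 3 -> 4 -> 1 network, sigmoid everywhere
W1, b1 = np.full((4, 3), 0.5), np.zeros((4, 1))
W2, b2 = np.full((1, 4), 0.5), np.zeros((1, 1))

# Forward pass: every hidden neuron computes exactly the same thing
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)                      # all 4 rows of A1 are identical
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Backward pass
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)    # all 4 rows identical
dW1 = dZ1 @ X.T / m                   # all 4 rows identical -> every neuron gets the same update

print(np.allclose(A1, A1[0]))         # True
print(np.allclose(dW1, dW1[0]))       # True
```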
Your explanation addresses why there is symmetry and why the network only learns a simple function. But it didn’t answer the original poster’s question, which I’m also confused about. I agree that zero initialization won’t work here. But I don’t understand how the reason it doesn’t work is the “same loss value for y=1 and y=0, so none of the weights get adjusted and you are stuck with the same old value of the weights”.
The conclusion is correct, but the reasoning may not be.
Yes, I agree with you that the reasoning part is weak. dA is non-zero at a = 1/2, and so is dZ at z = 0, but dW = 0 since the activation from the previous layer is 0.
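In the course’s notation (as I understand it), for the output layer that is:

dW^{[2]} = \frac{1}{m} dZ^{[2]} (A^{[1]})^T = \frac{1}{m} (A^{[2]} - Y)(A^{[1]})^T = 0, because A^{[1]} = ReLU(0) = 0 when all the weights and biases are zero.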
YES!
I’m saying that I agree and understand that zero initialization doesn’t work. But the reason it doesn’t work is not the one stated in the assignment. The assignment argues that because the loss has the same value whether the label y is 0 or 1, the weights won’t get updated. In fact, all the weights will remain 0, not just stay equal to each other. They are not merely symmetric, they are more doomed than that: they are stuck at the value 0. The explanation in the assignment is too weak.
Sorry, I did not realize that the verbiage being quoted there was from the assignment. I’ll file a bug.
But to your comment, the symmetry is the key point. As I pointed out on that other thread, it’s a special case that tanh (which, like ReLU, is zero at zero) is the activation in this example. If it had been sigmoid, the gradients would not have been zero, but the point is that they would be the same for each neuron (symmetric).
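To illustrate that point with a toy sketch (made-up data and layer sizes, not the assignment’s actual model): with zero initialization, a tanh (or ReLU) hidden layer makes dW2 exactly zero on the first step, while a sigmoid hidden layer makes dW2 non-zero but with every entry equal, i.e. symmetric.

```python
import numpy as np

np.random.seed(3)
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5) * 1.0
m = X.shape[1]

def first_step_gradients(hidden_activation):
    """One forward/backward pass of a 3 -> 4 -> 1 net with all-zero weights."""
    W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
    W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))
    Z1 = W1 @ X + b1
    A1 = hidden_activation(Z1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))   # sigmoid output
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    return dW2

dW2_tanh = first_step_gradients(np.tanh)                      # tanh(0) = 0  -> gradient is 0
dW2_sig  = first_step_gradients(lambda z: 1/(1+np.exp(-z)))   # sigmoid(0) = 0.5 -> non-zero

print(np.allclose(dW2_tanh, 0))   # True: stuck at zero
print(dW2_sig)                    # non-zero, but every entry is the same (symmetric)
```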