In the week 3 videos, we learned that the weights shouldn’t be initialized to zeros, or they would all update identically, whereas the parameter b can be zero. But what about the reverse? If we initialize the weights to all zeros and b to random values, the gradients will not be the same, and thus the hidden units will work properly. I think this is okay in theory, yet the model might perform worse in practice because the gradient differences between hidden units are small. Is this the case?
Hi @Lo_Ka_Man,
The equation is wx + b.
In the scenario where we set w to random values and b to 0:
Looking at the equation, the values of w are random and each w will be significantly different from the others, so it doesn’t really matter whether b is 0 or random.
But if we set w to 0 and b to random values, then the weights would be updated identically and won’t differ much from each other even after we add a random value of b.
Does that make sense?
Mubsi
Yes, that’s what I supposed!
Thanks!
Here is a thread which discusses Symmetry Breaking in more detail. The Logistic Regression case does not require it, but real Neural Networks do. It does turn out that you can do as you suggest in the real NN case and either way works. But the standard approach is to initialize the weights randomly and zero the bias values. You can learn either way and Prof Ng doesn’t really discuss the alternative way. It’s just my guess, but I would suspect the reason is you get faster convergence starting with the weights. It might be a fun experiment to pick one of the examples in Week 4 of Course 1 and try training the network both ways and see if you can notice a difference. Let us know if you try this and learn anything interesting. Science!
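For anyone who wants to check the symmetry-breaking claim before running the full assignment, here is a minimal sketch. It assumes a toy 2-layer network (tanh hidden layer, sigmoid output) on made-up data; the function name `train_two_layer`, the data, and the hyperparameters are all hypothetical and not taken from the course notebooks. It compares an all-zeros start with the "W = 0, b random" start and checks whether the rows of W1 (one row per hidden unit) are still identical after a few gradient steps.

```python
import numpy as np

def train_two_layer(W1, b1, W2, b2, X, Y, lr=0.5, steps=5):
    """Plain gradient descent: tanh hidden layer, sigmoid output, cross-entropy cost."""
    m = X.shape[1]
    for _ in range(steps):
        # Forward pass
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
        # Backward pass
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # Parameter update
        W1, b1 = W1 - lr * dW1, b1 - lr * db1
        W2, b2 = W2 - lr * dW2, b2 - lr * db2
    return W1, b1, W2, b2

def rows_identical(W):
    """True if every hidden unit (row of W) has exactly the same weights."""
    return bool((W == W[0]).all())

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 40))             # made-up inputs
Y = (X[0:1] * X[1:2] > 0).astype(float)      # made-up labels
n_h = 4                                      # hypothetical hidden size

# Case 1: everything zero, so the hidden units never become different.
W1_zero, *_ = train_two_layer(
    np.zeros((n_h, 2)), np.zeros((n_h, 1)),
    np.zeros((1, n_h)), np.zeros((1, 1)), X, Y)

# Case 2: W = 0 but b random, so each unit starts with a different activation.
W1_rand_b, *_ = train_two_layer(
    np.zeros((n_h, 2)), rng.standard_normal((n_h, 1)) * 0.01,
    np.zeros((1, n_h)), rng.standard_normal((1, 1)) * 0.01, X, Y)

print("all zeros  -> hidden units identical:", rows_identical(W1_zero))
print("b random   -> hidden units identical:", rows_identical(W1_rand_b))
print(W1_rand_b)
```

In this sketch the random biases do break the symmetry, but the differences between the rows of W1 are tiny after a few steps, which fits the guess that learning is possible this way yet likely to be slow.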
Hi @paulinpaloalto,
I’ve tried different ways of initializing the parameters, based on the programming assignment in week 4. The experiments were run on the 4-layer NN. Here are the results:
- Summary
  Initializing the weights to 0s and randomizing the bias values slows down training significantly (and can even fail to converge). Bigger random bias values or learning rates are also unlikely to help much.
- W=randn*0.01, b=0s
  This is the original (recommended) version; the cost decreases normally.
- W=0s, b=randn*0.01
  This is the reversed version. The cost decreases at an extremely slow pace. Besides, the parameters of the later layers (W3, b3, W4, b4) change only slightly, whereas the rest (W1, b1, W2, b2) stay almost at their initial values. It seems that the gradient is too small and becomes nearly zero after the first two steps of back-propagation.
- W=0s, b=randn*1
  Uses 100x bigger random values for b. Similar result to the previous one.
- W=0s, b=randn*0.01, lr=0.03
  Still the reversed version, but with the learning rate increased from 0.0075 to 0.03. No improvement.
That’s all for my trial.
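One possible explanation for the early layers barely moving can be sketched in a few lines. The code assumes ReLU hidden layers and a sigmoid output; the helpers `init_params` and `gradients`, the layer sizes, and the random data are hypothetical and are not the assignment's functions. Back-propagation multiplies by the transpose of each W on the way down, so when every W starts at zero, only the last layer receives a nonzero gradient on the first step, and the signal has to work its way down one layer per iteration.

```python
import numpy as np

def init_params(layer_dims, mode, scale=0.01, seed=1):
    """mode='w_random': W random, b zero.  mode='b_random': W zero, b random."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        shape_W = (layer_dims[l], layer_dims[l - 1])
        shape_b = (layer_dims[l], 1)
        if mode == "w_random":
            params[f"W{l}"] = rng.standard_normal(shape_W) * scale
            params[f"b{l}"] = np.zeros(shape_b)
        else:
            params[f"W{l}"] = np.zeros(shape_W)
            params[f"b{l}"] = rng.standard_normal(shape_b) * scale
    return params

def gradients(params, X, Y):
    """One forward/backward pass: ReLU hidden layers, sigmoid output."""
    L = len(params) // 2
    m = X.shape[1]
    A, Z = {0: X}, {}
    for l in range(1, L + 1):
        Z[l] = params[f"W{l}"] @ A[l - 1] + params[f"b{l}"]
        A[l] = 1 / (1 + np.exp(-Z[l])) if l == L else np.maximum(0, Z[l])
    grads = {}
    dZ = A[L] - Y  # sigmoid output with cross-entropy cost
    for l in range(L, 0, -1):
        grads[f"dW{l}"] = dZ @ A[l - 1].T / m
        grads[f"db{l}"] = dZ.sum(axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = params[f"W{l}"].T @ dZ   # zero whenever W{l} is zero
            dZ = dA_prev * (Z[l - 1] > 0)      # ReLU derivative
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 50))                  # made-up inputs
Y = (rng.random((1, 50)) > 0.5).astype(float)      # made-up labels
layer_dims = [12, 20, 7, 5, 1]                     # hypothetical 4-layer sizes

for mode in ("w_random", "b_random"):
    g = gradients(init_params(layer_dims, mode), X, Y)
    largest = [float(np.abs(g[f"dW{l}"]).max()) for l in range(1, 5)]
    print(mode, "max |dW| per layer:", largest)
```

With W = 0 and random b, dW1, dW2 and dW3 are exactly zero on the first iteration in this sketch, whatever the scale of b or the learning rate, which is consistent with the last two test cases showing no improvement.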
Very interesting! Thank you for doing all this and sharing your results.
One thing to note is that the “base case” in the notebook (the one that gets to cost = 0.08xxx after 2500 iterations) does not use the simple W = randn * 0.01 initialization, right? That turns out to give very slow convergence, similar to your other test cases. For that reason, the code they provide us uses a more sophisticated strategy called Xavier Initialization that we will learn about in Course 2.
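As a rough preview of the idea, here is a sketch of one common Xavier-style variant, which divides each weight matrix by the square root of the number of inputs to that layer. The function name `init_xavier` and the layer sizes are hypothetical, and this thread does not show whether the notebook's helper uses exactly this form, so treat it as an assumption rather than the course's code.

```python
import numpy as np

def init_xavier(layer_dims, seed=1):
    """Xavier-style init (assumed variant): W ~ N(0, 1) / sqrt(n_prev), b = 0."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev = layer_dims[l - 1]
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], n_prev)) / np.sqrt(n_prev)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

# Hypothetical layer sizes, chosen only to show how the scale adapts per layer.
params = init_xavier([400, 300, 20, 1])
for l in (1, 2, 3):
    print(f"std of W{l}:", float(params[f"W{l}"].std()))
```

Unlike a fixed factor of 0.01, the scale here depends on the layer width, so narrow layers get larger initial weights than wide ones.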
Sorry for the mistake. The initialization function is imported from the script dnn_app_utils_v3.py, and I thought it was exactly the same as the one I implemented myself in the previous assignment. It turns out that dnn_app_utils_v3.py actually initializes the parameters as follows:
I tried the W = randn * 0.01 version and, as you said, it also causes slow convergence.
Thanks for your help!