Randomly initialize parameter b instead of W

Lo_Ka_Man · August 18, 2022, 5:06am

In the Week 3 videos, we learned that the weights shouldn’t be initialized to zeros, or the hidden units would all update identically, whereas the parameter b can be zeros. But what about reversing this? If we initialize the weights to all zeros and b to random values, the gradients will not be the same, and thus the hidden units should work properly. I think this is fine in theory, yet the model might perform worse in practice because the gradients differ only slightly between hidden units. Is this the case?

Mubsi · August 18, 2022, 6:48am

Hi @Lo_Ka_Man,

The equation is wx + b.

In the scenario where we set w as random and b as 0:
In this case, looking at the equation, there will be random values of w and each w will be significantly different from each other, so it doesn’t really matter if b is 0 or random.

But if we set w to 0 and b to random, then the weights would be updated identically, and adding a random value of b would not make them differ much from each other.

Does that make sense?
Mubsi

Lo_Ka_Man · August 18, 2022, 6:58am

Yes, that’s what I supposed!

Thanks!

paulinpaloalto · August 18, 2022, 4:09pm

Here is a thread which discusses Symmetry Breaking in more detail. The Logistic Regression case does not require it, but real Neural Networks do. It does turn out that you can do as you suggest in the real NN case and either way works. But the standard approach is to initialize the weights randomly and zero the bias values. You can learn either way and Prof Ng doesn’t really discuss the alternative way. It’s just my guess, but I would suspect the reason is you get faster convergence starting with the weights. It might be a fun experiment to pick one of the examples in Week 4 of Course 1 and try training the network both ways and see if you can notice a difference. Let us know if you try this and learn anything interesting. Science!
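For anyone who wants to try a small version of this experiment, here is a minimal sketch. It assumes a 2-layer network with tanh hidden units and a sigmoid output trained on synthetic data; the helper names (`train`, `row_spread`, `zeros_except_b1`) and the data are made up for illustration and are not from the course notebooks. It compares all-zero initialization with zero weights plus a random b1, and checks whether the rows of W1 (one row per hidden unit) ever become different.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, b1, W2, b2, X, Y, lr=0.5, steps=200):
    """Gradient descent on a tanh-hidden, sigmoid-output network (cross-entropy loss)."""
    m = X.shape[1]
    for _ in range(steps):
        # Forward pass
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Backward pass
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # Parameter update
        W1, b1 = W1 - lr * dW1, b1 - lr * db1
        W2, b2 = W2 - lr * dW2, b2 - lr * db2
    return W1, b1, W2, b2

def row_spread(W):
    """Largest gap between any two rows; 0.0 means all hidden units are identical."""
    return float((W.max(axis=0) - W.min(axis=0)).max())

# Synthetic, hypothetical data: 2 features, 200 examples, linearly separable labels
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 1.0).astype(float)
n_h = 4  # number of hidden units (arbitrary choice)

def zeros_except_b1(b1):
    """All parameters zero except the hidden-layer bias b1."""
    return np.zeros((n_h, 2)), b1, np.zeros((1, n_h)), np.zeros((1, 1))

W1_all_zero = train(*zeros_except_b1(np.zeros((n_h, 1))), X, Y)[0]
W1_random_b = train(*zeros_except_b1(rng.standard_normal((n_h, 1)) * 0.01), X, Y)[0]

print("row spread of W1, all zeros:     ", row_spread(W1_all_zero))
print("row spread of W1, W=0, random b1:", row_spread(W1_random_b))
```

In this sketch the all-zero network never separates its hidden units (W1 stays exactly zero), while a random b1 alone is enough to make the rows of W1 differ. It says nothing about how fast either version converges.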

Lo_Ka_Man · August 21, 2022, 8:11am

Hi @paulinpaloalto ,

I’ve tried different ways of initializing parameters based on the programming assignment in week 4. The experiments were run on the 4-layer NN. Here are the results:

Summary
Initializing the weights to 0s and randomizing the bias values slows down training significantly (and it can even fail to converge). Bigger random bias values or learning rates are also unlikely to help much.

W=randn*0.01, b=0s
This is the original (recommended) version, in which the cost decreases normally.

[Image: randowW+zerob]

W=0s, b=randn*0.01
This is the reversed version. The cost decreases extremely slowly. Moreover, the parameters of the later layers (W3, b3, W4, b4) change only slightly, whereas the rest (W1, b1, W2, b2) stay almost the same as their initial values. It seems that the gradient is too small and becomes nearly zero after the first two steps of back-propagation.

[Image: zeroW+randomb]

W=0s, b=randn*1
This run uses 100x bigger random values for b. The result is similar to the previous one.

[Image: zeroW+random1b]

W=0s, b=randn*0.01, lr=0.03
Still the reversed version, but with the learning rate increased from 0.0075 to 0.03. No improvement.

[Image: zeroW+randomb+bigLr]
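One way to see why the early layers barely move: with every W at zero, back-propagation computes dA[l-1] = W[l]^T dZ[l], which is exactly zero, so on the first iteration only the last layer's weights receive a gradient, and the gradient can reach only one layer further back per iteration. Below is a minimal sketch of this. It assumes tanh hidden layers (chosen for simplicity), made-up layer sizes and synthetic data; the function name `gradients` is hypothetical and none of this is taken from the assignment code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(params, X, Y):
    """One forward/backward pass: tanh hidden layers, sigmoid output, cross-entropy loss."""
    L = len(params) // 2
    m = X.shape[1]
    # Forward pass, keeping every activation (A[0] is the input)
    A = [X]
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A[-1] + params[f"b{l}"]
        A.append(sigmoid(Z) if l == L else np.tanh(Z))
    # Backward pass
    grads = {}
    dZ = A[L] - Y
    for l in range(L, 0, -1):
        grads[f"dW{l}"] = dZ @ A[l - 1].T / m
        grads[f"db{l}"] = dZ.sum(axis=1, keepdims=True) / m
        if l > 1:
            # dA_prev is exactly zero whenever W_l is zero, whatever dZ is
            dA_prev = params[f"W{l}"].T @ dZ
            dZ = dA_prev * (1.0 - A[l - 1] ** 2)
    return grads

rng = np.random.default_rng(1)
layer_dims = [10, 20, 7, 5, 1]  # hypothetical sizes for a 4-layer network
L = len(layer_dims) - 1
X = rng.standard_normal((layer_dims[0], 50))
Y = (X[0:1] > 0.3).astype(float)

# Reversed initialization: W = 0s, b = randn * 0.01
params = {}
for l in range(1, L + 1):
    params[f"W{l}"] = np.zeros((layer_dims[l], layer_dims[l - 1]))
    params[f"b{l}"] = rng.standard_normal((layer_dims[l], 1)) * 0.01

lr = 0.0075
history = []  # history[k][l-1] is the norm of dW_l at iteration k+1
for step in range(4):
    grads = gradients(params, X, Y)
    history.append([float(np.linalg.norm(grads[f"dW{l}"])) for l in range(1, L + 1)])
    for key in params:
        params[key] = params[key] - lr * grads["d" + key]

for step, norms in enumerate(history, start=1):
    print(f"step {step}:", ["%.1e" % n for n in norms])
```

Each printed row lists the norms of dW1 through dW4. In this sketch dW1 stays exactly zero until the fourth iteration, and when it does become nonzero it is a product of several tiny weights.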

That’s all for my trial

paulinpaloalto · August 21, 2022, 3:54pm

Very interesting! Thank you for doing all this and sharing your results.

One thing to note is that the “base case” in the notebook (the one that gets to cost = 0.08xxx after 2500 iterations) does not use the simple W = randn * 0.01 initialization, right? That turns out to give very slow convergence, similar to your other test cases. For that reason, the code they provide uses a more sophisticated strategy called Xavier Initialization, which we will learn about in Course 2.

Lo_Ka_Man · August 23, 2022, 3:45am

Sorry for the mistake. The initialization function is imported from the script dnn_app_utils_v3.py, and I thought it was exactly the same as the one I implemented myself in the previous assignment. It turns out that dnn_app_utils_v3.py actually initializes the parameters as follows:

I tried the W = randn * 0.01 version and, as you said, it also causes slow convergence.
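To illustrate the difference between the two scalings, here is a rough sketch. It assumes that "Xavier-style" means multiplying standard normal draws by 1/sqrt(n_prev), where n_prev is the size of the previous layer; this is not the code from dnn_app_utils_v3.py, and the function names, layer sizes and data are hypothetical. It compares the size of the signal that reaches the output layer under each scheme.

```python
import numpy as np

def init_params(layer_dims, scheme, seed=0):
    """scheme='small': W = randn * 0.01; scheme='scaled': W = randn / sqrt(n_prev). b = 0s."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n = layer_dims[l - 1], layer_dims[l]
        scale = 0.01 if scheme == "small" else 1.0 / np.sqrt(n_prev)
        params[f"W{l}"] = rng.standard_normal((n, n_prev)) * scale
        params[f"b{l}"] = np.zeros((n, 1))
    return params

def last_linear_output(params, X):
    """Forward pass with ReLU hidden layers; returns the final pre-activation Z."""
    L = len(params) // 2
    A = X
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]
        A = np.maximum(Z, 0.0)
    return Z

layer_dims = [100, 20, 7, 5, 1]  # hypothetical sizes
X = np.random.default_rng(42).standard_normal((layer_dims[0], 50))

# Same seed for both schemes, so only the scaling differs
small = init_params(layer_dims, "small")
scaled = init_params(layer_dims, "scaled")

small_mag = float(np.abs(last_linear_output(small, X)).mean())
scaled_mag = float(np.abs(last_linear_output(scaled, X)).mean())
print("mean |Z_L| with randn * 0.01:         ", small_mag)
print("mean |Z_L| with randn / sqrt(n_prev): ", scaled_mag)
```

With the 0.01 factor the signal shrinks at every layer, so the output barely depends on the input at the start of training, which is one plausible reason for the slow convergence.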

Thanks for your help!
