Hi everyone,
In exercise 5, it’s suggested to multiply the randomly generated weights by a factor of 10. On the other hand, in the first course of the specialization, smaller factors like 0.01 were recommended when activation functions like sigmoid are involved; the stated reason was to keep the weights small so that we stay in the region of such activation functions where the gradient is large. So I played around with the factor in that exercise and, to my surprise, smaller factors led to worse results:
Factor 10: accuracy 0.83 on the train set, 0.86 on the test set
Factor 0.1: accuracy 0.6 on the train set, 0.57 on the test set
Factor 0.01: accuracy 0.4633333333333333 on the train set, 0.48 on the test set
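For reference, this is roughly what I changed (just a sketch: the function name and the `scale` argument are mine, I only parameterized the factor that the lab hard-codes in its random-initialization function):

```python
import numpy as np

def initialize_parameters_scaled(layers_dims, scale, seed=3):
    # Same idea as the lab's random initialization, but with the hard-coded
    # multiplier replaced by a `scale` argument so it can be varied.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer
    for l in range(1, L):
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * scale
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# The three runs above, with illustrative layer sizes:
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=10)
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=0.1)
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=0.01)
```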
It’s great that you are trying experiments like this. You always learn something interesting when you apply the techniques here. It turns out that there is no “one size fits all” silver bullet initialization solution that works well in all cases. It all depends on a) the nature of your data and b) the structure of the model you are training. As you listen to everything that Prof Ng says in Course 2, it is a recurring theme that it’s all about experimentation to figure out what will work in any particular case. He is giving us a tour of the various methods that are available and giving us guidance about how to navigate all these choices in a systematic way.
Another type of experiment you can try is to go back to the Course 1 Week 4 application exercise and notice that they actually used Xavier Initialization in the “deep” case there. Try the simple init algorithm with randn * 0.01 that they gave us in the Step by Step assignment and you’ll see that convergence is terrible. You could then try larger multipliers there to see whether the problem is just the small size, or whether it is the per-layer size attenuation that you get from the Xavier algorithm.
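To make that comparison concrete, here is a minimal sketch of the two schemes side by side (the function name and the `method` argument are just for illustration, not the assignment’s actual code):

```python
import numpy as np

def initialize_parameters(layers_dims, method="xavier", seed=1):
    # "simple": W[l] = randn * 0.01 -- the same small constant at every layer.
    # "xavier": W[l] = randn * sqrt(1 / n_prev) -- the factor shrinks with the
    #           size of the previous layer, which is the per-layer attenuation
    #           mentioned above.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        factor = 0.01 if method == "simple" else np.sqrt(1.0 / n_prev)
        parameters["W" + str(l)] = np.random.randn(n_curr, n_prev) * factor
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters
```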
Thanks for your reply. It confirms my initial thought that it depends on both the data and the network architecture. I was wondering whether there are any intuitions one could use as a rough guide for where to start. For example, consider a network whose output layer uses a tanh or sigmoid activation function while the hidden layers use ReLU. I was thinking that maybe the deeper the network, the larger the weights could be initialized, since the ReLU contributions would dominate in such a case.
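Just to make the idea concrete, something like this is what I had in mind (purely my own sketch to experiment with; the mix of He scaling for the ReLU hidden layers and Xavier scaling for the output layer is my assumption, not something from the course):

```python
import numpy as np

def initialize_mixed(layers_dims, seed=2):
    # Hypothetical scheme: He scaling for the ReLU hidden layers,
    # Xavier scaling for the final sigmoid/tanh layer.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        if l < L - 1:
            factor = np.sqrt(2.0 / n_prev)  # hidden ReLU layer -> He
        else:
            factor = np.sqrt(1.0 / n_prev)  # sigmoid/tanh output layer -> Xavier
        parameters["W" + str(l)] = np.random.randn(n_curr, n_prev) * factor
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters
```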
I’m aware of the fact that there is no silver bullet and I’ll keep experimenting.
The assignment asks you to multiply the W parameters by 10 as an experiment, to showcase how starting with values that are too large in the W params is not ideal. If you jump to the end of the lab, you’ll read in the summary:
- Different initializations lead to very different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Resist initializing to values that are too large!
- He initialization works well for networks with ReLU activations
The 3rd conclusion reinforces the notion that very large W params are not desirable.
And as to why your experiment with smaller factors led to worse results: as @paulinpaloalto clearly noted, there is no one-size-fits-all solution, so experimentation is the key.