Hi everyone,
In exercise 5, it’s suggested to multiply the randomly generated weights by a factor of 10. On the other hand, in the first course of the specialization, smaller factors like 0.01 were recommended when activation functions like sigmoid are involved; the stated reason was to keep the weights small so that we stay in the region of such activation functions where the gradient is large. So I played around with the factor in that exercise and, to my surprise, smaller factors led to worse results:
Factor 10: accuracy 0.83 on the train set, 0.86 on the test set
Factor 0.1: accuracy 0.6 on the train set, 0.57 on the test set
Factor 0.01: accuracy 0.4633333333333333 on the train set, 0.48 on the test set
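For reference, this is roughly what I changed (just a sketch: the function name and the `scale` argument are mine, I only parameterized the factor that the lab hard-codes in its random-initialization function):

```python
import numpy as np

def initialize_parameters_scaled(layers_dims, scale, seed=3):
    # Same idea as the lab's random initialization, but with the hard-coded
    # multiplier replaced by a `scale` argument so it can be varied.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer
    for l in range(1, L):
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * scale
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# The three runs above, with illustrative layer sizes:
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=10)
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=0.1)
# parameters = initialize_parameters_scaled([2, 10, 5, 1], scale=0.01)
```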
It’s great that you are trying experiments like this. You always learn something interesting when you apply the techniques here. It turns out that there is no “one size fits all” silver bullet initialization solution that works well in all cases. It all depends on a) the nature of your data and b) the structure of the model you are training. As you listen to everything that Prof Ng says in Course 2, it is a recurring theme that it’s all about experimentation to figure out what will work in any particular case. He is giving us a tour of the various methods that are available and giving us guidance about how to navigate all these choices in a systematic way.
Another type of experiment you can try is to go back to the Course 1 Week 4 application exercise and notice that they actually used Xavier Initialization in the “deep” case there. Try the simple init algorithm with randn * 0.01 that they gave us in the Step by Step assignment and you’ll see that convergence is terrible. You could then try larger multipliers there to see whether the problem is just the small size, or whether it is the per-layer size attenuation that you get from the Xavier algorithm.
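To make that comparison concrete, here is a minimal sketch of the two schemes side by side (the function name and the `method` argument are just for illustration, not the assignment’s actual code):

```python
import numpy as np

def initialize_parameters(layers_dims, method="xavier", seed=1):
    # "simple": W[l] = randn * 0.01 -- the same small constant at every layer.
    # "xavier": W[l] = randn * sqrt(1 / n_prev) -- the factor shrinks with the
    #           size of the previous layer, which is the per-layer attenuation
    #           mentioned above.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        factor = 0.01 if method == "simple" else np.sqrt(1.0 / n_prev)
        parameters["W" + str(l)] = np.random.randn(n_curr, n_prev) * factor
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters
```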
Thanks for your reply. It confirms my initial thought that it depends on both the data and the network architecture. I was wondering whether there are any intuitions one could use as a rough guide for where to start. For example, consider a network whose output layer uses a tanh or sigmoid activation function while the hidden layers use ReLU. I was thinking that maybe the deeper the network, the larger the weights could be initialized, since the ReLU contributions would dominate in such a case.
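Just to make the idea concrete, something like this is what I had in mind (purely my own sketch to experiment with; the mix of He scaling for the ReLU hidden layers and Xavier scaling for the output layer is my assumption, not something from the course):

```python
import numpy as np

def initialize_mixed(layers_dims, seed=2):
    # Hypothetical scheme: He scaling for the ReLU hidden layers,
    # Xavier scaling for the final sigmoid/tanh layer.
    np.random.seed(seed)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        n_prev, n_curr = layers_dims[l - 1], layers_dims[l]
        if l < L - 1:
            factor = np.sqrt(2.0 / n_prev)  # hidden ReLU layer -> He
        else:
            factor = np.sqrt(1.0 / n_prev)  # sigmoid/tanh output layer -> Xavier
        parameters["W" + str(l)] = np.random.randn(n_curr, n_prev) * factor
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters
```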
I’m aware of the fact that there is no silver bullet and I’ll keep experimenting.
The assignment asks you to multiply the W parameters by 10 as an experiment, to showcase how starting with values that are too large in the W params is not ideal. If you jump to the end of the lab, you’ll read in the summary:
- Different initializations lead to very different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Resist initializing to values that are too large!
- He initialization works well for networks with ReLU activations
The 3rd conclusion reinforces the notion that very large W params are not desirable.
And as to why your experiment with smaller factors led to worse results: as @paulinpaloalto clearly noted, there is no one-size-fits-all solution, so experimentation is the key.