Weight Initialisation - can random be better than He?


In week 1’s initialisation programming exercise we are shown that setting the random weights too large (using a scale factor of 10) leads to poor performance. We are then shown He initialisation, which performs much better.

However, if you remove the scale factor of 10 and draw the weights from a standard normal distribution, the performance on this dataset is marginally better than with He initialisation.
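For concreteness, here is a minimal sketch of the three schemes being compared; the function name, `layer_dims` argument, and seed are my own illustration, not the exercise's exact code:

```python
import numpy as np

def initialize_weights(layer_dims, method="he", seed=0):
    """Sketch of the three initialisation schemes discussed above.

    layer_dims: e.g. [n_x, n_h, n_y] -- sizes of each layer.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        shape = (layer_dims[l], layer_dims[l - 1])
        if method == "large_random":
            # Scale factor of 10, as in the exercise -> poor performance
            params[f"W{l}"] = rng.standard_normal(shape) * 10
        elif method == "random":
            # Plain standard normal, no scaling
            params[f"W{l}"] = rng.standard_normal(shape)
        elif method == "he":
            # He initialisation: std = sqrt(2 / n_prev)
            params[f"W{l}"] = rng.standard_normal(shape) * np.sqrt(2 / layer_dims[l - 1])
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```

The only difference between the three is the standard deviation of the weights; the biases are zero in all cases.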

Random cost:

He cost:

Is He preferred because it generally performs better than random initialisation, or is the takeaway here that we should try both and see what works best on our data?


It depends on the activation functions being used. If you want to read about the theoretical justification for why Xavier initialization is best for tanh, the DeepLearning.ai team has written a great article on the topic:

He initialization for ReLU activation functions follows the same line of reasoning.
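To make the difference concrete, here is a short sketch of the two variance choices (the variable names and layer sizes are my own, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_curr = 500, 300  # fan-in and fan-out of one layer

# Xavier (suited to tanh): Var(W) = 1 / n_prev, keeping the variance of
# activations roughly constant across layers.
W_xavier = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1.0 / n_prev)

# He (suited to ReLU): Var(W) = 2 / n_prev. The extra factor of 2
# compensates for ReLU zeroing out roughly half of its inputs.
W_he = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
```

So the two schemes differ only by a constant factor in the variance, chosen to match the activation function's effect on the signal.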