Hello everybody,
I see that in the assignment where we build a NN from scratch, we randomly initialize the weights and then multiply them by 0.01.
I understand the randomization: it’s to break the symmetry.
I don’t understand why we multiply them by a small number (0.01). Is it to make the weights converge faster? If so, why?
Thank you,
Riccardo
Hello @Riccardo_Andreoni
we multiply the initial weights by 0.01 so that Z starts out close to zero, which is where activations like sigmoid and tanh have their steepest slope, so the gradients of g(Z) are large and learning proceeds quickly.
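As a minimal sketch (with hypothetical layer sizes n_x and n_h, not the assignment’s actual values), you can see how small random weights keep the entries of Z close to zero:

```python
import numpy as np

np.random.seed(1)
n_x, n_h = 4, 3                  # hypothetical layer sizes, for illustration

# Random values break symmetry; the 0.01 factor keeps the weights small
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

X = np.random.randn(n_x, 1)      # one made-up input example
Z1 = np.dot(W1, X) + b1          # entries near 0, where sigmoid/tanh have
print(Z1)                        # their steepest slope, so gradients are large
```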
For more details, please see Symmetry Breaking versus Zero Initialization.
regards
Jenitta
It turns out that there are a number of different algorithms for random initialization, and there is no single “silver bullet” version that works best in all cases. If you look at the provided utility routines in the C1 Week 4 Assignment 2, you’ll see that they needed a more sophisticated algorithm called Xavier Initialization, which we will learn about in Course 2 of this series. You should go back and try using the “multiply by 0.01” method from the Step by Step assignment in the L-layer case and watch how much worse the convergence is.
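As a rough sketch of the idea (this is not the exact utility routine from the assignment), Xavier-style initialization divides each layer’s random weights by the square root of the previous layer’s size instead of multiplying by a fixed 0.01:

```python
import numpy as np

def initialize_parameters_xavier(layer_dims, seed=3):
    # Xavier-style scaling: randn / sqrt(n_prev) for each layer
    # (a sketch, not the course's exact utility function)
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    / np.sqrt(layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# Example: made-up layer sizes for a 4-layer network on 64x64x3 inputs
params = initialize_parameters_xavier([12288, 20, 7, 5, 1])
```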
The general answer to the question is that people have tried lots of different alternatives, and it turns out that smaller values generally work better. Note that one issue with larger values is that they can produce large absolute values of the linear output Z, which can end up “saturating” the sigmoid function. Even with 64-bit floating point, it’s pretty easy to get a z value that causes sigmoid to round to exactly 1, which makes the cost function return NaN. All it takes to hit that is z > 36.
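Here is a quick numpy demonstration of that saturation effect (the sigmoid helper and the label value below are generic illustrations, not the assignment’s exact code):

```python
import numpy as np

def sigmoid(z):
    # standard logistic function
    return 1.0 / (1.0 + np.exp(-z))

# In float64, exp(-37) is smaller than half of machine epsilon,
# so 1 + exp(-37) rounds to 1.0 and sigmoid returns exactly 1.0
a = sigmoid(np.array([37.0]))
print(a == 1.0)   # [ True]

# Cross-entropy cost: with a == 1.0, log(1 - a) is log(0) = -inf,
# and the term (1 - y) * log(1 - a) becomes 0 * (-inf) = NaN for y = 1
y = np.array([1.0])
cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
print(cost)       # nan, with a RuntimeWarning about log(0)
```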