When we initialize parameters, we were told to multiply W by 0.01 so that Z starts out small and g(Z) lands in a good region of the activation.
I had a go at converting this week’s practice exercise into a program that runs on my own computer and noticed that my model was not training at all. I fixed this by changing update_parameters to update the parameter dictionary passed into the function rather than making a copy of it.
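In case it helps anyone else, the fix looked roughly like this (a sketch of the idea, not my exact code):

```python
def update_parameters(parameters, grads, learning_rate):
    # Update the dictionary that was passed in, rather than a deep copy.
    # With a copy, my calling code never saw the updated weights, so
    # the model appeared not to train at all.
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters
```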
Once it was training, it still trained very slowly, so I took a look at the code in the Jupyter notebook and saw that it divides W by the square root of the number of nodes in the previous layer instead of multiplying by 0.01.
/ np.sqrt(layer_dims[l-1])
was the division that was used. Is there a reason my code would have trained so slowly using * 0.01 instead, and why is this divisor used in its place?
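For context, the initialization loop in the notebook looked roughly like this (reconstructed from memory, so the details may differ; the layer sizes are the ones I remember from the assignment):

```python
import numpy as np

layer_dims = [12288, 20, 7, 5, 1]  # input size plus the 4 layers, as I recall
parameters = {}
for l in range(1, len(layer_dims)):
    # scale by 1/sqrt(fan-in) rather than by 0.01
    parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1])
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
```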
Thanks,
Tom
This is an interesting question! It turns out that initialization matters. I think they didn’t make a big point of this in this exercise for a couple of reasons:
- There are just too many potential things to discuss, so they are limiting the complexity here in Course 1 and saving that topic for Course 2.
- If they had pointed out that they were using a different algorithm, it would call attention to the fact that the second assignment in Week 4 provides you with worked solutions for the first assignment. I think they were hoping that people would naively assume they could just call their functions from the other notebook. That turns out not to be technically possible, so they needed to include a “utility” Python file with those functions.
It just turns out that with the particular 4-layer architecture they chose and the particular training data they use here, the simple initialization algorithm they taught us in the Step by Step exercise just doesn’t work very well. So they ended up using a more sophisticated algorithm called He Initialization, which will be covered in Week 1 of Course 2 in this series. And then they didn’t mention it, probably for some combination of the reasons I suggested above.
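For reference, He Initialization in its usual form scales each Gaussian weight matrix by sqrt(2 / fan-in); the notebook’s 1/sqrt(fan-in) divisor is a closely related variant. A minimal sketch, with illustrative names:

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    # Sketch of He Initialization: scale Gaussian weights by sqrt(2 / fan-in)
    # so activations keep a healthy variance through ReLU layers.
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l-1])
                                    * np.sqrt(2.0 / layer_dims[l-1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```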
As you will learn in more detail in Course 2, the selection of initialization algorithm is a “hyperparameter”: a choice the system designer needs to make, not something that can be learned automatically. It turns out that there is no one magic “silver bullet” solution that works best in all cases. Sometimes a simple scheme like a Gaussian distribution times a small multiplier works just fine, but in other cases you need to try alternatives like Xavier or He Initialization, or yet other choices. Please stay tuned to hear what Prof Ng says about this in Course 2.
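To make the “hyperparameter” framing concrete, here is a sketch of how the schemes mentioned above differ only in a scaling factor (the formulas as they are usually quoted; Course 2 covers when each is appropriate):

```python
import numpy as np

def init_weight(n_curr, n_prev, scheme="he"):
    # The initialization scheme is a design choice you can switch and compare.
    if scheme == "simple":
        return np.random.randn(n_curr, n_prev) * 0.01                   # small Gaussian
    if scheme == "xavier":
        return np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)  # tanh-friendly
    if scheme == "he":
        return np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)  # ReLU-friendly
    raise ValueError("unknown scheme: " + scheme)
```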