Week 4 - initialize_parameters_deep - W initialisation redefined for Exercise 2

I created the NN model on my own, and when doing the subsequent exercise I noticed a difference in the output. This was due to a new definition being used for the initialize_parameters_deep function.

Instead of W being initialised with:

  • random numbers * 0.01

they are instead initialised with:

  • random numbers / the square root of the previous layer's dimension

Is this a common practice, and is there any clear/intuitive rationale behind this change?

Full initialisation definition:

After the change:
parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1])

Originally:
parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01

NB: I extracted the function definition using:

import inspect
print(inspect.getsource(initialize_parameters_deep))
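For anyone comparing, here is a minimal sketch of how that scaling fits into the loop of initialize_parameters_deep. This is my own reconstruction for illustration (the scaled flag is mine, not part of the course code); only the scaling factor on W differs between the two versions:

import numpy as np

def initialize_parameters_deep(layer_dims, scaled=True):
    # layer_dims: list of layer sizes, e.g. [n_x, n_h1, ..., n_y]
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        if scaled:
            # scaled initialisation used by the application notebook
            parameters['W' + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                        / np.sqrt(layer_dims[l - 1]))
        else:
            # simple initialisation from the Step by Step exercise
            parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        # biases are initialised to zero in both versions
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters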

Yes, that is a good observation. It turns out they are using a more sophisticated initialization algorithm called Xavier Initialization that we will learn about in Course 2 of this series. The reason they had to do that is that the simple version they had us build in the previous exercise just doesn't work well at all with this model and dataset. Try it and watch what happens: the convergence is terrible. It turns out that the choice of initialization algorithm is an important "hyperparameter", meaning a choice you need to make. There is no one "silver bullet" solution that works best in all cases. Prof Ng will explain this in much more detail in Course 2, so please stay tuned for that. There's just too much other stuff to cover here in Course 1.

As to why they did not mention this in the notebook, I'm not sure, but my guess is they didn't want to reveal that they had given you correct solutions to all the functions in the Step by Step exercise. Just my theory :nerd_face: … But perhaps the simpler reason is what I alluded to before: there's just too much new material to cover in one course, so they are saving it for later. Of course, if you had used their sophisticated init code, it would have failed the grader in the Step by Step exercise. :laughing:
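To see intuitively what the scaling does, here is a small stand-alone experiment (my own toy example, not from the course): push random inputs through a stack of ReLU layers and watch the size of the activations. With * 0.01 the signal shrinks towards zero at every layer, whereas dividing by sqrt(n_prev) keeps it at roughly the same scale:

import numpy as np

np.random.seed(0)
layer_dims = [100, 80, 60, 40, 20]          # arbitrary layer sizes for the demo
A0 = np.random.randn(layer_dims[0], 1000)   # 1000 random "examples"

for label, scale in [("* 0.01", lambda n_prev: 0.01),
                     ("/ sqrt(n_prev)", lambda n_prev: 1.0 / np.sqrt(n_prev))]:
    print(label)
    A = A0
    for l in range(1, len(layer_dims)):
        W = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale(layer_dims[l - 1])
        A = np.maximum(0, W @ A)            # ReLU layer with zero biases
        print(f"  layer {l}: std of activations = {A.std():.2e}")

With the * 0.01 version the activations (and hence the gradients) are vanishingly small by the last layer, which matches the terrible convergence described above.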


Thank you so much for the quick reply, Paul! I had suspected it was something along those lines, with this modified initialisation technique effectively acting as another hyperparameter, and it's great to get confirmation of this.

The course is really fantastic, so once I've finished my other courses I'll be sure to come back here and finish off the other modules!

Having experimented a little, I can't wait to start exploring all the methods to select hyperparameters, as the practice I did on other datasets really highlighted how difficult it is to select them well!

Yes, how to make and evaluate hyperparameter choices in a systematic fashion will be a major focus of the first two weeks of Course 2 and essentially all of Course 3, although Course 3 also focusses a bit more on the data side of things.

I've tried to build the NN on my own too and was scratching my head because of this. But I've found out about this in dnn_app_utils_v3.py.

True. With the original initialization, the cost gets 'stuck' at around 0.64 from iteration 600 onwards. I was amazed to see such a huge impact from seemingly such a small change. Butterfly effect! So glad to see this has been discussed already.
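For anyone who wants to reproduce that plateau without the course's dataset, below is a self-contained toy comparison (my own synthetic data and plain-numpy training loop, not the assignment's L_layer_model): the same 4-layer ReLU/sigmoid network is trained twice, once with each initialization. You should see the * 0.01 run stay near 0.69 (the cost of predicting 0.5 for everything) while the scaled run keeps improving.

import numpy as np

# synthetic binary-classification data (hypothetical, not the course's cat/non-cat set)
np.random.seed(1)
X = np.random.randn(20, 500)                                  # 20 features, 500 examples
Y = (np.sum(X[:5], axis=0, keepdims=True) > 0).astype(float)  # synthetic labels

layer_dims = [20, 64, 32, 16, 1]

def init(scaled):
    np.random.seed(3)                          # same random draws for a fair comparison
    p = {}
    for l in range(1, len(layer_dims)):
        scale = 1.0 / np.sqrt(layer_dims[l - 1]) if scaled else 0.01
        p['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        p['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return p

def train(scaled, iters=1001, lr=0.1):
    p = init(scaled)
    L = len(layer_dims) - 1
    m = X.shape[1]
    for i in range(iters):
        # forward pass: ReLU hidden layers, sigmoid output layer
        A, cache = X, []
        for l in range(1, L + 1):
            Z = p['W' + str(l)] @ A + p['b' + str(l)]
            cache.append((A, Z))
            A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
        cost = -np.mean(Y * np.log(A + 1e-8) + (1 - Y) * np.log(1 - A + 1e-8))
        if i % 200 == 0:
            print(f"  iteration {i:4d}  cost {cost:.4f}")
        # backward pass with gradient-descent updates
        dZ = A - Y                             # gradient of the cost w.r.t. Z at the output
        for l in range(L, 0, -1):
            A_prev = cache[l - 1][0]
            dW = dZ @ A_prev.T / m
            db = np.sum(dZ, axis=1, keepdims=True) / m
            if l > 1:
                Z_prev = cache[l - 2][1]                       # pre-activation of layer l-1
                dZ = (p['W' + str(l)].T @ dZ) * (Z_prev > 0)   # ReLU derivative
            p['W' + str(l)] -= lr * dW
            p['b' + str(l)] -= lr * db

print("* 0.01 initialization:")
train(scaled=False)
print("/ sqrt(n_prev) initialization:")
train(scaled=True)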