I built the NN model on my own, and when doing the exercise afterwards, I noticed a difference in the output. This was due to a new definition being used for the initialize_parameters_deep function.
Instead of W being initialised with:
random numbers * 0.01
the weights are now initialised with:
random numbers / the sqrt of the previous layer's dimension.
Is this a common practice / is there any clear/intuitive rationale behind this change?
full initialisation definition:
after change: parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1])
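To make the comparison concrete, here is a minimal sketch of the two schemes side by side. The function name `init_weights`, the `scheme` argument, the seeded generator, the zero biases, and the layer sizes are all assumptions made for illustration; this is not the course's actual `initialize_parameters_deep`.

```python
import numpy as np

def init_weights(layer_dims, scheme="scaled", seed=0):
    """Hypothetical helper: build W and b for each layer under one of two schemes."""
    rng = np.random.default_rng(seed)  # seeded generator, an assumption for reproducibility
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        W = rng.standard_normal((n_curr, n_prev))
        if scheme == "small":
            W = W * 0.01              # original: random numbers * 0.01
        elif scheme == "scaled":
            W = W / np.sqrt(n_prev)   # changed: random numbers / sqrt(previous layer size)
        else:
            raise ValueError("scheme must be 'small' or 'scaled'")
        parameters["W" + str(l)] = W
        parameters["b" + str(l)] = np.zeros((n_curr, 1))  # zero biases (assumption)
    return parameters

# Illustrative layer sizes, not taken from the assignment.
layer_dims = [1000, 50, 20, 1]
small = init_weights(layer_dims, "small")
scaled = init_weights(layer_dims, "scaled")

for l in range(1, len(layer_dims)):
    key = "W" + str(l)
    print(key, "small std:", small[key].std(), "scaled std:", scaled[key].std())
```

With the fixed 0.01 factor every layer gets the same weight spread, whereas the scaled version gives wider layers proportionally smaller weights.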
Yes, that is a good observation. It turns out they are using a more sophisticated initialization algorithm called Xavier Initialization that we will learn about in Course 2 of this series. The reason they had to do that is that the simple version they had us build in the previous exercise just doesn't work well at all with this model and dataset. Try it and watch what happens. The convergence is terrible. It turns out that the choice of initialization algorithm is an important "hyperparameter", meaning a choice you need to make. There is no one "silver bullet" solution that works the best in all cases. Prof Ng will explain this in much more detail in Course 2, so please stay tuned for that. There's just too much other stuff to cover here in Course 1. As to why they did not mention this in the notebook, I'm not sure, but my guess is they didn't want to reveal that they had given you correct solutions to all the functions in the Step by Step exercise. Just my theory ... But perhaps the simpler reason is what I alluded to before: there's just too much new material to cover in one course so they are saving it for later. Of course if you had used their sophisticated init code, it would have failed the grader in the Step by Step exercise.
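One way to get some intuition for why the scaling matters: each pre-activation is a sum over the previous layer's units, so its spread depends on both the weight scale and the layer width. The sketch below pushes random inputs through a few ReLU layers under each scheme and compares the spread of the final pre-activations. The helper `final_preactivation_std`, the layer sizes, and the random input data are all assumptions for illustration; this is not the course's model or dataset.

```python
import numpy as np

def final_preactivation_std(layer_dims, scale_fn, seed=1):
    """Hypothetical helper: forward random data through ReLU layers, return std of the last Z."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((layer_dims[0], 200))  # 200 random examples (assumption)
    Z = A
    for l in range(1, len(layer_dims)):
        n_prev = layer_dims[l - 1]
        W = rng.standard_normal((layer_dims[l], n_prev)) * scale_fn(n_prev)
        Z = W @ A               # linear step, biases left at zero
        A = np.maximum(Z, 0)    # ReLU activation
    return Z.std()

# Illustrative layer sizes, not taken from the assignment.
dims = [500, 100, 100, 100, 100]
small_std = final_preactivation_std(dims, lambda n_prev: 0.01)
scaled_std = final_preactivation_std(dims, lambda n_prev: 1.0 / np.sqrt(n_prev))

print("std of last-layer Z with * 0.01:        ", small_std)
print("std of last-layer Z with / sqrt(n_prev):", scaled_std)
```

In this toy setup the fixed 0.01 factor shrinks the signal at every layer, while dividing by sqrt(n_prev) keeps it at a roughly comparable scale from layer to layer. A shrinking signal also tends to mean small gradients, which is one plausible reading of the slow convergence described above.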
Thank you so much for the quick reply Paul! I had suspected it was something along those lines with this modified initialisation technique effectively acting as another hyperparameter, and it's great to get a confirmation of this.
The course is really fantastic, so once I've finished my other courses I'll be sure to come back here and finish off the other modules!
Having experimented a little, I can't wait to start exploring all the methods to select hyperparameters, as the practice I did on other datasets really highlighted how difficult it is to select them well!
Yes, how to make and evaluate hyperparameter choices in a systematic fashion will be a major focus of the first two weeks of Course 2 and essentially all of Course 3, although Course 3 also focusses a bit more on the data side of things.
True. With the original initialization, cost gets "stuck" at around 0.64 from iterations 600 and beyond. I was amazed to see such a huge impact from seemingly such a small change. Butterfly effect! So glad to see this has been discussed already.