One other thought here is that initialization algorithms matter more than you might intuitively expect. If you are just using the simple version of initialize_parameters_deep that they had us build in DLS C1 W4 A1, then you should also try a more sophisticated init function for your deeper network. Take a look at the actual algorithm used in DLS C1 W4 A2 for initialize_parameters_deep: it's a version of the "He" initialization that Prof Ng shows us in DLS C2 W1. From the DLS C1 W4 A2 notebook, just click "File → Open" and have a look at the utility functions Python file.
And just as an illuminating experiment, try training the 4-layer network in C1 W4 A2 with the simple init function from W4 A1 and see how much worse the convergence is compared to the "He" initialization. As I said above, it's surprising and a bit counterintuitive that it makes that much of a difference.
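In case it's useful, here is a minimal sketch of the two schemes side by side. This is my own simplified version, not the course's exact utility code, and the exact scaling factor in the assignment may differ slightly:

```python
import numpy as np

def initialize_parameters_deep(layer_dims, method="he"):
    """Sketch of deep-network parameter initialization (not the course's exact code).

    layer_dims -- list like [n_x, n_h1, ..., n_y]
    method     -- "simple" scales random weights by 0.01,
                  "he" scales by sqrt(2 / n_prev), as in He et al. (good for ReLU)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # number of layers, including the input layer

    for l in range(1, L):
        if method == "he":
            scale = np.sqrt(2.0 / layer_dims[l - 1])
        else:
            scale = 0.01
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))  # biases start at zero

    return parameters
```

The only difference is the scale applied to the random weights, which is what makes the result so counterintuitive.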
Yes, that should be helpful. Besides zero gradients, also look for any strange patterns, such as the same gradient values across samples in the same iteration, or the same gradient values across iterations. These are all useful pointers, and any of them should be easy to spot once you print the gradient values out.
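A simple printout like the sketch below is usually enough to surface those patterns. It assumes your backprop returns a dictionary of gradients keyed "dW1", "db1", and so on, as in the course assignments; adjust the keys to whatever your code uses:

```python
import numpy as np

def inspect_gradients(grads, iteration):
    """Print simple statistics that make suspicious gradient patterns easy to spot."""
    for name, g in sorted(grads.items()):
        zero_frac = np.mean(g == 0)                    # large fraction of zeros -> dead units / broken backprop
        unique_vals = np.unique(np.round(g, 8)).size   # very few distinct values is also suspicious
        print(f"iter {iteration:5d}  {name:>5s}  "
              f"mean={g.mean():+.3e}  std={g.std():.3e}  "
              f"zero_frac={zero_frac:.2f}  unique={unique_vals}")
```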
I implemented the gradient checking and everything went very well.
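For anyone else following along, this is roughly the check I implemented. It is a simplified sketch; cost_fn, grad_fn and the flattened theta vector are just placeholders for my own helper functions:

```python
import numpy as np

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Two-sided numerical gradient check (simplified sketch).

    cost_fn -- returns the scalar cost J(theta)
    grad_fn -- returns the analytic gradient dJ/dtheta (same shape as theta)
    theta   -- all parameters flattened into one 1-D float vector
    """
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)

    grad = grad_fn(theta)
    # Relative difference: roughly < 1e-7 is great, > 1e-3 suggests a bug in backprop
    diff = np.linalg.norm(grad - grad_approx) / (np.linalg.norm(grad) + np.linalg.norm(grad_approx))
    return diff
```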
The really interesting part, though, was seeing the impact of parameter initialization on model performance. As you suggested, this is what made the difference.
First, not initializing b[l] to zero enabled the model to learn, even when the number of layers increased. However, it started learning only later, as documented by the cost graph below.
Using He initialization for W[l] then had another big impact: it enabled the model with more layers to learn faster, and the cost went down further.
This means that everything is clear for now. Thank you very much for your support. I hope this thread also helps other people with similar problems.
That's great news that the He initialization worked so much better. As I mentioned earlier, it's counterintuitive and almost shocking that such a seemingly small change in how you initialize would have such a big impact. We depend on the work of researchers like Prof Ng and his colleagues who figured all this stuff out, and thanks to Prof Ng for giving us such a good survey of the techniques.