I understand that initializing the Ws to zeros will affect the outcome of our predictions, since we will have a symmetry issue.
However, I am confused as to why ‘He’ initialization does better than the regular initialization. Is it only doing better because we are restricting the number of epochs? Would the two methods eventually reach the same result in the long run (taking into consideration that the # of epochs would be different)?
Interesting question! There are many different methods of random initialization and they are not all equivalent. Some of them work better in some (but not all) cases. He Initialization is one of the more sophisticated such algorithms. To see one example of different behavior, try going back to the 4 layer model exercise in C1 W4 A2 and notice that they gave you either He or Xavier Initialization there. If you try using the simple initialization they had us build in C1 W4 “Step by Step”, you’ll find that the convergence is really terrible, but the He init works well.
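For reference, the key idea of He Initialization is that each weight matrix is drawn from a Gaussian and scaled by sqrt(2 / n_prev), where n_prev is the number of units feeding into the layer, which keeps the variance of the activations roughly stable through ReLU layers. Here is a minimal numpy sketch (the function name and the `layer_dims` convention are my own shorthand, not the exact code from the notebook):

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """He initialization: N(0, 1) weights scaled by sqrt(2 / n_prev).

    layer_dims: list of layer sizes, e.g. [n_x, n_h1, ..., n_y].
    Returns a dict of weight matrices W1..WL and bias vectors b1..bL.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # The sqrt(2 / n_prev) factor is the "He" scaling for ReLU layers
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
        # Biases can safely start at zero; they don't cause symmetry problems
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params
```

Compare that with the simple `* 0.01` scaling from the “Step by Step” assignment: with many layers, a fixed small factor lets the activations shrink (or blow up) as you go deeper, which is why convergence there is so slow.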
It turns out that there is no one “silver bullet” solution for random initialization that works best in all cases. As Prof Ng says in the lectures here, the choice of initialization algorithm is another “hyperparameter” that you need to choose as the system designer.
Prof Ng mentions the need for Symmetry Breaking in the Neural Net case, but doesn’t really go into the details. If you want to explore that, here’s a thread which shows why zero initialization works for Logistic Regression, but not for Neural Nets.
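To make the symmetry problem concrete, here is a tiny sketch of one gradient step on a 2-layer net (tanh hidden layer, sigmoid output) with all-zero weights. The data and shapes are made up purely for illustration. Because the hidden activations are all zero, the gradient flowing back into W1 is exactly zero, so the hidden units start identical and stay identical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))              # 3 features, 5 examples (made-up data)
Y = (rng.random((1, 5)) > 0.5).astype(float)  # made-up binary labels

# All-zero initialization for a net with 2 hidden units
W1 = np.zeros((2, 3)); b1 = np.zeros((2, 1))
W2 = np.zeros((1, 2)); b2 = np.zeros((1, 1))

# Forward pass
Z1 = W1 @ X + b1; A1 = np.tanh(Z1)           # A1 is all zeros: tanh(0) = 0
Z2 = W2 @ A1 + b2; A2 = 1 / (1 + np.exp(-Z2))

# Backward pass (binary cross-entropy gradients)
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m                          # zero, because A1 is zero
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # zero, because W2 is zero
dW1 = dZ1 @ X.T / m                           # so dW1 is exactly zero

W1 -= 0.1 * dW1                               # one update step changes nothing
print(np.allclose(W1[0], W1[1]))              # the two hidden rows stay identical
```

In Logistic Regression there is no hidden layer, so there is nothing to be symmetric with, and gradient descent can move the single weight vector off zero on the first step, which is the essence of the argument in that thread.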