Hi, I just finished watching DLS1-W3’s video on random initialization. The video mentioned that if we initialized the weights to a zero matrix, then each neuron within the layer would learn the same thing. However, why is initializing to a random, non-zero matrix better than initializing to an identity matrix (or the identity matrix multiplied by a small number)?
Thanks!
The “weight” matrices here are typically not square, so the identity matrix is not really an option (unless you meant a matrix full of ones). But the point is that any uniform value gives the same result for every neuron, so the problem is not really zero values per se: it is the “symmetry” of the initialization that is the problem. Here’s a thread about the mathematics behind why “symmetry breaking” is required.
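To make the symmetry argument concrete, here is a minimal NumPy sketch (the 3 → 4 → 1 network shape, the data, and the names `W1`/`W2` are made up for illustration, loosely following the course notation). With every weight set to the same constant, one backprop pass produces identical gradient rows for all hidden units, so they can never diverge from each other; random initialization breaks that symmetry.

```python
import numpy as np

# Minimal 3 -> 4 -> 1 network, one backprop pass (made-up data; a sketch,
# not the assignment's code).
np.random.seed(0)
X = np.random.randn(3, 5)               # 5 training examples
Y = (np.random.rand(1, 5) > 0.5) * 1.0  # binary labels

def dW1_after_one_pass(W1, W2):
    Z1 = W1 @ X
    A1 = np.tanh(Z1)                     # hidden activations
    Z2 = W2 @ A1
    A2 = 1.0 / (1.0 + np.exp(-Z2))       # sigmoid output
    dZ2 = A2 - Y                         # cross-entropy gradient at output
    dZ1 = (W2.T @ dZ2) * (1.0 - A1**2)   # backprop through tanh
    return (dZ1 @ X.T) / X.shape[1]      # gradient w.r.t. W1

# Uniform init: every hidden unit starts identical, and the gradient rows
# come out identical too, so the units would stay clones forever.
sym = dW1_after_one_pass(np.ones((4, 3)) * 0.01, np.ones((1, 4)) * 0.01)
print(np.allclose(sym, sym[0]))          # True: all rows equal

# Random init breaks the symmetry: each unit gets a different gradient.
rnd = dW1_after_one_pass(np.random.randn(4, 3) * 0.01,
                         np.random.randn(1, 4) * 0.01)
print(np.allclose(rnd, rnd[0]))          # False: rows differ
```

Note that the uniform case stays symmetric even though the constant is non-zero, which is exactly why randomness, not merely non-zero values, is what matters.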
It turns out that there are a number of different algorithms you can use for random initialization, but that is a more advanced topic that Prof Ng will cover in Course 2 of this series, so please “stay tuned” for that.
Thank you! That’s very helpful.