In the video, Andrew discusses scaling the randomly initialized W matrix by a factor based on the number of inputs, so that the variance of the weights is 1/n. He says we do this to keep the value of Z small: "larger n → smaller w_i".
He never explains WHY we want this. Can someone explain?
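For context, here is a minimal numpy sketch of the kind of initialization being discussed (the variable names and the (n_out, n_in) shape convention are my own, not from the lecture):

```python
import numpy as np

np.random.seed(0)

n_in, n_out = 500, 100                                   # n_in is the "n" in the lecture: inputs to the layer
W = np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)   # randn has variance 1, so Var(w_i) ≈ 1/n_in after scaling
b = np.zeros((n_out, 1))

print(W.var())   # close to 1/500 = 0.002
```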
If that’s the question (why you want Z to be relatively small), it’s been a while since I watched those lectures, but I’m pretty sure he does explain it: the point is vanishing gradients when you get out onto the “tails” of sigmoid, which is the output activation here, isn’t it? The other point is that the larger (more layers) your network, the more of a “compounding problem” you have: if you keep multiplying big numbers (both larger W_{ij} values and larger intermediate Z values in the hidden layers), the products only get bigger, whereas if you multiply numbers < 1, they get smaller. Keeping Z small keeps you in the relatively linear region of sigmoid, where you get nice gradients and thus better convergence.
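As a small illustration of that “flat tails” point (a sketch I’m adding here, not something from the lecture): the derivative of sigmoid is \sigma(z)(1 - \sigma(z)), and it collapses toward zero once |z| gets large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    g = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of sigmoid at z
    print(f"z = {z:5.1f}  sigmoid'(z) = {g:.6f}")
# z =   0.0  sigmoid'(z) = 0.250000
# z =   2.0  sigmoid'(z) = 0.104994
# z =   5.0  sigmoid'(z) = 0.006648
# z =  10.0  sigmoid'(z) = 0.000045
```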
As Prof. Ng explained at timestamp 2:20, changing the factor to 2/n works better when the ReLU activation function is used, helping the network’s gradients not to explode or vanish as quickly.
Right! Then the higher level point that we can generalize from this situation is that there is no one “magic bullet” answer that works best in all cases. A lot of research and experimentation has been done, and we now have a suite of initialization functions provided by the various frameworks like TensorFlow and PyTorch. When you are designing a solution to a particular problem, you can use the guidance that Prof Ng gives in the lecture, but some experimentation may still be required to select the type that works best in your situation.
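To make the 1/n vs 2/n distinction and the “suite of initializers” point concrete, here is a short sketch using PyTorch as the example framework (my own illustration, not from the course notebooks):

```python
import torch
import torch.nn as nn

layer_tanh = nn.Linear(500, 100)   # 500 inputs, 100 outputs
layer_relu = nn.Linear(500, 100)

# Xavier/Glorot-style: a close cousin of the 1/n idea (uses 2/(fan_in + fan_out)),
# typically paired with tanh or sigmoid activations.
nn.init.xavier_normal_(layer_tanh.weight)

# He/Kaiming-style: the 2/n variant mentioned for ReLU.
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity='relu')

print(layer_tanh.weight.var().item())   # ≈ 2/(500+100) ≈ 0.0033
print(layer_relu.weight.var().item())   # ≈ 2/500 = 0.004
```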
Hi Kic, thank you for the reply. My question was: why set the variance to 1/n or 2/n? Specifically, why 1/n in the first place? I understand that the change from 1/n to 2/n is due to ReLU, and I understand the previous step (large n leads to smaller w_i), but why set the variance to 1/n?
The point is that the larger the number of layers, the more chance there is that the Z values grow in absolute value as you compound the layers, if you leave the initial magnitude of the weights constant across all the layers. Given that we are dealing with classifiers here, we always have sigmoid or softmax as the output activation, which means the “flat tails” of sigmoid are a problem for vanishing gradients. So one important way to avoid getting out onto the tails of the function is to keep |z| smaller; Prof Ng addresses that in some detail in the lectures. The factor of \frac{1}{n} is how you compensate: z = \sum_{i=1}^{n} w_i x_i is a sum of n terms, so its variance grows roughly in proportion to n \cdot \mathrm{Var}(w_i), and setting \mathrm{Var}(w_i) = \frac{1}{n} keeps \mathrm{Var}(z) on the order of 1 no matter how many inputs a layer has, which in turn keeps the layer-by-layer compounding under control. That’s the general idea, and then you get into the details of whether \frac{1}{n} or \frac{2}{n} or \frac{1}{\sqrt{n}} or yet some other formulation works better in a given scenario. As mentioned earlier, there is no one “silver bullet” answer that works best in all cases. Prof Ng is just showing us some of the solutions that have been seen to work well in enough cases that they are worth trying.
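Here is a quick numerical sketch of that variance argument, under the usual simplifying assumption that the inputs to the layer have roughly unit variance (my own illustration): without the scaling, Var(z) grows roughly like n; with Var(w) = 1/n, it stays near 1.

```python
import numpy as np

np.random.seed(0)
m = 10000                                    # number of example inputs

for n in [10, 100, 1000]:
    x = np.random.randn(n, m)                # inputs with variance ~1
    W_unscaled = np.random.randn(1, n)                        # Var(w) = 1
    W_scaled   = np.random.randn(1, n) * np.sqrt(1.0 / n)     # Var(w) = 1/n

    print(f"n={n:5d}  Var(z) unscaled ≈ {np.dot(W_unscaled, x).var():8.2f}  "
          f"scaled ≈ {np.dot(W_scaled, x).var():.2f}")
# Var(z) grows roughly like n without the scaling, but stays near 1 with it.
```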