In the Week 1 video "Weight Initialization for Deep Networks" at 1:10, Andrew says:
“So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want Wi to be.
Because z is the sum of the WiXi. And so if you’re adding up a lot of these terms, you want each of these terms to be smaller.”
How does a smaller Wi help z not vanish?
I can understand how you want a smaller Wi to tame the potential exponential growth of z, but isn’t it the opposite if you want to avoid the exponential shrinking of z? Doesn’t Wi need to be bigger for big n?
When Andrew says smaller, does he mean that Wi has to be closer to 1?
Honestly, I don’t know if these questions make sense, since I don’t have any experience with deep networks and that video was brief, but I was just wondering.
I think the point here is that there is a sweet spot in the initialization. The quote just means that, in general, the more terms you have (i.e., the WiXi products), the smaller you want each of them to be, so as to guard against exploding activations and gradients. That said, it should also be noted that while you want them small, they cannot be way too small either, or you run into the vanishing problem instead.
In my experience, it is easier to get exploding gradients (e.g., getting NaNs in Tensorflow) than it is to get vanishing ones (which I tend to associate more with really deep networks).
In short, I think "smaller" in the above quote just means small relative to the overall total z: with more terms in the sum, each individual WiXi must contribute less so that z stays in a reasonable range.
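To make the sweet spot concrete, here is a minimal NumPy sketch (not from the video; the function name and trial count are my own choices). It estimates the spread of z = sum of WiXi for a single unit with n inputs, comparing weights drawn as W ~ N(0, 1) against weights scaled down by 1/sqrt(n), i.e. W ~ N(0, 1/n):

```python
import numpy as np

rng = np.random.default_rng(0)

def z_std(n, w_scale, trials=10_000):
    """Estimate the std of z = sum_i(W_i * X_i) for a unit with n inputs.

    X_i ~ N(0, 1); W_i ~ N(0, w_scale^2). Each trial draws fresh weights
    and inputs, so this measures the spread of z at initialization.
    """
    X = rng.standard_normal((trials, n))
    W = rng.standard_normal((trials, n)) * w_scale
    z = (W * X).sum(axis=1)
    return z.std()

for n in (10, 100, 1000):
    # Unscaled weights: std of z grows like sqrt(n) as n increases.
    naive = z_std(n, 1.0)
    # Scaled weights (w_scale = 1/sqrt(n)): std of z stays near 1.
    scaled = z_std(n, 1.0 / np.sqrt(n))
    print(f"n={n:4d}  naive std ~ {naive:6.2f}  scaled std ~ {scaled:5.2f}")
```

With the 1/sqrt(n) scaling, the spread of z stays around 1 no matter how many inputs there are, which is exactly the "larger n, smaller Wi" idea: smaller per-weight variance, not weights pushed toward 1.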
I found this article, and even though I just read the first couple of sections, I think it explains the issue with weight initialization in an intuitive way.