In Course 2, in the ‘Setting up your Optimization Problem’ section, in the ‘Weight Initialization for Deep Networks’ video at ‘01:10’, Dr. Ng says:
So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want Wi to be, right? Because z is the sum of the WiXi. And so if you’re adding up a lot of these terms, you want each of these terms to be smaller.
I have two problems with the above statement:
Why would we want each of the WiXi terms to be smaller in order to get a smaller z? The above statement would be true if we knew that all of the WiXi terms were positive: when adding many positive terms, each term has to be smaller for the sum to stay small. But in our deep neural network case both Xi and Wi can be negative, so the term WiXi can sometimes be negative too, and we might have WiXi terms like the ones below:
-999.99, 1000, -100.99, 101, …
The sum of the above terms is 0.02, which is a small value for z despite the large magnitude of each WiXi!
I used exaggerated numbers just to show that it is possible to have large WiXi terms and still have a small z.
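The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the cancellation effect; the variable names `terms`, `z`, and `largest_term` are mine and are not from the course.

```python
import math

# The four exaggerated WiXi terms from the example above.
terms = [-999.99, 1000, -100.99, 101]

# math.fsum limits the floating-point rounding error of the summation.
z = math.fsum(terms)

# Magnitude of the largest single term, for comparison with z.
largest_term = max(abs(t) for t in terms)

print(f"z = {z:.2f}, largest |WiXi| = {largest_term}")
```

The sum comes out tiny compared with the individual terms, which is exactly the cancellation I am asking about.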
I think one answer to the above problem would be: yes, we can have large WiXi terms and still get a small z because positive and negative terms cancel, but it is a matter of probability. When each WiXi is smaller, a small z is more likely than when the WiXi terms are large and we are hoping that the positive and negative terms happen to have equal magnitude. But how do we know this probability? Each term WiXi consists of Wi, whose distribution we set, and Xi, whose distribution we know nothing about and cannot set, since it is gathered from outside. This means we will have:
Gaussian Random Variable (Wi) * Unknown Variable (Xi)
However, the unknown variable Xi is usually normalized (to make the cost function more spherical), so it usually has mean 0 and variance 1, and it is often described as a Gaussian variable as well, so we have:
WiXi = Gaussian Random Variable (Wi) * Gaussian Random Variable (Xi)
Wi: mean of 0 and variance of 1/n
Xi: mean of 0 and variance 1
This means Wi is a small number (if n=100, the standard deviation is 1/10, and since 99.7% of a Gaussian's values lie between -3 sigma and +3 sigma, 99.7% of the Wi's will be between -0.3 and +0.3).
Xi is also a Gaussian variable with mean 0 and variance 1, which means 99.7% of the Xi's are between -3 and +3.
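These 3-sigma ranges can be sanity-checked by sampling. The sketch below assumes n=100 and that both Wi and Xi really are Gaussian, which for Xi is my assumption and not something the data guarantees. The helper `fraction_within` is a hypothetical name of my own.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n = 100                   # assumed number of inputs to the neuron
sigma_w = (1 / n) ** 0.5  # standard deviation of Wi when its variance is 1/n
sigma_x = 1.0             # standard deviation of Xi after normalization
num_samples = 200_000


def fraction_within(sigma, bound, count):
    """Fraction of N(0, sigma^2) samples whose absolute value is <= bound."""
    hits = sum(1 for _ in range(count) if abs(random.gauss(0.0, sigma)) <= bound)
    return hits / count


# Share of samples inside the +/- 3 sigma band for each variable.
frac_w = fraction_within(sigma_w, 3 * sigma_w, num_samples)
frac_x = fraction_within(sigma_x, 3 * sigma_x, num_samples)

print(f"Wi within +/-{3 * sigma_w:.1f}: {frac_w:.4f}")
print(f"Xi within +/-{3 * sigma_x:.1f}: {frac_x:.4f}")
```

Both fractions should land close to 0.997 under these assumptions.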
Multiplying two small numbers together gives an extremely small number, for example:
0.01 * 0.01 = 0.0001
Why doesn’t this result in vanishing gradients?
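One way I could probe this numerically is to compare the size of a single WiXi term with the spread of the whole sum z. The sketch below is not from the course: it assumes n=100, that Wi and Xi are independent zero-mean Gaussians, and it uses a hypothetical helper `simulate` of my own. It also runs the same experiment with Var(Wi)=1 for comparison.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def simulate(n, w_variance, num_trials=3000):
    """Return (mean |WiXi| of a single term, standard deviation of z).

    Assumes Wi ~ N(0, w_variance) and Xi ~ N(0, 1), all independent.
    """
    w_std = w_variance ** 0.5
    abs_term_total = 0.0
    z_values = []
    for _ in range(num_trials):
        terms = [random.gauss(0.0, w_std) * random.gauss(0.0, 1.0) for _ in range(n)]
        abs_term_total += sum(abs(t) for t in terms)
        z_values.append(sum(terms))  # z is the sum of the n WiXi terms
    mean_abs_term = abs_term_total / (num_trials * n)
    z_mean = sum(z_values) / num_trials
    z_std = (sum((z - z_mean) ** 2 for z in z_values) / num_trials) ** 0.5
    return mean_abs_term, z_std


n = 100
scaled_term, scaled_z_std = simulate(n, 1 / n)     # Var(Wi) = 1/n
unscaled_term, unscaled_z_std = simulate(n, 1.0)   # Var(Wi) = 1

print(f"Var(Wi)=1/n: typical |WiXi| = {scaled_term:.3f}, std of z = {scaled_z_std:.2f}")
print(f"Var(Wi)=1:   typical |WiXi| = {unscaled_term:.3f}, std of z = {unscaled_z_std:.2f}")
```

If the independence assumption holds, Var(WiXi) = Var(Wi) * Var(Xi) = 1/n, so the sum of n such terms should have variance close to 1 even though each individual term is small. With Var(Wi)=1 the spread of z should instead grow to roughly sqrt(n).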
Please help me with both problems, thanks.