Weight Initialization for Deep Networks WiXi

In Course 2, ‘Setting up your Optimization Problem’ section, ‘Weight Initialization for Deep Networks’ video, at time 1:10, Dr. Ng says:

So in order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want Wi to be, right? Because z is the sum of the WiXi. And so if you’re adding up a lot of these terms, you want each of these terms to be smaller.

I have two problems with the above statement:

  1. Why would we want each of the WiXi terms to be smaller in order to get a smaller z? That would be true if we knew that all of the WiXi terms were positive: when adding many positive terms, making each term smaller does make z smaller. But in our deep neural network both Xi and Wi can be negative, so some of the WiXi terms can be negative, and we might end up with terms like:
    -999.99, 1000, -100.99, 101, …
    The sum of these terms is 0.02, which is a small value of z despite each WiXi being large!
    (I used exaggerated numbers just to show that it is possible to have large WiXi terms and still get a small z.)

  2. I think one answer to the above problem would be: yes, we can have large WiXi terms and still get a small z thanks to the mix of positive and negative signs, but it is a matter of probabilities. When each WiXi is smaller, the probability of getting a small z is higher than when the WiXi are large and we are hoping that the positive and negative terms cancel out. But how do we know anything about this probability? Each term WiXi consists of Wi, whose distribution we set ourselves, and Xi, whose distribution we know nothing about and cannot set, since it is gathered from outside. This means we will have:
    Gaussian Random Variable (Wi) * Unknown Variable (Xi)
    However, the unknown variable Xi is usually normalized (to make the cost function more spherical), so it usually has mean 0 and variance 1 and is often treated as a Gaussian variable too, so we have:
    WiXi = Gaussian Random Variable (Wi) * Gaussian Random Variable (Xi)
    Wi: mean 0 and variance 1/n
    Xi: mean 0 and variance 1
    This means Wi is a small number (if n = 100, the standard deviation is 1/10, and 99.7% of the Wi values lie between -3 sigma and +3 sigma, i.e. between -0.3 and +0.3).
    Xi is also a Gaussian variable with mean 0 and variance 1, which means 99.7% of the Xi values lie between -3 and +3.
    Multiplying two small numbers gives an extremely small number, for example:
    0.01 * 0.01 = 0.0001
    Why doesn’t this cause vanishing gradients? (See the simulation sketch after this list.)
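
To make the probabilistic argument concrete, here is a minimal numpy simulation sketch of my own (not from the lecture), assuming the Wi and Xi are independent with Wi ~ N(0, 1/n) and Xi ~ N(0, 1):

```python
import numpy as np

# Toy simulation (my own illustration, not from the course):
# draw w_i ~ N(0, 1/n) and x_i ~ N(0, 1), form z = sum_i w_i * x_i,
# and inspect the individual terms and the spread of z.
rng = np.random.default_rng(0)
n = 100          # number of inputs feeding one unit
trials = 10_000  # repeat many times to estimate the distribution of z

W = rng.normal(0.0, np.sqrt(1.0 / n), size=(trials, n))  # Var(w_i) = 1/n
X = rng.normal(0.0, 1.0, size=(trials, n))               # Var(x_i) = 1

terms = W * X              # the individual w_i * x_i terms (mixed signs)
z = terms.sum(axis=1)      # one z per trial

print("fraction of negative terms:", (terms < 0).mean())  # about 0.5
print("typical |w_i x_i|:", np.abs(terms).mean())          # small, around 0.06
print("variance of z:", z.var())                           # about 1
```

In this sketch the individual terms are indeed tiny and of both signs, but the variances of the n independent terms add, so Var(z) = n · (1/n) · 1 = 1.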

Could you please answer both questions? Thanks.

Maybe not a full answer, but here are some thoughts:

  1. I think Prof Ng is just talking about the magnitudes (absolute values) of the W^{[l]}_{i,j} values. As you say, both the features and the coefficients can be either positive or negative. But you want to keep the expected value of the linear combination relatively small. Note that in the special case of image data, the usual method of preprocessing the inputs is to divide the pixel values by 255, so you do end up with positive feature values in the range [0,1].
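
For reference, that preprocessing step looks roughly like this (a small sketch assuming 8-bit image data, not code from the course):

```python
import numpy as np

# Scale raw 8-bit pixel intensities from [0, 255] down to [0, 1]
# before feeding them to the network (made-up toy data).
raw_pixels = np.array([[0, 64, 128], [192, 255, 32]], dtype=np.uint8)
x = raw_pixels.astype(np.float64) / 255.0
print(x.min(), x.max())   # 0.0 1.0
```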

  2. The point here is about gradients, right? He’s assuming sigmoid at the output layer, so if the Z values are small in absolute value (< 1), then you’re in the region of sigmoid where the derivative is relatively large. It is for inputs with absolute values >> 1 where the tails of the function flatten out, right? That’s where you get vanishing gradient problems.
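
A quick way to see this numerically (my own sketch, not from the course) is to evaluate the sigmoid derivative sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) at a few values of z:

```python
import numpy as np

# sigmoid'(z) is largest at z = 0 and flattens out for |z| >> 1,
# which is where vanishing gradients come from.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 1.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z = {z:5.1f}   sigmoid'(z) = {s * (1.0 - s):.6f}")

# z =   0.0   sigmoid'(z) = 0.250000
# z =   1.0   sigmoid'(z) = 0.196612
# z =   5.0   sigmoid'(z) = 0.006648
# z =  10.0   sigmoid'(z) = 0.000045
```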

Thanks paulinpaloalto,
I completely understood your first answer.
But about the second answer: in part of the lecture Dr. Ng says you can use 2/n when you use ReLU as your activation function and that it works better, so I think what he said about the WiXi and z values also applies when we use ReLU activations. Actually, our gradients never explode with the ReLU activation function because its derivative is either 1 or 0, so why should we care about exploding gradients? Shouldn’t we focus only on vanishing gradients when we use ReLU?
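
For context, the variance-2/n initialization the lecture mentions for ReLU (He initialization) looks roughly like this in numpy; the layer sizes below are made up:

```python
import numpy as np

# Sketch of He initialization for a ReLU layer: scale a standard Gaussian
# by sqrt(2 / n_prev) so that Var(W) = 2 / n_prev.
rng = np.random.default_rng(0)

def initialize_layer(n_prev, n_curr):
    W = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
    b = np.zeros((n_curr, 1))
    return W, b

W1, b1 = initialize_layer(n_prev=100, n_curr=50)
print(W1.var())   # close to 2/100 = 0.02
```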

It’s not only the derivatives that matter in the computation of the gradients, right? The coefficients show up there as factors as well. So even if you are multiplying them by 1 from the ReLU gradients, you are still doing Chain Rule multiplies all the way “out” to J. The exploding gradient problem is more likely to happen in very deep networks, as is the vanishing gradient problem: the product of numbers with absolute value > 1 gets bigger in absolute value the more factors you have, and the product of numbers with magnitude < 1 gets smaller the more you multiply together. So in a deep network you have to worry about both. There may be a “Goldilocks” sweet spot that you have to tune your initialization algorithm to hit. Or you can use techniques like the skip connections in Residual Net architectures to get the gradients to behave better.
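
To put rough numbers on that (a toy illustration with made-up factors, not from the course):

```python
# Repeated multiplication of per-layer factors in a 50-layer chain:
L = 50

print(1.1 ** L)   # about 117     -> factors slightly > 1 explode
print(0.9 ** L)   # about 0.005   -> factors slightly < 1 vanish
print(1.0 ** L)   # exactly 1.0   -> the "Goldilocks" case
```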
