In the video we learn about initializing the weights with a small variance, so that they won’t vary much from 1. This is done by multiplying the random elements of w by sqrt(2/n), which also reduces the mean. I understand the point about the variance, but I don’t understand why the result is more “centered” around 1. np.random.rand() outputs a uniform distribution between 0 and 1, so the mean is 0.5. After the multiplication, wouldn’t the elements be even smaller, and certainly not around 1? And if we take a Gaussian distribution, for instance, the mean is 0, so the “centering” would be around 0.
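For concreteness, here is a quick sketch of what I mean (the numbers are illustrative, and n is just an assumed layer size):

```python
import numpy as np

np.random.seed(42)
n = 100                           # assumed number of inputs to the layer

w = np.random.rand(10000)         # uniform on [0, 1), mean ~0.5
w_scaled = w * np.sqrt(2.0 / n)   # the sqrt(2/n) scaling shrinks the mean too

print(w.mean())         # close to 0.5
print(w_scaled.mean())  # much smaller, nowhere near 1
```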

Unless I’m missing something here, I would expect the argument above to work only if the mean of the random sample were 1, and to my understanding that is not the case.

Can you give us a reference to where the statement is made that the intent is to center the data around 1? I agree with your statement that this doesn’t seem to make sense. If it’s in the lectures, please give us the name of the lecture and the time offset.

Ok, I listened to that section, and I think the confusion stems from the fact that when he talks about something staying reasonably close to 1, he is not talking about the weight values: he means the result of the linear combination, which is z, right? Listen again and notice that he bases these estimates on the assumption that the input values x_i have \mu = 0 and \sigma = 1. So the goal is to keep the absolute value of z close to 1 by limiting the weight values, multiplying them by a scale factor that has the number of terms in the sum in the denominator.
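You can check that reading numerically. A small sketch (the sizes n and trials are just illustrative choices), assuming x_i ~ N(0, 1) as he says:

```python
import numpy as np

np.random.seed(1)
n = 1000        # number of inputs feeding into the unit
trials = 5000   # how many input samples to draw

x = np.random.randn(trials, n)             # inputs with mu = 0, sigma = 1
w = np.random.randn(n) * np.sqrt(2.0 / n)  # scaled Gaussian initialization
z = x @ w                                  # linear combination z for each trial

# Without the sqrt(2/n) factor, std(z) would be on the order of sqrt(n) ~ 31.6;
# with it, Var(z) = n * (2/n) * 1 = 2, so std(z) stays near sqrt(2) ~ 1.41.
print(z.mean(), z.std())
```

So it is |z|, not the weights themselves, that the scaling keeps on the order of 1.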

Note that Prof Ng always uses a Gaussian distribution (np.random.randn, not np.random.rand) for the random initialization values, with \mu = 0, so multiplying by a factor changes only the variance, not the mean.
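A quick demonstration of that last point (sample sizes are illustrative): scaling zero-mean Gaussian samples rescales the standard deviation but leaves the mean at 0.

```python
import numpy as np

np.random.seed(0)
n = 500  # assumed fan-in of the layer

w = np.random.randn(10000)       # Gaussian samples, mean 0, std 1
w_scaled = w * np.sqrt(2.0 / n)  # scaling used in the lecture

print(w.mean(), w.std())                # mean ~0, std ~1
print(w_scaled.mean(), w_scaled.std())  # mean still ~0, std ~sqrt(2/n)
```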