C2W1 Weight Initialization

In the video we learn about initializing the weights with a small variance, so that (as I understood it) they won’t vary much from 1. This is done by multiplying the random elements of w by sqrt(2/n), which also reduces the mean. I understand the point about the variance, but I don’t understand why the result is more “centered” around 1. np.random.rand() outputs a uniform distribution between 0 and 1, so the mean is 0.5. After the multiplication, wouldn’t the elements be even smaller and certainly not around 1? And if we take a Gaussian distribution, for instance, the mean is 0, so the “centering” would be around 0.
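For concreteness, here is a quick numpy check of the two distributions I mean (the seed and sample size are arbitrary, just for illustration):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility

u = np.random.rand(100_000)   # uniform on [0, 1)
g = np.random.randn(100_000)  # standard normal, mu = 0, sigma = 1

print(u.mean())  # ~0.5, as described above
print(g.mean())  # ~0.0
```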

Unless I’m missing something here, the argument above would only work if the mean of the random sample were 1, which to my understanding is not the case.

Can you give us a reference to where the statement is made that the intent is to center the data around 1? I agree with your statement that this doesn’t seem to make sense. If it’s in the lectures, please give us the name of the lecture and the time offset.

Around minute 3:30 in the video I was talking about: Weight Initialization for Deep Networks.
https://www.coursera.org/learn/deep-neural-network/lecture/RwqYe/weight-initialization-for-deep-networks

I would add that the phrase “centering” is my interpretation of what is being said, unless I understood incorrectly.

Ok, I listened to that section and I think the confusion stems from the fact that when he talks about something staying reasonably close to 1, he is not talking about the weight values: he means the result of the linear combination, which is z, right? Listen again and notice that he says he’s basing these estimates on the assumption that the input values x_i have \mu = 0 and \sigma = 1. So the goal is to keep the absolute value of z close to 1 by limiting the weight values: they are multiplied by the scale factor sqrt(2/n), whose denominator n is the number of terms in the sum.
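To make that concrete, here is a small numpy sketch (the layer widths and sample count are arbitrary) showing that with the sqrt(2/n) scaling, the spread of z stays on the order of 1 no matter how large n gets:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility

# For each fan-in n, draw many (w, x) pairs and measure the spread of
# z = w . x when w is He-initialized and the inputs x_i are N(0, 1).
for n in [10, 100, 1000, 10_000]:
    zs = [np.dot(np.random.randn(n) * np.sqrt(2.0 / n),  # He-scaled weights
                 np.random.randn(n))                      # inputs, mu = 0, sigma = 1
          for _ in range(2000)]
    # Var(z) = n * (2/n) * 1 = 2, so std(z) stays near sqrt(2) ~ 1.41
    # independent of n.
    print(n, np.std(zs))
```

Without the scale factor, Var(z) would grow linearly with n, which is exactly the exploding behavior the lecture is trying to avoid.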

Note that Prof Ng always uses a Gaussian distribution (np.random.randn, not np.random.rand) for the random initialization values with \mu = 0, so multiplying by a factor changes only the variance, not the mean.
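A quick way to see that (the shapes here are just illustrative):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility

n = 100  # illustrative fan-in; any layer width behaves the same way
w = np.random.randn(4, n) * np.sqrt(2.0 / n)  # Gaussian draw, then He scaling

print(w.mean())  # still ~0: scaling a zero-mean sample leaves the mean at 0
print(w.std())   # ~sqrt(2/100) ~ 0.14: only the spread shrinks
```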