In the video, Andrew discusses scaling the randomly initialized W matrix by a factor based on the number of inputs, so that the variance of the weights is 1/n. He says we do this to keep the value of Z small: "larger n → smaller w_i".
He never explains WHY we want this. Can someone explain?
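For context, here is a minimal numpy sketch of the kind of initialization being discussed (the variable names and the (n_out, n_in) shape convention are my own, not from the lecture):

```python
import numpy as np

np.random.seed(0)

n_in, n_out = 500, 100                                   # n_in is the "n" in the lecture: inputs to the layer
W = np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)   # randn has variance 1, so Var(w_i) ≈ 1/n_in after scaling
b = np.zeros((n_out, 1))

print(W.var())   # close to 1/500 = 0.002
```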
If that’s the question (why you want Z to be relatively small), it’s been a while since I watched those lectures, but I’m pretty sure he does explain it: the point is vanishing gradients when you get out onto the “tails” of sigmoid, which is the output activation here, isn’t it? The other point is that the larger (more layers) your network, the more of a “compounding problem” you have: if you keep multiplying big numbers (both larger W_{ij} values and larger intermediate Z values in the hidden layers), the products only get bigger, whereas if you multiply numbers < 1, they get smaller. Keeping Z small keeps you in the relatively linear region of sigmoid, where you get nice gradients and thus better convergence.
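As a small illustration of that “flat tails” point (a sketch I’m adding here, not something from the lecture): the derivative of sigmoid is \sigma(z)(1 - \sigma(z)), and it collapses toward zero once |z| gets large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    g = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of sigmoid at z
    print(f"z = {z:5.1f}  sigmoid'(z) = {g:.6f}")
# z =   0.0  sigmoid'(z) = 0.250000
# z =   2.0  sigmoid'(z) = 0.104994
# z =   5.0  sigmoid'(z) = 0.006648
# z =  10.0  sigmoid'(z) = 0.000045
```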
As Prof. Ng explained at timestamp 2:20, changing the factor to 2/n works better when the ReLU activation function is used, helping the network’s gradients not to explode or vanish as quickly.
Right! Then the higher level point that we can generalize from this situation is that there is no one “magic bullet” answer that works best in all cases. A lot of research and experimentation has been done, and we now have a suite of initialization functions provided by the various frameworks like TensorFlow and PyTorch. When you are designing a solution to a particular problem, you can use the guidance that Prof Ng gives in the lecture, but some experimentation may still be required to select the type that works best in your situation.
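To make the 1/n vs 2/n distinction and the “suite of initializers” point concrete, here is a short sketch using PyTorch as the example framework (my own illustration, not from the course notebooks):

```python
import torch
import torch.nn as nn

layer_tanh = nn.Linear(500, 100)   # 500 inputs, 100 outputs
layer_relu = nn.Linear(500, 100)

# Xavier/Glorot-style: a close cousin of the 1/n idea (uses 2/(fan_in + fan_out)),
# typically paired with tanh or sigmoid activations.
nn.init.xavier_normal_(layer_tanh.weight)

# He/Kaiming-style: the 2/n variant mentioned for ReLU.
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity='relu')

print(layer_tanh.weight.var().item())   # ≈ 2/(500+100) ≈ 0.0033
print(layer_relu.weight.var().item())   # ≈ 2/500 = 0.004
```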
Hi Kic, thank you for the reply. My question was: why set the variance to 1/n or 2/n? Specifically, why 1/n in the first place? I understand that the change from 1/n to 2/n is due to ReLU, and I understand the previous step (large n leads to smaller w_i), but why set the variance to 1/n?
The point is that the larger the number of layers, the more chance there is that the Z values grow in absolute value as you compound the layers, if you leave the initial magnitude of the weights constant across all the layers. Given that we are dealing with classifiers here, we always have sigmoid or softmax as the output activation, which means the “flat tails” of sigmoid are a problem for vanishing gradients. So one important way to avoid getting out onto the tails of the function is to keep |z| smaller; Prof Ng addresses that in some detail in the lectures. The factor of \frac{1}{n} is how you compensate: z = \sum_{i=1}^{n} w_i x_i is a sum of n terms, so its variance grows roughly in proportion to n \cdot \mathrm{Var}(w_i), and setting \mathrm{Var}(w_i) = \frac{1}{n} keeps \mathrm{Var}(z) on the order of 1 no matter how many inputs a layer has, which in turn keeps the layer-by-layer compounding under control. That’s the general idea, and then you get into the details of whether \frac{1}{n} or \frac{2}{n} or \frac{1}{\sqrt{n}} or yet some other formulation works better in a given scenario. As mentioned earlier, there is no one “silver bullet” answer that works best in all cases. Prof Ng is just showing us some of the solutions that have been seen to work well in enough cases that they are worth trying.
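Here is a quick numerical sketch of that variance argument, under the usual simplifying assumption that the inputs to the layer have roughly unit variance (my own illustration): without the scaling, Var(z) grows roughly like n; with Var(w) = 1/n, it stays near 1.

```python
import numpy as np

np.random.seed(0)
m = 10000                                    # number of example inputs

for n in [10, 100, 1000]:
    x = np.random.randn(n, m)                # inputs with variance ~1
    W_unscaled = np.random.randn(1, n)                        # Var(w) = 1
    W_scaled   = np.random.randn(1, n) * np.sqrt(1.0 / n)     # Var(w) = 1/n

    print(f"n={n:5d}  Var(z) unscaled ≈ {np.dot(W_unscaled, x).var():8.2f}  "
          f"scaled ≈ {np.dot(W_scaled, x).var():.2f}")
# Var(z) grows roughly like n without the scaling, but stays near 1 with it.
```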