Weight Initialization for Deep Networks (Matrix W)

In Course 2, in the ‘Setting Up your Optimization Problem’ section, ‘Weight Initialization for Deep Networks’ video, at around 3:30, Prof. Ng says:

So if the input features of activations are roughly mean 0 and standard variance, variance 1, then this would cause z to also take on a similar scale. And this doesn’t solve, but it definitely helps reduce the vanishing / exploding gradients problem, because it’s trying to set each of the weight matrices w, you know, so that it’s not too much bigger than 1 and not too much less than 1 so it doesn’t explode or vanish too quickly.

I had doubts about Wi generated this way (by a Gaussian random distribution with mean 0 and variance 1/n) being ‘not too much less than 1’, so I wrote some Python code to generate this matrix using the mentioned distribution (the code below uses the ReLU variant he mentions, variance 2/n):

import numpy as np

num_features = 6
num_layer1_neurons = 6
# Gaussian with mean 0 and variance 2/num_features, i.e. std = sqrt(2/num_features)
W1 = np.random.normal(loc = 0.0, scale = np.sqrt(2/num_features), size = (num_layer1_neurons, num_features))
print(W1)

After running this code I got the following (you might get different numbers because W1 is drawn randomly, but yours should look similar to mine):

[[1.232 0.280 0.342 -0.689 0.072 0.092]
 [-0.408 -0.613 -0.618 -0.088 -0.186 0.841]
 [-0.090 0.245 -0.184 -0.638 0.618 0.599]
 [0.433 -0.192 -0.523 -0.256 0.273 0.165]
 [-0.594 -0.338 0.186 1.133 0.351 0.206]
 [-1.314 0.185 0.436 -0.354 -0.358 0.558]]

It is obvious that the elements of this matrix are far closer to zero (due to the zero mean) than to 1. Is it really true that it’s ‘not too much bigger than 1 and not too much less than 1 so it doesn’t explode or vanish too quickly’?

I assume ‘too much bigger than 1’ is not symmetric with ‘too much less than 1’: 2 might be too much more than 1, but -1 wouldn’t be too much less than 1, even though the difference between -1 and 1 is bigger than the difference between 1 and 2. Please clarify this case for me. Thanks!

I listened to that section of the lecture again. I agree it’s a little ambiguous what Prof Ng means there. I think when he says “not too much greater than 1 or too much less than 1”, he’s referring to the Z values, not the W values. The point is that if you’re using tanh or sigmoid as your activation, the gradients are healthy in the region where |Z| < 1 and only become vanishingly small for |Z| >> 1.
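
To make that concrete, here is a small sketch of my own (not from the lecture) that evaluates the tanh gradient, 1 - \tanh(z)^2, at a few magnitudes of z. The gradient is close to 1 near z = 0, still reasonable around |z| = 1, and essentially zero by the time |z| reaches 5 or 10:

import numpy as np

# The gradient of tanh(z) is 1 - tanh(z)^2; sigmoid behaves similarly (s * (1 - s)).
for z in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    grad = 1.0 - np.tanh(z) ** 2
    print(f"z = {z:5.1f}  ->  tanh'(z) = {grad:.8f}")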

Of course it looks like in most of the cases he shows us here, the most common hidden layer activation is ReLU, which (of course) has totally different behavior than tanh or sigmoid. He makes a comment around 4:00 about how to modify the initialization formula for tanh vs ReLU. It’s a common question to ask why ReLU works as well as it does, given that it has the “vanishing gradient” problem like a sledgehammer for Z < 0. I don’t know the answer other than to say that you can run the experiment on any given model that you are designing. It looks like the common practice is to start with ReLU, because you can think of it as the “minimalist” activation. If it doesn’t work, then you try Leaky ReLU, which eliminates the zero gradients and is almost as cheap to compute. If that doesn’t work, only then do you graduate to the more expensive functions like tanh, sigmoid, swish, ELU and so forth.
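
As a side-by-side illustration of that last point (my own sketch, not course code), here are the gradients of ReLU and Leaky ReLU on a few sample z values; the 0.01 slope on the negative side of Leaky ReLU is just an assumed choice:

import numpy as np

z = np.array([-3.0, -1.0, -0.1, 0.1, 1.0, 3.0])

relu_grad = (z > 0).astype(float)        # exactly 0 for z < 0, 1 for z > 0
leaky_grad = np.where(z > 0, 1.0, 0.01)  # keeps a small slope on the negative side

print("z          =", z)
print("ReLU grad  =", relu_grad)
print("Leaky grad =", leaky_grad)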


Here is my version of some experimental code to explore what happens here. Note that using 6 features doesn’t really show much because \sqrt{\frac{2}{6}} = \sqrt{\frac{1}{3}} \approx 0.577, so I used 10 features.

import numpy as np

nfeatures = 10
# He-style initialization: entries of W have variance 2/nfeatures
W = np.random.randn(4,nfeatures) * np.sqrt(2/nfeatures)
# Inputs drawn with roughly mean 0 and variance 1
X = np.random.randn(nfeatures, 4)
print("W = " + str(W))
print(f"mean(W) = {np.mean(W)}")
print(f"variance(W) = {np.var(W)}")
print("X = " + str(X))
print(f"mean(X) = {np.mean(X)}")
print(f"variance(X) = {np.var(X)}")
Z = np.dot(W, X)
print("Z = " + str(Z))
print(f"mean(Z) = {np.mean(Z)}")
print(f"variance(Z) = {np.var(Z)}")
print(f"norm(Z) = {np.linalg.norm(Z)}")

I didn’t set the seed, so you can run it multiple times and sample some outputs. Here’s one:

W = [[ 0.19330117 -0.13969598 -0.09184341  0.21729524  0.14779606  0.19497938
  -0.25859968  0.13960351  0.19922055  0.09166032]
 [-0.54188422 -0.04619758  0.06498983  0.93006982  0.06547946 -0.56571506
   0.48386734 -0.4629687  -0.23287257  0.06479251]
 [ 0.14497758 -0.13561573 -0.16443238  0.60654507  0.13360174  0.12859469
  -0.9060832   0.22838903  0.06261081 -0.19325706]
 [-0.1671166  -0.64510047 -0.57014749  0.0355827  -0.52245126  0.48851116
  -0.08472528  0.20287882 -0.43026955 -0.25500086]]
mean(W) = -0.0397307567842581
variance(W) = 0.1300901687189973
X = [[-0.25516547 -1.60973564 -0.94329783  0.22372117]
 [ 0.82113246  0.1510821   0.79061103 -0.03522051]
 [ 1.26375613  1.60345246  1.34454019 -0.71380315]
 [-0.97842341  1.23886804  0.55326244  1.12408354]
 [ 2.5960691   0.30907859  0.03829486  0.10968386]
 [-1.24056123  0.58177334 -0.72387032 -1.04332409]
 [-0.76660172  0.46388207  0.49946812  0.7565807 ]
 [ 1.78096483 -0.2256818  -0.81334032  1.95435391]
 [ 1.06400268  0.74787523 -1.13888748  1.53949277]
 [ 1.83171081  0.80993041  0.28757903 -0.73175253]]
mean(X) = 0.3316396091889521
variance(X) = 1.0180592770403922
Z = [[ 0.47583611  0.02054351 -0.87476924  0.48757506]
 [-1.18030202  2.02013173  2.39067593  0.53225083]
 [ 0.05168247 -0.23154718 -0.98265432  0.71551188]
 [-3.70338564 -1.18925692 -1.26353979 -0.27807952]]
mean(Z) = -0.1880829437669597
variance(Z) = 1.9045979548161187
norm(Z) = 5.571316754308143
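
One more experiment along the same lines, if you want to see the exploding effect directly (again my own sketch, not from the lecture; the 50 layers and 100 units are just assumed values): push a random input through many purely linear layers and watch the norm of the result. With the \sqrt{\frac{1}{n}} scaling the norm stays in a reasonable range, while with unscaled standard-normal weights it blows up after a handful of layers.

import numpy as np

def propagate(num_layers, n, scale):
    # Multiply a random input through num_layers random weight matrices
    # whose entries have the given standard deviation (no activation function).
    a = np.random.randn(n, 1)
    for _ in range(num_layers):
        W = np.random.randn(n, n) * scale
        a = np.dot(W, a)
    return np.linalg.norm(a)

n = 100
print("scaled   (std = sqrt(1/n)):", propagate(50, n, np.sqrt(1.0 / n)))
print("unscaled (std = 1)        :", propagate(50, n, 1.0))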