How is training finding different weights for ReLU units that are all initialized to 0?

I was playing around with initial weights in C2_W2_Multiclass_TF. I expected that setting all the weights to 0 before training would result in all the units being trained to the same parameters, as explained in DeepLearning.AI's "Initializing Neural Networks": “Initializing all the weights with zeros leads the neurons to learn the same features during training.”

But that didn’t happen! After setting all the weights to 0 and training again, each unit ended up with different parameters, and they were approximately the same parameters that the original training found (before the weights were set to 0).

How did it do that? What is happening?! Please help me if you can, this is driving me completely nuts.

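import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(
    [
        Dense(2, activation='relu',   name="L1"),
        Dense(4, activation='linear', name="L2")
    ]
)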

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# The model won't allow .set_weights until this has been run once
model.fit(
    X_train, y_train,
    epochs=200
)

# Look at the resulting weights
model.get_weights()
[array([[ 1.22,  0.6 ],
        [ 0.92, -1.7 ]], dtype=float32),
 array([1.59, 1.5 ], dtype=float32),
 array([[-2.01, -3.07,  1.3 ,  0.33],
        [-2.83,  1.09, -1.89,  0.69]], dtype=float32),
 array([ 3.18,  0.21, -1.31, -2.61], dtype=float32)]

# Set all the weights to 0 for all units in all layers
model.set_weights([
    np.array([[0.0, 0.0],
              [0.0, 0.0]]),
    np.array([0.0, 0.0]),
    np.array([[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]),
    np.array([0.0, 0.0, 0.0, 0.0])
])

# Re-train
model.fit(
    X_train, y_train,
    epochs=100
)

# Print the new weights
model.get_weights()

# Resulting weights
[array([[ 1.28,  0.45],
        [ 0.71, -1.69]], dtype=float32),
 array([1.48, 1.51], dtype=float32),
 array([[-2.06, -2.33,  1.11,  0.42],
        [-1.9 ,  1.3 , -1.94,  0.59]], dtype=float32),
 array([ 2.47, -0.52, -0.46, -2.14], dtype=float32)]

I think your experiment is just invalid. TF is probably carrying something over that skews the results, e.g. non-zero gradient information left over from the first training run. It is unusual to run the training once, then set the weights, and then train again. A better way to run the experiment is to use the kernel_initializer keyword argument when you define the layers, e.g.

model = Sequential(
    [
        Dense(2, activation='relu',   kernel_initializer='zeros', name="L1"),
        Dense(4, activation='linear', kernel_initializer='zeros', name="L2")
    ]
)

If you do that, then you don’t have to do the disruptive thing of running the training twice. Please give that a try and see if it changes the results. Note that you also need to initialize the bias values to be zeros, but that is the default. If you want to be sure, you can also add

bias_initializer = 'zeros'

on both layers. For more info, here’s the docpage for Dense. Note that you can also break symmetry with zero weights, provided the biases are non-zero and differ between units.
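
On the idea that something is left over from the first run: one plausible mechanism, assuming the optimizer keeps per-parameter running averages the way Adam does, is that model.set_weights replaces only the layer variables and leaves the optimizer's stored averages in place. The sketch below is a plain NumPy re-implementation of the standard Adam update on a single scalar weight, not TensorFlow code. The helper `adam_step` and all the numbers are made up for illustration, and the snippet in the question does not show which optimizer was actually used.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    """One standard Adam update for a scalar weight (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# "First training run": a few steps with non-zero gradients build up m and v.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 6):
    w, m, v = adam_step(w, grad=0.5, m=m, v=v, t=t)

# Reset the weight to zero but keep the optimizer state (assumed to mimic
# calling set_weights on an already-trained model), then apply a zero gradient.
w_after, m, v = adam_step(0.0, grad=0.0, m=m, v=v, t=6)

# Same zero gradient with a fresh optimizer state.
w_fresh, _, _ = adam_step(0.0, grad=0.0, m=0.0, v=0.0, t=1)

print("with leftover state:", w_after)   # moves away from zero
print("with fresh state:   ", w_fresh)   # stays at zero
```

Because leftover averages would differ from parameter to parameter, the first updates after the reset would differ too, which would be enough to break the symmetry between units.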

Here’s a thread from DLS that discusses Symmetry Breaking and why it is not needed in Logistic Regression, but is needed for real Neural Networks.
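
To see the symmetry problem concretely, here is a small NumPy sketch of one forward and backward pass through a network shaped like the one in this thread (2 ReLU units, 4 linear outputs, softmax cross-entropy). The data, the constant 0.5, the output weights, and the `gradients` helper are all made up for illustration. Both hidden units start with identical parameters, and the sketch checks that they also receive identical gradients.

```python
import numpy as np

# Made-up data: 4 samples, 2 features, one sample per class (4 classes).
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.8], [2.0, 1.0]])
y = np.array([0, 1, 2, 3])

def gradients(W1, b1, W2, b2):
    """Gradients of the mean softmax cross-entropy for a 2-layer ReLU network."""
    z1 = X @ W1 + b1                          # hidden pre-activations, shape (4, 2)
    a1 = np.maximum(z1, 0.0)                  # ReLU
    z2 = a1 @ W2 + b2                         # logits, shape (4, 4)
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)      # softmax probabilities
    dz2 = p.copy()
    dz2[np.arange(len(y)), y] -= 1.0          # softmax minus one-hot labels
    dz2 /= len(y)
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)             # backprop through ReLU
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2

# Symmetric start: both hidden units have the same incoming and outgoing weights.
W1 = np.full((2, 2), 0.5)
b1 = np.zeros(2)
W2 = np.tile([0.1, -0.2, 0.3, 0.0], (2, 1))   # two identical rows
b2 = np.zeros(4)

dW1, db1, dW2, db2 = gradients(W1, b1, W2, b2)
print("dW1 (one column per hidden unit):\n", dW1)
print("dW2 (one row per hidden unit):\n", dW2)

symmetric = (np.allclose(dW1[:, 0], dW1[:, 1])
             and np.isclose(db1[0], db1[1])
             and np.allclose(dW2[0], dW2[1]))
print("hidden units receive identical gradients:", symmetric)
```

Since the gradients match, a plain gradient step leaves the two units identical, and the same argument applies again at the next step.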

Thank you Paul! Initializing the model that way did the trick - after training, all the weights were still at zero. The world makes sense!
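
Why the all-zero start stays at zero can be worked out by hand for this architecture. The following is a sketch in my own notation (row-vector input $x$, and $\delta^{[2]}$ for the gradient of the loss with respect to the logits), assuming softmax cross-entropy and a freshly created optimizer with no stored state.

```latex
\begin{aligned}
z^{[1]} &= x W^{[1]} + b^{[1]} = 0, \qquad a^{[1]} = \mathrm{ReLU}(z^{[1]}) = 0 \\
z^{[2]} &= a^{[1]} W^{[2]} + b^{[2]} = b^{[2]} \\
\delta^{[2]} &= \mathrm{softmax}(z^{[2]}) - \mathrm{onehot}(y) \neq 0 \\
\frac{\partial L}{\partial W^{[2]}} &= (a^{[1]})^{\top} \delta^{[2]} = 0, \qquad
\frac{\partial L}{\partial b^{[2]}} = \delta^{[2]} \\
\frac{\partial L}{\partial a^{[1]}} &= \delta^{[2]} (W^{[2]})^{\top} = 0
\;\Rightarrow\;
\frac{\partial L}{\partial W^{[1]}} = 0, \quad \frac{\partial L}{\partial b^{[1]}} = 0
\end{aligned}
```

Only $b^{[2]}$ receives a non-zero gradient, and nothing in the argument depends on $b^{[2]}$ being zero, so it repeats at every step: both kernels and the first-layer biases never leave zero, while the output-layer biases can still move.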

I also read through the DLS thread you linked - a great explanation. I enjoyed the extra information to get my head around how these initializations affect gradient descent.

Whew! It’s great to hear that the experiment now produces the results that we expected. It always feels better when things make sense.

Can’t agree more with you, Paul.
