Weight Initialization

I have a question regarding identical, constant weight initialization.
I have read in the notes on this website, as well as in other resources, that if a network is initialized with identical constant weights, then every neuron learns the same weights.
However, when I did this in TensorFlow, I got different weights for each neuron after training.
Here’s my code:
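    import tensorflow as tf
    # OUT_STEPS and num_features are constants defined elsewhere in my script.
    multi_linear_model = tf.keras.Sequential([
        # Keep only the last time step: shape (batch, 1, features)
        tf.keras.layers.Lambda(lambda x: x[:, -1:, :]),
        # Single Dense layer with kernel and bias both initialized to zero
        tf.keras.layers.Dense(OUT_STEPS * num_features,
                              kernel_initializer=tf.initializers.zeros(),
                              bias_initializer=tf.initializers.zeros()),
        # Reshape to (batch, OUT_STEPS, num_features)
        tf.keras.layers.Reshape([OUT_STEPS, num_features])
    ])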

I am basically working with time series data. The first Lambda layer just extracts the last time step. The second layer is a Dense layer whose number of neurons is defined by some constants.

The important part is that I have set the weights and biases to zero, but after training I am getting different weights for each neuron in this layer.

Can someone explain why??Pleas…

TensorFlow automatically initializes the weights to small random values as soon as you create the layer objects.

I have passed zero initializers for the weights and biases. After creating the model, I also checked that the weights and biases are zero.
But after training, I am getting different weights for each neuron. I can give more details if needed.

See my previous reply. When you run the code and it actually creates the layer objects, it randomly initializes the weights.

Are you saying that the kernel_initializer argument is ineffective when defining a Dense layer?
I printed weight values after the creation of the sequential model.
They were zero.

Interesting. I did not study your code in detail. It has some methods that are unusual.

Perhaps we should wait for another mentor to contribute.

I think the point here is that your “network” is not really a neural network: it is the equivalent of Logistic Regression. Or Linear Regression, since it looks like you are also not including an output activation from what we can see there. It is simply one Dense layer. Here’s an article which demonstrates why zero initialization does not prevent Logistic Regression from learning a valid solution. In DLS Course 2 Week 2, Prof Ng makes the point that you can look at Logistic Regression as a “trivial” neural network, but the fact that it has only a single layer changes the math.
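As a rough illustration of the single-layer case, here is a NumPy sketch (not the Keras model from the question; the toy data, learning rate and step count are all made up for the example). With one linear layer and mean squared error, the gradient for each output column depends on that column's own target, so zero-initialized columns move apart on the very first step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 64 samples, 3 input features, 2 output targets.
X = rng.normal(size=(64, 3))
W_true = np.array([[1.0, 0.3],
                   [-2.0, 0.8],
                   [0.5, -1.5]])
Y = X @ W_true

# One Dense layer without activation, kernel and bias initialized to zero.
W = np.zeros((3, 2))
b = np.zeros(2)

lr = 0.1
for _ in range(200):
    err = X @ W + b - Y             # prediction error, shape (64, 2)
    grad_W = X.T @ err / len(X)     # each column is driven by its own target column
    grad_b = err.mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

print(W)  # one column per output neuron; compare the two columns
```

Nothing ties the output neurons to each other here: each one is an independent regression problem that happens to start from zero.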

Once you go to multiple cascaded Dense layers, that will no longer be true and Symmetry Breaking will be required.

You could easily test my theory by adding a second Dense layer also with zero initializations and see what happens. That may or may not be what you want for your actual solution, but my bet is that it would clear up the theoretical point you are making here.
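The suggested experiment can also be sketched in plain NumPy. This is an illustrative toy under stated assumptions (a made-up dataset, 4 tanh hidden units, and every weight set to the same constant 0.5), not the original poster's model. The thing to look at is whether the hidden units ever become different from one another.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 64 samples, 3 input features, 2 output targets.
X = rng.normal(size=(64, 3))
Y = np.stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]], axis=1)

# Hidden layer (4 tanh units) and output layer, all weights set to one constant.
W1 = np.full((3, 4), 0.5)
b1 = np.zeros(4)
W2 = np.full((4, 2), 0.5)
b2 = np.zeros(2)

lr = 0.1
for _ in range(200):
    # Forward pass
    H = np.tanh(X @ W1 + b1)
    out = H @ W2 + b2
    # Backward pass for mean squared error (gradients taken before any update)
    d_out = (out - Y) / len(X)
    d_H = d_out @ W2.T
    d_pre = d_H * (1.0 - H ** 2)
    W2 -= lr * (H.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_pre)
    b1 -= lr * d_pre.sum(axis=0)

# Compare every hidden unit's incoming weights with those of unit 0.
print(np.allclose(W1, W1[:, :1]))
print(W1)
```

The weights do change during training, but every hidden unit receives the same gradient at every step, so the columns of W1 should stay equal to each other. That is the symmetry random initialization is meant to break.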

sure

Thanks a lot.
I experimented by adding an additional layer and trying different constant values for initialization. It worked as expected. I also managed to get the theory of regression down, and why it won’t face this symmetry issue. The only thing left is understanding the theoretical reasoning for a multi-layer perceptron.
I wasn’t aware that a single layer doesn’t face this issue.
Thanks again…

Thanks for doing the further investigations and sharing your results.

Taking DLS Course 1 would be a good way to learn about Multi-layer Perceptrons, although Prof Ng uses the more modern terminology and calls them Fully Connected Neural Networks. 🤓

Sure. I am planning on revisiting some of the concepts taught in the course. If I had to summarize this issue, it would be that there’s no backpropagation happening when there’s only a single layer. Correct me if I am wrong.

Nevertheless, I am grateful that these forums with amazing people exist.
May God guide and bless you abundantly.

Gradients still get computed and applied. It’s just that you only have three functions involved: the linear function, the sigmoid activation and the cross entropy loss. But it’s still “back propagation” even with one layer: it goes “backwards” from the loss to the activation to the linear coefficients. There is nothing to update (no “parameters”) in the sigmoid and loss functions, so we only update the w and b from the linear function.
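To make that chain concrete, here is a small NumPy sketch of the backward pass for logistic regression, checked against a finite-difference estimate. The data, starting weights and step size eps are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 32 samples, 3 features, binary labels.
X = rng.normal(size=(32, 3))
y = (rng.random(32) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1
n = len(y)

def loss(w, b):
    """Mean cross entropy of sigmoid(X @ w + b) against y."""
    a = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Forward pass: linear function, then sigmoid.
z = X @ w + b
a = 1.0 / (1.0 + np.exp(-z))

# Backward pass: loss -> activation -> linear coefficients.
dL_da = (-(y / a) + (1 - y) / (1 - a)) / n   # derivative of the loss w.r.t. a
da_dz = a * (1 - a)                          # derivative of the sigmoid
dL_dz = dL_da * da_dz                        # simplifies to (a - y) / n
dw = X.T @ dL_dz                             # gradient for w
db = dL_dz.sum()                             # gradient for b

# Central finite differences as an independent check on dw.
eps = 1e-6
dw_num = np.array([(loss(w + eps * e, b) - loss(w - eps * e, b)) / (2 * eps)
                   for e in np.eye(3)])
print(np.max(np.abs(dw - dw_num)))
```

Only w and b receive an update; the sigmoid and the loss contribute factors to the chain but have no parameters of their own.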

There’s no learning without applying gradients. Well, if you’re doing Linear Regression, then there actually is a “closed form” solution called the Normal Equation. But you’ll see Gradient Descent used in Linear Regression in cases with relatively high dimensions because the computational complexity of the Normal Equation is higher than Gradient Descent, so there can be cases in which GD is a cheaper way to get your solution. Once you graduate to Logistic Regression and multi-layer networks, there is no longer a closed form solution and you’ve got no choice other than some form of Gradient Descent basically. There are other iterative approximation methods like Newton’s Method, but they have a similar flavor (using derivatives to push the solution in a better direction repetitively).
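For the Linear Regression case, the two routes can be compared side by side. The following is a minimal NumPy sketch on made-up data (the coefficients, noise level, learning rate and iteration count are assumptions chosen so that gradient descent converges quickly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, a linear target plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias is learned as a fourth coefficient.
Xb = np.hstack([X, np.ones((200, 1))])

# Closed form: solve the Normal Equation (Xb^T Xb) theta = Xb^T y.
theta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Iterative: plain gradient descent on the mean squared error, starting from zero.
theta_gd = np.zeros(4)
lr = 0.1
for _ in range(2000):
    grad = Xb.T @ (Xb @ theta_gd - y) / len(y)
    theta_gd -= lr * grad

print(theta_ne)
print(theta_gd)
```

On a problem this small the closed form is the obvious choice; the trade-off described above only shows up when the number of features makes forming and solving the Normal Equation expensive.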

You are correct in one way. “backpropagation” usually refers to how the gradients are computed in the hidden layer of a neural network. So since you have no hidden layer, backpropagation isn’t used.

With a simple linear or logistic regression, you still have to compute the gradients in order to find the weights that give the minimum cost. TensorFlow does this for you automatically.
