In all the examples, the inputs from layer 0 (the n^{[0]} input features) to the nodes in hidden layer 1 are the same. If, say, we have two inputs and 4 nodes in hidden layer 1, then each node in layer 1 takes the same two inputs. If we run the model for more iterations, will that eventually make all the rows of W^{[1]} the same?
Is the purpose of initializing W randomly, and of using multiple nodes and layers in a neural network, to get the revised values of W in less time?
Is my understanding correct that all the functions in the nodes are doing the same thing but the inputs are different, and hence multiple neurons reduce the training time of the model? If we used simple logistic regression instead of a neural network, would it take much longer to learn the values of W and b?
The weights are randomly initialized so they differ from neuron to neuron!
A neural network combines several logistic regressions (or other units too, like ReLU, tanh, etc.) in parallel and in sequence, not just for speed’s sake but also to be able to fit complex models that a logistic regression cannot accomplish by itself, because the curve of a logistic regression is very limited!
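To picture what “in parallel and in sequence” means, here is a rough numpy sketch (the layer sizes and random seed are just illustrative assumptions, not anything from the course notebooks): each row of W1 behaves like one small unit reading the same inputs, and the output unit then works on all of their results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 5                          # 2 inputs, 4 hidden units, 5 examples

X  = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # each row = one hidden unit
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

A1 = np.tanh(W1 @ X + b1)                      # 4 units "in parallel", all seeing the same X
A2 = sigmoid(W2 @ A1 + b2)                     # one more unit "in sequence": the output layer
print(A2.shape)                                # (1, 5): one prediction per example
```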
Does that mean that in the next classes we will learn to use different activation functions on different nodes to improve the model?
Will it also change the derivative formula?
You will see in the classes!
You’re right about the high-level point: in each layer, every neuron gets all the outputs from the previous layer as inputs. So if we started out with all the same weights, then every neuron would give the same answer and the derivatives would be the same as well. That would mean there would literally be no point in having multiple output neurons.
We need to randomly initialize all the weight values so that they all start out being different and then will continue to learn different things as training takes place. This is called “Symmetry Breaking”. Prof Ng mentions this in the lectures but doesn’t go into a lot of detail. Here’s a thread which discusses it in more detail.
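To make the symmetry-breaking point concrete, here is a small toy experiment (my own illustration with arbitrary sizes, not code from the course notebooks). If every row of W1 starts out identical, the gradient for every row is identical too, so the rows never diverge no matter how many iterations you run; with random initialization the rows differ from the start and keep learning different things.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 2, 4, 10
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

def one_step(W1, b1, W2, b2, lr=0.1):
    # forward pass: tanh hidden layer, sigmoid output
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # backward pass (sigmoid + cross-entropy at the output)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)          # tanh'(Z1) = 1 - tanh(Z1)^2
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

# Symmetric start: every hidden unit is identical
W1, b1 = np.full((n_h, n_x), 0.5), np.zeros((n_h, 1))
W2, b2 = np.full((1, n_h), 0.5), np.zeros((1, 1))
for _ in range(1000):
    W1, b1, W2, b2 = one_step(W1, b1, W2, b2)
print(np.allclose(W1, W1[0]))   # True: all rows of W1 are still identical

# Random start: symmetry is broken from iteration zero
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))
for _ in range(1000):
    W1, b1, W2, b2 = one_step(W1, b1, W2, b2)
print(np.allclose(W1, W1[0]))   # False: each row has learned something different
```

The first print comes out True precisely because of the symmetry you asked about: every row gets the same gradient, so extra hidden units buy you nothing. Random initialization is what makes the second case come out False.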
We’ve discussed this before, haven’t we? Yes, the gradients are the derivatives of the actual functions that are being used in the various layers, so if you change the activation function, then it changes one of the factors in that Chain Rule calculation. Remember we were talking about formulas like this one on that thread yesterday:
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$$
The term $g^{[1]\prime}(Z^{[1]})$ is the derivative of the activation function at layer 1, right? So, yes, if you change the function, it changes the derivative term, which then changes the overall gradients. That’s the reason we write that formula in a general way: we have a choice of activation functions at the hidden layers of the network. In the output layer, we have the advantage that we know what the activation is if it’s a binary classification: it must be sigmoid, so we can just apply the derivative of sigmoid and the cross entropy loss function at the output layer.
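Just to illustrate where that factor shows up in code, here is a tiny sketch (the shapes and random values are made-up placeholders, not anything from the assignments): swapping the hidden activation only swaps the $g^{[1]\prime}$ factor in that formula, while the output layer keeps the simple $dZ^{[2]} = A^{[2]} - Y$ form that comes from sigmoid plus cross-entropy.

```python
import numpy as np

def tanh_prime(Z):
    return 1.0 - np.tanh(Z) ** 2     # derivative of tanh

def relu_prime(Z):
    return (Z > 0).astype(float)     # derivative of ReLU

rng = np.random.default_rng(2)
n_h, m = 4, 5
Z1  = rng.standard_normal((n_h, m))            # pre-activations of layer 1
W2  = rng.standard_normal((1, n_h))
A2  = rng.random((1, m))                       # stand-in output activations
Y   = (rng.random((1, m)) > 0.5).astype(float)

dZ2 = A2 - Y                                   # sigmoid + cross-entropy at the output

# Same general formula, different g^{[1]'} factor depending on the hidden activation:
dZ1_tanh = (W2.T @ dZ2) * tanh_prime(Z1)
dZ1_relu = (W2.T @ dZ2) * relu_prime(Z1)
print(dZ1_tanh.shape, dZ1_relu.shape)          # both (4, 5)
```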
Thanks a lot for the explanation. It’s clear.