What makes different neurons calculate different parameters within a layer?

From what I understood in one of the lectures of MLS Course 2, Week 1, the neurons in a layer are like logistic regression units.

Since each such logistic regression unit gets the same feature vector (the input layer) as input, why do they output different activations? What makes them do that?


The units in a hidden layer have the same input features, but they each use different weights. So they each give different outputs to the following layer.


Hi @Rohit_Pandey1

welcome to the community and thanks for your question!

Here you can find an illustration that might be relevant for you:

You can see that each neuron has individual weights and a bias that are learned during the training process. These parameters are different for each neuron, and (via the activation function) they also lead to a different output for each neuron.
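To make this concrete, here is a minimal NumPy sketch (the weight and bias values are made up for illustration): three neurons receive the same input x, but because their weights and biases differ, their activations differ:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])          # the same input features go to every neuron

# each row holds one neuron's weights; the values are arbitrary for illustration
W = np.array([[ 0.5, -0.3],
              [ 0.1,  0.8],
              [-0.6,  0.2]])
b = np.array([0.0, 0.1, -0.2])    # one bias per neuron

a = sigmoid(W @ x + b)            # three different activations from one input
print(a)                          # approx. [0.475 0.858 0.401]
```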

Best regards
Christian


Hi @Rohit_Pandey1

Each neuron in a layer receives the same input feature vector, but the neurons are different because they have different weights and biases. The weights and biases are learned during the training process and are different for each neuron.

When the input feature vector is multiplied by the weights and the neuron's bias is added, the result is a scalar value, usually written z. The activation function is then applied to z to produce the final output of the neuron, called the activation.

The activation function is a non-linear function applied element-wise to the layer's linear output z, and it's used to introduce non-linearity into the network. Common activation functions include sigmoid, ReLU, and tanh. The choice of activation function depends on the task and the architecture of the network.

For example, the sigmoid activation function squashes the input values into the range (0, 1), making it useful for binary classification tasks. On the other hand, the ReLU (Rectified Linear Unit) activation function sets negative values to zero, which is computationally cheap and helps avoid vanishing gradients in deep networks.
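As a small illustration (a sketch with made-up values, not course code), here is a single neuron's computation with these two activation functions:

```python
import numpy as np

def sigmoid(z):
    # squashes any real z into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def relu(z):
    # sets negative values to zero, keeps positive values unchanged
    return np.maximum(0, z)

x = np.array([0.5, -1.0, 2.0])   # input feature vector (values made up)
w = np.array([0.2,  0.4, -0.1])  # this neuron's weights
b = 0.05                         # this neuron's bias

z = np.dot(w, x) + b             # scalar pre-activation, here z = -0.45
print(sigmoid(z), relu(z))       # approx. 0.389 and 0.0
```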

Since each neuron has its own set of weights and biases, and the activation function is applied to the output of the previous layer, each neuron will produce a different output, even though they receive the same input feature vector. This allows the network to learn different features from the input data, which helps to improve its ability to generalize to new data.

Regards
Muhammad John Abbas


Thanks @Christian_Simonis .

I am still not sure I completely understand how different neurons of the same layer learn different weights and biases during training, but I guess it has something to do with the neurons in the next layer? Back-propagation?

For instance, here the professor mentioned that it was possible to find out what exactly each neuron was recognizing in the input image.

I guess I will go ahead with the course for now and come back to this later because it looks like it involves some advanced mathematics.

Thanks for your patience.

Regards,
Rohit


Hi @Rohit_Pandey1,

I guess it's fair to go ahead as you suggested. I would still recommend thinking and reflecting a bit about the initialization of parameters. Possible questions you might want to give some thought:

  • what happens if all initialized parameters were the same?
  • what if they are too large? (see the sketch after this list)
  • what if they are too small?
  • strategies for proper initialization
  • …
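For the "too large" and "too small" questions, here is a toy NumPy sketch (all sizes, scales, and the seed are made up for illustration): with a sigmoid activation, very large weights saturate the neuron, so its local gradient a·(1−a) vanishes and learning stalls:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=100)                       # toy input vector, 100 features

for label, scale in [("all zeros", 0.0), ("too large", 10.0), ("small random", 0.01)]:
    w = rng.normal(size=100) * scale           # scale 0.0 means every weight is 0
    a = sigmoid(w @ x)                         # this neuron's activation
    # sigmoid's local gradient is a * (1 - a); it vanishes when a saturates to 0 or 1
    print(f"{label:12s} a = {a:.4f}  a*(1-a) = {a * (1 - a):.4f}")
```

Note that the all-zeros case looks harmless here (the gradient is 0.25), but its real problem is symmetry: every neuron in the layer would start, and stay, identical, as discussed further below in this thread.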

On this note, you can find a lot of useful answers here and also practice a little with toy examples: Initializing neural networks - deeplearning.ai

Please let us know if something is unclear or if you need support, @Rohit_Pandey1!

Best regards
Christian


You could also apply a heatmap approach to interpret the network visually and see in which areas the layers were activated and contributed to the output, see also: Get Heatmap from CNN ( Convolution Neural Network ), AKA CAM | by Seachaos | tree.rocks

Also: in Figure 1 you can find a nice illustrative visualization of how low-level features like edges are hierarchically combined and enhanced to describe more advanced patterns and finally form objects: see the following Source.

See also this Thread.

Hope that helps, @Rohit_Pandey1!

Best regards
Christian


Thank you for the resources!


Sure! You are welcome. If you have further questions, please do not hesitate to ask 🙂

Happy learning!

Best regards
Christian


Have you found the answer? I have been searching the web for one for the last two days.

What is your question?

Each neuron in the first layer represents a logistic regression model g(z), where z is a linear function of the inputs. The input X is applied to all neurons, and we are using gradient descent as the learning algorithm. This model has a cost function with a global minimum, so no matter what the initial values of the parameters are, gradient descent will always converge to that global minimum. Since all the neurons represent the same model and the same learning algorithm is applied to all of them, they should all end up with the same parameters, which would cause all of them to make the same prediction.

Based on all of that, how do neurons in the same layer get different values for the same parameter? I mean, w0 should be the same for all neurons in the same layer, and w1 should also be the same for all neurons in the same layer (although of course it could be different from w0).

All of the unit weights are initialized to different random values. This breaks the symmetry and allows each unit to learn a different weight value.

But if we are using a model with a convex cost function, then by applying gradient descent the parameter values will converge to the global minimum. So w0 will be the same for all the neurons in the first layer, no matter what their initial values are.

Your conclusion is not supported by your premise.

Can you please highlight where my mistake is? I know there is something wrong with what I am saying, because otherwise neural networks would never have worked, but I don't know what it is.

That is simply not a true statement, as Tom pointed out. Every neuron starts out with different weights. Each neuron has a weight for each element of the input, and they are different for each input and for each neuron. So the gradients will be different, and the weights will all stay different. If they start different, what would force them to become the same?

Also note that the cost function for a neural network is not convex, but I don’t think convexity is relevant to the point here. The values are different, which makes the gradients different, so the weight values continue to be different.
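To see this numerically, here is a small sketch (the network sizes, values, and the log-loss cost are made up for illustration) of a 2-neuron hidden layer feeding one output neuron. If the hidden neurons start with identical weights (and the output weights are identical too), their gradients are identical, so gradient descent keeps them identical; if they start different, their gradients differ:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])                  # one sample, 2 features (made up)
y = 1.0                                   # its label

def hidden_gradients(W1, w2):
    a1 = sigmoid(W1 @ x)                  # 2 hidden activations
    a2 = sigmoid(w2 @ a1)                 # scalar output
    dz2 = a2 - y                          # from log-loss: dCost/dz2 = a2 - y
    dz1 = dz2 * w2 * a1 * (1 - a1)        # backprop through the hidden sigmoid
    return np.outer(dz1, x)               # dCost/dW1, one row per hidden neuron

# symmetric start: identical rows get identical gradients, so they stay identical
print(hidden_gradients(np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([0.3, 0.3])))
# random start: the rows differ, so the neurons evolve differently
print(hidden_gradients(np.array([[0.5, -0.2], [0.1, 0.3]]), np.array([0.3, -0.4])))
```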


But how are we calculating these weights? If we have 3 features x0, x1, and x2, this means we have 4 parameters: w0, w1, w2, and b. These parameters are calculated for each neuron using gradient descent, right? (There might be other algorithms, but I am talking in the context of Course 2.)
I want to ignore the presence of a neural network for a minute and think of it as 3 separate neurons, so we are actually training three separate models, right? What am I missing here?

Well, it’s just a question of how you look at it or how you define it. Each neuron gets all the inputs and produces its own output. If you want to call that 3 models or 1 model, I guess it’s up to you, but Prof Ng calls it one model.

If we have 3 features and we have 3 neurons, then we have a total of 12 weight and bias values, right? The weights are a 3 x 3 matrix and the bias values are a 3 x 1 column vector. Now I should say that I don’t know the MLS course material, only the DLS course material, which presents a more advanced version of the same material. I’m not sure how Prof Ng represents the data in MLS for this case, but evaluating the output for this layer is the following operation:

W \cdot X + b

Where X is either a 3 x 1 column vector representing one sample or it’s a 3 x m matrix representing m input samples. Then W is a 3 x 3 matrix and b is a 3 x 1 column vector. The output will then be 3 x 1 if X is one sample or 3 x m where X is m samples.

Here again I don't know whether Prof Ng uses math notation with indexes starting at 1 or Python notation with 0-based indexing. You've used 0-based for X, so let's go with that. Here's what W looks like:

W = \begin{bmatrix} w_{0,0} & w_{0,1} & w_{0,2} \\ w_{1,0} & w_{1,1} & w_{1,2} \\ w_{2,0} & w_{2,1} & w_{2,2} \end{bmatrix}

Let’s go with just x as one sample, so it’s

x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}

The output will be 3 x 1 and the first element of it is:

z_0 = \displaystyle \sum_{i = 0}^{2} w_{0,i} \cdot x_i + b_0

And so forth for z_1 and z_2. So if we start with all the w_{i,j} values randomly chosen and we run gradient descent, they will all stay different, at least in principle. You can't prove that some of them couldn't possibly end up the same, but there is no reason that would drive them to be the same. Back propagation is driven by the cost function comparing the predictions with the labels, right?

Also note that I've only shown the "linear" portion of the calculation at the first layer of the network. We then apply a non-linear activation function and then feed that output as the input to the second layer of the network, which will have the same structure, but not necessarily the same number of output neurons.
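Here is what that computation looks like in NumPy (a sketch with made-up layer sizes and random values, using sigmoid as the activation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)

W1 = rng.normal(size=(3, 3)) * 0.1   # layer 1: 3 neurons, 3 input features
b1 = np.zeros((3, 1))
W2 = rng.normal(size=(2, 3)) * 0.1   # layer 2: 2 neurons taking 3 inputs
b2 = np.zeros((2, 1))

X = rng.normal(size=(3, 5))          # m = 5 samples as columns

Z1 = W1 @ X + b1                     # linear part of layer 1: shape (3, 5)
A1 = sigmoid(Z1)                     # non-linear activation
Z2 = W2 @ A1 + b2                    # fed into the second layer: shape (2, 5)
A2 = sigmoid(Z2)
print(A2.shape)                      # (2, 5): 2 outputs per sample
```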


Of course also note that we don’t really “calculate” the weights: we randomly initialize them for “symmetry breaking”, precisely so that the neurons don’t all end up the same. Then we run forward propagation and see how good the predictions are. If they are not good, which they certainly won’t be with random weight values, then we use the gradients of the cost to incrementally push them (and the bias values) in the direction of a lower cost.

Rinse and repeat until the predictions of the model are good enough to satisfy our requirements. Or we find that we never can get to that “good enough” point, so we need to step back and readjust the hyperparameters, meaning the choices we made about how many layers and how many neurons and which activation functions and the learning rate and all the other choices that we made. It’s not always obvious what the right network is to solve a given problem and coming up with a good solution may require some experimentation.
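As a minimal end-to-end sketch of that loop (toy data, a single logistic-regression-style unit, and hyperparameters that are all made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))                  # toy data: 2 features, 50 samples
y = (X[0] + X[1] > 0).astype(float)           # made-up labels

W = rng.normal(size=(1, 2)) * 0.01            # random init: symmetry breaking
b = np.zeros((1, 1))
lr = 0.1                                      # learning rate (a hyperparameter)

for step in range(500):                       # rinse and repeat
    A = sigmoid(W @ X + b)                    # forward propagation
    cost = -np.mean(y * np.log(A) + (1 - y) * np.log(1 - A))
    dZ = A - y                                # gradient of the cost w.r.t. z
    W -= lr * (dZ @ X.T) / X.shape[1]         # push weights toward lower cost
    b -= lr * np.mean(dZ, keepdims=True)

print(cost)                                   # much lower than at the start
```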
