C2W1 Individual Neurons and Classification

Hi everyone!

I just finished up the first week of course two and was left with a question.

In the video ‘Demand Prediction’, Andrew goes over how each neuron in a hidden layer (using sigmoid activation) may be used to pull insights from different features in the dataset. He goes on to explain that, rather than explicitly choosing which neuron gets which data, we instead pass the entire dataset to each neuron and each neuron will naturally parse out different information using the parameters we give it.

Here’s what I don’t understand -
If each of these neurons is using sigmoid activation and they all have the exact same data, having different parameters won’t change anything. They would always optimize to the same values, wouldn’t they? If multiple identical neurons produced different results with the same data, wouldn’t that mean the model is broken?

In other words, what exactly allows each neuron to resolve differently when given the same data?

I was hoping this would be clarified over the week, but in each section where we built a hidden layer in raw Python, we were made to skip training the model and instead passed parameters from an already trained TensorFlow model…

Thanks for any help and hope you’re having as much fun as me!

8 Likes

Hello @jimming, it’s a great question. In short, neurons differentiate during training because they are different at the beginning: they have different initial parameter values. Conversely, you can make sure they do not differentiate by initializing some of the neurons’ parameters to the same values. I will show you how at the end.

Let’s see why they can differentiate, using a network with a layer of 2 neurons followed by a layer of 1 neuron, as illustrated by the following graph.

From the left, the input has 2 features, and as you said, a copy of them is sent to each of the 2 neurons in the 1st layer. The internal work of each neuron is shown by the two equations inside it, which you may be familiar with after W1. The outputs a_1, a_2 are fed to the 2nd layer, which does the same internal work and produces a_3 for calculating the loss.
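As a rough illustration of that forward phase, here is a minimal pure-Python sketch. The assignment of weights to neurons (w_1, w_2 into the first neuron, w_3, w_4 into the second, w_5, w_6 into the output neuron) is inferred from the rest of this post, and the input and weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1, x2):
    """Forward pass of the 2-2-1 network, with biases dropped."""
    w1, w2, w3, w4, w5, w6 = w
    a1 = sigmoid(w1 * x1 + w2 * x2)  # 1st-layer neuron 1
    a2 = sigmoid(w3 * x1 + w4 * x2)  # 1st-layer neuron 2, same inputs
    a3 = sigmoid(w5 * a1 + w6 * a2)  # 2nd-layer neuron
    return a1, a2, a3

# Hypothetical input and weights, chosen only for illustration.
a1, a2, a3 = forward([0.2, -0.3, -0.1, 0.4, 0.5, -0.6], 1.0, 2.0)
print(a1, a2, a3)
```

Both first-layer neurons see the same x_1 and x_2, but a_1 and a_2 come out different because (w_1, w_2) differs from (w_3, w_4).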

Above is the forward phase. Next is the key: the backward propagation phase. Our question is how the neuron weights (w_1, w_2, …) can change differently, and such differences are solely determined by the gradients (\frac{\partial{J}}{\partial{w_1}}, …), because this is the rule for how a weight is updated: w_1 := w_1 - \alpha\frac{\partial{J}}{\partial{w_1}}.

So we can’t avoid looking at \frac{\partial{J}}{\partial{w_1}}. By the chain rule, it can be thought of as multiplying a chain of gradients traced back from J to w_1, and so we have:
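The equation image is not reproduced here. Judging from the description that follows (five partial derivatives tracing J back to w_1), it presumably reads:

```latex
\frac{\partial J}{\partial w_1}
  = \frac{\partial J}{\partial a_3}
    \cdot \frac{\partial a_3}{\partial z_3}
    \cdot \frac{\partial z_3}{\partial a_1}
    \cdot \frac{\partial a_1}{\partial z_1}
    \cdot \frac{\partial z_1}{\partial w_1},
\qquad
\frac{\partial z_3}{\partial a_1} = w_5,
\quad
\frac{\partial z_1}{\partial w_1} = x_1
```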

OK, how do we read the chain rule for \frac{\partial{J}}{\partial{w_1}}? J depends on a_3, which depends on z_3, which depends on a_1, and so on, until we reach z_1, which depends on w_1. You can follow this chain in the network graph at the top, from back to front.

Here I specifically calculated \frac{\partial{z_3}}{\partial{a_1}} and \frac{\partial{z_1}}{\partial{w_1}} (which are w_5 and x_1 respectively), because when you do the same calculation for the other weights, these two terms are the reason why the gradients are different! Can you see that?

For example,
\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_3}} are different because w_5 and w_6 are different.

\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_2}} are different because x_1 and x_2 are different.

Since the gradients are different, given that w_? := w_? - \alpha\frac{\partial{J}}{\partial{w_?}}, the weights are updated differently!!
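To make this concrete, the sketch below computes the first-layer gradients for a single example with the chain rule and compares them against finite differences. It assumes a binary cross-entropy loss (the post does not say which loss is used), and the weights and the example are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1, x2):
    w1, w2, w3, w4, w5, w6 = w
    a1 = sigmoid(w1 * x1 + w2 * x2)
    a2 = sigmoid(w3 * x1 + w4 * x2)
    a3 = sigmoid(w5 * a1 + w6 * a2)
    return a1, a2, a3

def loss(w, x1, x2, y):
    """Binary cross-entropy on one example (an assumed choice of loss)."""
    _, _, a3 = forward(w, x1, x2)
    return -(y * math.log(a3) + (1 - y) * math.log(1 - a3))

def first_layer_grads(w, x1, x2, y):
    """Chain-rule gradients of the loss with respect to w1..w4."""
    a1, a2, a3 = forward(w, x1, x2)
    dz3 = a3 - y                      # (dJ/da3) * (da3/dz3)
    dz1 = dz3 * w[4] * a1 * (1 - a1)  # path through w5
    dz2 = dz3 * w[5] * a2 * (1 - a2)  # path through w6
    return [dz1 * x1, dz1 * x2, dz2 * x1, dz2 * x2]

def numeric_grad(w, i, x1, x2, y, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    up, down = list(w), list(w)
    up[i] += eps
    down[i] -= eps
    return (loss(up, x1, x2, y) - loss(down, x1, x2, y)) / (2 * eps)

W = [0.2, -0.3, -0.1, 0.4, 0.5, -0.6]  # hypothetical weights
X1, X2, Y = 1.0, 2.0, 1                # hypothetical example
analytic = first_layer_grads(W, X1, X2, Y)
numeric = [numeric_grad(W, i, X1, X2, Y) for i in range(4)]
for name, a, n in zip(["w1", "w2", "w3", "w4"], analytic, numeric):
    print(name, a, n)
```

The four gradients differ from one another: w_1 and w_2 share the path through w_5 but are multiplied by different inputs, while w_1 and w_3 share the input x_1 but go through w_5 and w_6 respectively.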

Now, as I promised at the beginning, here is a way to make sure those weights can’t differentiate. As you may have already noticed, you only need to make sure of something like w_5 = w_6, together with w_1 = w_3 and w_2 = w_4 at initialization. Below is a code snippet for you to achieve that:
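The original snippet is not reproduced here. What follows is a minimal pure-Python sketch of the same idea, not the original code: it assumes a binary cross-entropy loss, a tiny made-up dataset, and batch gradient descent. Only w_1 = w_3 = 0.2 comes from this thread; the other initial values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(w, data, alpha=0.5, epochs=200):
    """Batch gradient descent on the 2-2-1 network (biases dropped).

    w is [w1, w2, w3, w4, w5, w6]; a new list is returned.
    """
    w = list(w)
    for _ in range(epochs):
        grads = [0.0] * 6
        for (x1, x2), y in data:
            # Forward pass.
            a1 = sigmoid(w[0] * x1 + w[1] * x2)
            a2 = sigmoid(w[2] * x1 + w[3] * x2)
            a3 = sigmoid(w[4] * a1 + w[5] * a2)
            # Backward pass (sigmoid output with cross-entropy loss).
            dz3 = a3 - y
            dz1 = dz3 * w[4] * a1 * (1 - a1)
            dz2 = dz3 * w[5] * a2 * (1 - a2)
            per_example = [dz1 * x1, dz1 * x2, dz2 * x1, dz2 * x2,
                           dz3 * a1, dz3 * a2]
            grads = [g + p for g, p in zip(grads, per_example)]
        w = [wi - alpha * g / len(data) for wi, g in zip(w, grads)]
    return w

# Made-up dataset: ((x1, x2), label).
DATA = [((0.0, 1.0), 1), ((1.0, 0.0), 0), ((1.0, 1.0), 1), ((0.5, 0.2), 0)]

# Symmetric start: w1 = w3, w2 = w4, w5 = w6.
symmetric = train([0.2, -0.3, 0.2, -0.3, 0.5, 0.5], DATA)
# Asymmetric start, for contrast.
asymmetric = train([0.2, -0.3, -0.1, 0.4, 0.5, -0.6], DATA)

print("symmetric :", symmetric)
print("asymmetric:", asymmetric)
```

With the symmetric start, both first-layer neurons perform exactly the same arithmetic at every step, so their weights stay identical no matter how long training runs. With the asymmetric start, the two neurons are free to end up with different weights.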


Note that w_1 = w_3 and w_2 = w_4 both before and after training. Given this, the two neurons in the first layer do not differentiate!

P.S. I dropped the bias term in the above discussion to make it simpler, but the idea does not change when the bias term is included.

16 Likes

Thank you so much for taking the time to answer!

6 Likes

You are welcome! Let us know if you have any other questions :slight_smile:

3 Likes

Raymond has given us a great explanation and a concrete example of why “symmetry breaking” is required in the initialization of neural network weights. I haven’t taken MLS yet, so I’m not sure what Prof Ng says in the lectures there and whether he uses the term “symmetry breaking”, but this is also discussed in DLS. There he points out that for Logistic Regression the situation is different: you can actually start with all the weight and bias values being zero (or any other fixed value) and back propagation can still learn. As with everything, it goes back to the math. Here’s a thread from DLS that goes through the math that might be worth a look after understanding Raymond’s great explanation above.
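A small sketch of the Logistic Regression case, assuming a binary cross-entropy loss and a made-up dataset: every parameter starts at zero, yet the very first gradient step already gives the two weights different values, because the gradient for w_j is (a - y) x_j and depends on the feature x_j rather than on another layer's weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, alpha=0.5, epochs=500):
    """Batch gradient descent for logistic regression, zero initialization."""
    w1 = w2 = b = 0.0
    m = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in data:
            dz = sigmoid(w1 * x1 + w2 * x2 + b) - y
            g1 += dz * x1
            g2 += dz * x2
            gb += dz
        w1 -= alpha * g1 / m
        w2 -= alpha * g2 / m
        b -= alpha * gb / m
    return w1, w2, b

# Made-up dataset: ((x1, x2), label).
DATA = [((0.0, 1.0), 1), ((1.0, 0.0), 0), ((1.0, 1.0), 1), ((0.5, 0.2), 0)]

print(train_logistic(DATA, epochs=1))  # weights already differ after one step
print(train_logistic(DATA))            # and keep learning from there
```

In the two-layer network, by contrast, an all-zero start makes the first-layer gradients vanish, since each of them carries a factor of w_5 or w_6.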

3 Likes

@paulinpaloalto Thanks a lot for sharing!!

1 Like

I’m confused where the y would come from for the hidden layer units.

We don’t need any y for the hidden layer units. The cost function requires only the output of the output layer and the corresponding label. The gradients for all of the weights in every layer, including the hidden layers, are calculated from that cost function.

Raymond

1 Like

Could you expand on what the maths are for this process?

The formula for updating a weight is w := w - \alpha\frac{\partial{J}}{\partial{w}}. You may read it as: w is updated by subtracting \alpha\frac{\partial{J}}{\partial{w}} from its current value. Please watch Andrew’s videos in C1 W1 under the section “Train the model with gradient descent” for the idea of gradient descent and the learning rate.

1 Like

I understand gradient descent, the screenshot I posted is from those lectures. I just don’t understand how it works with multiple layers.

1 Like

I see, so you would be asking about “backward propagation”.

For the maths part, the post above has actually done it once. The 1st line in the 2nd photo shows the gradient of a particular weight w_1 in the hidden layer, calculated using the chain rule, which essentially multiplies several terms together. The 1st photo gives the meaning of the symbols used in the 2nd photo. For the 1st line of the 2nd photo I only showed two of the five partial derivatives, but if you like maths and know differentiation, it should be pretty easy for you to derive the other three.
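For reference, here is one way the other three could come out. This is a sketch that assumes sigmoid activations and a binary cross-entropy loss on a single example; the post itself does not state the loss.

```latex
J = -\bigl( y \log a_3 + (1 - y) \log (1 - a_3) \bigr)

\frac{\partial J}{\partial a_3} = -\frac{y}{a_3} + \frac{1 - y}{1 - a_3},
\qquad
\frac{\partial a_3}{\partial z_3} = a_3 (1 - a_3),
\qquad
\frac{\partial a_1}{\partial z_1} = a_1 (1 - a_1)

\frac{\partial J}{\partial w_1}
  = (a_3 - y) \cdot w_5 \cdot a_1 (1 - a_1) \cdot x_1
```

The first two factors multiply out to a_3 - y, which is why the last line is so compact.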

So, if you like maths and know differentiation, I would suggest reading that post. Otherwise, you may read this post by one of the course’s teaching staff, which delivers the idea of backpropagation without too much maths.

Raymond

4 Likes

Sorry for all the confusion. I didn’t understand the way the chain rule was being applied. This video helped me a lot in understanding the maths behind back propagation.
https://youtu.be/tIeHLnjs5U8

It’s great that you have found resources that are best for you, and thank you for sharing it! Doing our own research is helpful!

I think some insights may have been sown in this slide beautifully. In real-world scenarios where training is done over a good enough dataset, the performance (accuracy) of our ML algorithm with a single neuron turns out to be not so good. We could do even better with the power of networks, and as the network becomes larger the predictions tend to get even better.

Please feel free to point out if I am wrong somewhere, as this was my first post!

Hi @Shashank_Garg,

I have the same observations as you. I assume that by “a single neuron” you mean a linear regression or a logistic regression.

Cheers,
Raymond

PS: It’s great if you want to share your findings, but please post them in a relevant thread or open a new one.

Hello - In C2_W1_Lab02, the code of
tf.random.set_seed(1234)
is used with the comment “applied to achieve consistent results”. After researching a little bit, it seems that this code will force the same initial weights, so that the results will be the same every time. Am I understanding that correctly?

Does adding that line of code serve the same purpose as what you have done above where you explicitly set w1 = w3 = 0.2 et cetera (the purpose of getting the same output parameters and output results when you input the same inputs, every time)?

Lastly, is forcing initial weights done in practice (real world), or how is it handled, is there a best practice?

Hello Navead,

In practice, weights are initialized randomly instead of being set explicitly. I set the weights explicitly to demonstrate how to prevent neurons from differentiating.

On the other hand, although weights are initialized randomly, we want to enforce the same set of randomly initialized weights, because it is a necessary condition for reproducing the training result. Enforcing reproducibility is a best practice, and setting a random seed allows us to achieve it.
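The difference between the two ideas can be sketched with Python's standard `random` module. This is an analogy for what a seed does, not the TensorFlow call from the lab; `init_weights` is a hypothetical helper and the range of values is arbitrary.

```python
import random

def init_weights(seed, n=6):
    """Draw n initial weights from a generator seeded explicitly."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

run_a = init_weights(1234)
run_b = init_weights(1234)

print(run_a == run_b)   # the same seed reproduces the same initial weights
print(len(set(run_a)))  # number of distinct values among the 6 weights
```

A fixed seed makes every run start from the same weights, but within one run the weights are still different from each other, so the symmetry between neurons is still broken. That is a different thing from setting w_1 = w_3 by hand.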

Cheers,
Raymond

Hello @jimming. Each neuron learns a different weight for each input feature. For example, the weight in the first neuron for affordability should be much larger than its weights for the other features.