Hello @jimming , it’s a great question. In short, **neurons differentiate during training because they are different at the beginning – they have different initial parameter values**. In other words, you can make sure they do not differentiate by setting some neuron parameters to the same values **at initialization**. I will show you how at the end.

Let’s see why they can differentiate using a small network: a layer of 2 neurons followed by a layer of 1 neuron, as illustrated in the following graph.

From the left, the input has 2 features, and as you said, a copy of them is sent to each of the 2 neurons in the 1st layer. The internal work of each neuron is shown by the 2 math equations inside it, which you may be familiar with after W1. The outputs a_1, a_2 are fed to the 2nd layer, which does the same internal work and then produces a_3 for calculating the loss.
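For reference, since the figure’s equations may not display for everyone, here is my reconstruction of the forward pass under the weight naming used later in this post (neuron 1 holds w_1, w_2; neuron 2 holds w_3, w_4; the output neuron holds w_5, w_6; g is the activation function):

$$
\begin{aligned}
z_1 &= w_1 x_1 + w_2 x_2, & a_1 &= g(z_1)\\
z_2 &= w_3 x_1 + w_4 x_2, & a_2 &= g(z_2)\\
z_3 &= w_5 a_1 + w_6 a_2, & a_3 &= g(z_3)
\end{aligned}
$$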

Above is the forward phase. Next is the key part - the backward propagation phase, because our question is how the neuron weights (w_1, w_2, …) can change differently, **and such difference is solely determined by the gradients** (\frac{\partial{J}}{\partial{w_1}}, …), because this is the rule for how a weight is updated: w_1 := w_1 - \alpha\frac{\partial{J}}{\partial{w_1}}.

So we can’t avoid looking at \frac{\partial{J}}{\partial{w_1}}. By the chain rule, it can easily be thought of as multiplying a chain of gradients tracing back from J to w_1, and so we have:
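In symbols (following the dependency path J → a_3 → z_3 → a_1 → z_1 → w_1 described next):

$$
\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial a_3}\cdot\frac{\partial a_3}{\partial z_3}\cdot\frac{\partial z_3}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_1}
$$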

OK, how do we read the chain rule for \frac{\partial{J}}{\partial{w_1}}? J depends on a_3, which depends on z_3, which depends on a_1, which … until … which depends on w_1. You can follow this chain in the network graph at the top, from back to front.

Here I specifically calculated \frac{\partial{z_3}}{\partial{a_1}} and \frac{\partial{z_1}}{\partial{w_1}} (which are w_5 and x_1 respectively), because if you compare them with the corresponding terms in the other gradients, they are the reason why the gradients are different! Can you see that?
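Concretely, assuming (as in the graph) z_3 = w_5 a_1 + w_6 a_2 and z_1 = w_1 x_1 + w_2 x_2, these two factors evaluate to

$$
\frac{\partial z_3}{\partial a_1} = w_5, \qquad \frac{\partial z_1}{\partial w_1} = x_1,
$$

while the corresponding factors on the other paths are \frac{\partial z_3}{\partial a_2} = w_6 and \frac{\partial z_1}{\partial w_2} = x_2.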

For example,

\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_3}} are different because w_5 and w_6 are different.

\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_2}} are different because x_1 and x_2 are different.

Since the gradients are different, given that w_? := w_? - \alpha\frac{\partial{J}}{\partial{w_?}}, the weights are updated differently!!

Now, as I promised at the beginning, here is a way to make sure those weights can’t differentiate. As you may have already noticed, you only need to make sure something like w_5 = w_6 holds at initialization. Below is a code snippet for you to achieve that:
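Here is a minimal NumPy sketch of the idea (my own illustration, not the course’s code): a 2-2-1 network without biases, where we set w_1 = w_3, w_2 = w_4 and w_5 = w_6 at initialization, run plain gradient descent, and then check that the two first-layer neurons are still identical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                      # 100 samples, 2 features
y = (x[:, 0] + x[:, 1] > 0).astype(float).reshape(-1, 1)

# Symmetric initialization: row i of W1 holds the weights of neuron i,
# so making the two rows equal means w_1 = w_3 and w_2 = w_4;
# making W2's two entries equal means w_5 = w_6.
W1 = np.full((2, 2), 0.5)
W2 = np.full((2, 1), 0.3)
alpha = 0.1

for _ in range(200):
    # forward phase
    Z1 = x @ W1.T                                  # (100, 2)
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2                                   # (100, 1)
    A2 = sigmoid(Z2)
    # backward phase (gradient of the binary cross-entropy loss)
    dZ2 = (A2 - y) / len(x)
    dW2 = A1.T @ dZ2                               # (2, 1)
    dA1 = dZ2 @ W2.T                               # (100, 2)
    dZ1 = dA1 * A1 * (1 - A1)
    dW1 = dZ1.T @ x                                # (2, 2)
    # update rule: w := w - alpha * dJ/dw
    W1 -= alpha * dW1
    W2 -= alpha * dW2

# The two first-layer neurons received identical gradients at every step,
# so they never differentiate: their weight rows stay identical.
print(np.allclose(W1[0], W1[1]))                   # True
print(np.allclose(W2[0, 0], W2[1, 0]))             # True
```

Every operation applied to the two neurons is identical at every step, so the symmetry is preserved exactly, even though the weights themselves do move away from their initial values.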

Note that w_1 = w_3 and w_2 = w_4 **before and after the training**; given this, the two neurons in the first layer **do not differentiate**!

P.S. I dropped the bias terms in the above discussion to make it simpler, but the idea does not change even when the bias terms are included.