Hello @jimming, great question. In short, the neurons differentiate during training because they are different at the beginning: they start with different initial parameter values. Conversely, you can make sure they do not differentiate by setting some of their initial parameters to be the same. I will show you how at the end.
Let's see why they can differentiate using a small network: a layer of 2 neurons followed by a layer of 1 neuron, as illustrated by the following graph.
From the left, the input has 2 features and, as you said, a copy of them is sent to each of the 2 neurons in the 1st layer. The internal work of each neuron is shown by the 2 math equations inside it, which you should be familiar with after W1. The outputs a_1 and a_2 are fed to the 2nd layer, which does the same internal work and produces a_3 for calculating the loss.
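In symbols, the forward pass is (writing g for the activation function; the weight naming w_1, …, w_6 follows how the weights are referenced later in this post):

z_1 = w_1 x_1 + w_2 x_2, \quad a_1 = g(z_1)

z_2 = w_3 x_1 + w_4 x_2, \quad a_2 = g(z_2)

z_3 = w_5 a_1 + w_6 a_2, \quad a_3 = g(z_3)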
That is the forward phase. Next comes the key part, the backward propagation phase, because our question is how the neuron weights (w_1, w_2, …) can change differently, and that difference is solely determined by the gradients (\frac{\partial{J}}{\partial{w_1}}, …), since this is the rule for how a weight is updated: w_1 := w_1 - \alpha\frac{\partial{J}}{\partial{w_1}}.
So we cannot avoid looking at \frac{\partial{J}}{\partial{w_1}}. By the chain rule, it can be thought of as multiplying a chain of gradients traced back from J to w_1, and so we have:

\frac{\partial{J}}{\partial{w_1}} = \frac{\partial{J}}{\partial{a_3}} \cdot \frac{\partial{a_3}}{\partial{z_3}} \cdot \frac{\partial{z_3}}{\partial{a_1}} \cdot \frac{\partial{a_1}}{\partial{z_1}} \cdot \frac{\partial{z_1}}{\partial{w_1}}
OK, how do we read the chain rule for \frac{\partial{J}}{\partial{w_1}}? J depends on a_3, which depends on z_3, which depends on a_1, which depends on z_1, which depends on w_1. You can follow this chain in the network graph at the top, from back to front.
Here I specifically calculated \frac{\partial{z_3}}{\partial{a_1}} and \frac{\partial{z_1}}{\partial{w_1}} (which are w_5 and x_1 respectively), because if you do the same calculation for the other gradients, these factors are the reason why the gradients come out different! Can you see that?
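Concretely, using the forward-pass equations written out above:

\frac{\partial{z_3}}{\partial{a_1}} = \frac{\partial}{\partial{a_1}}(w_5 a_1 + w_6 a_2) = w_5, \qquad \frac{\partial{z_1}}{\partial{w_1}} = \frac{\partial}{\partial{w_1}}(w_1 x_1 + w_2 x_2) = x_1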
For example,
\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_3}} are different because w_5 and w_6 are different.
\frac{\partial{J}}{\partial{w_1}} and \frac{\partial{J}}{\partial{w_2}} are different because x_1 and x_2 are different.
Since the gradients are different, given that w_? := w_? - \alpha\frac{\partial{J}}{\partial{w_?}}, the weights are updated differently!!
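To make those two comparisons concrete, here is a small NumPy sketch of one backward pass on this 2-2-1 network. The sigmoid activation and the squared-error loss J = \frac{1}{2}(a_3 - y)^2 are my own choices for illustration (the argument does not depend on them), and I start the two first-layer neurons with identical weights so that the only remaining differences are exactly the w_5 vs w_6 and x_1 vs x_2 factors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training example; loss J = 0.5 * (a3 - y)^2 (an illustrative choice).
x1, x2, y = 1.0, 2.0, 1.0
alpha = 0.1

# Layer-1 neurons start identical (w1 = w3, w2 = w4), but w5 != w6.
w1, w2 = 0.1, 0.2
w3, w4 = 0.1, 0.2
w5, w6 = 0.5, -0.5

# Forward phase (no bias, as in the post).
z1 = w1 * x1 + w2 * x2; a1 = sigmoid(z1)
z2 = w3 * x1 + w4 * x2; a2 = sigmoid(z2)
z3 = w5 * a1 + w6 * a2; a3 = sigmoid(z3)

# Backward phase: the chain rule written out term by term.
dJ_dz3 = (a3 - y) * a3 * (1.0 - a3)            # dJ/da3 * da3/dz3
dJ_dw1 = dJ_dz3 * w5 * a1 * (1.0 - a1) * x1    # ... * dz3/da1 * da1/dz1 * dz1/dw1
dJ_dw2 = dJ_dz3 * w5 * a1 * (1.0 - a1) * x2    # same path, but dz1/dw2 = x2
dJ_dw3 = dJ_dz3 * w6 * a2 * (1.0 - a2) * x1    # goes through neuron 2, so w6 appears

print(dJ_dw1, dJ_dw3)  # differ, and only because w5 != w6 (here a1 == a2)
print(dJ_dw1, dJ_dw2)  # differ, and only because x1 != x2

# The update rule then moves each weight by its own gradient,
# so w1 and w3 are no longer equal after this one step.
w1, w3 = w1 - alpha * dJ_dw1, w3 - alpha * dJ_dw3
```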
Now, as I promised at the beginning, here is a way to make sure those weights can't differentiate. As you may have already noticed, you only need to make the relevant initial weights equal, for example w_5 = w_6 (and, in the first layer, w_1 = w_3 and w_2 = w_4). Below is a code snippet for you to achieve that:
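This is a minimal sketch in TensorFlow/Keras (assuming that is the framework you are using; the constants 0.3, 0.7 and 0.5 are arbitrary placeholders, only their equality matters):

```python
import numpy as np
import tensorflow as tf

# Two neurons in the 1st layer, one in the 2nd; no bias, matching the discussion above.
layer_1 = tf.keras.layers.Dense(2, activation='sigmoid', use_bias=False)
layer_2 = tf.keras.layers.Dense(1, activation='sigmoid', use_bias=False)
model = tf.keras.Sequential([tf.keras.Input(shape=(2,)), layer_1, layer_2])

# Layer 1's kernel has shape (2, 2); column j holds neuron j's weights,
# i.e. [[w_1, w_3], [w_2, w_4]]. Equal columns give w_1 = w_3 and w_2 = w_4.
layer_1.set_weights([np.array([[0.3, 0.3],
                               [0.7, 0.7]], dtype=np.float32)])

# Layer 2's kernel has shape (2, 1): [[w_5], [w_6]]. Equal entries give w_5 = w_6.
layer_2.set_weights([np.array([[0.5],
                               [0.5]], dtype=np.float32)])
```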
Note that w_1 = w_3 and w_2 = w_4 both before and after the training; given this, the two neurons in the first layer do not differentiate!
P.S. I dropped the bias terms in the above discussion to keep it simple, but the idea does not change when the bias terms are included.