There’s something really bugging me while learning the Week 2 material. Maybe Andrew touched on this and I missed it, but I haven’t seen any explanation that could answer this yet.

Take the basic neural network example used in the material - 3 layers (25 units (sigmoid) | 15 units (sigmoid) | 1 output unit (sigmoid)).

If the same input matrix X is being passed into each of the 25 units in the first layer, given they’re all using the same activation function (sigmoid), wouldn’t they all evaluate the same output matrix?

I don’t understand how each neuron can represent different “hidden features” if they’re all initialized with the same weights, accept the same inputs and have the same activation function.

I’m assuming that one of these three conditions doesn’t actually hold - either the neurons don’t all accept the same inputs but only some subset of the input features, or they’re initialized with different weights (although I’d still expect them to converge to the same optimal weights eventually anyway).

How do neurons within a given layer end up with different weights if they are all accepting the same input and applying the same activation function? Is it something this mysterious back-propagation algorithm is enforcing?
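Your intuition about the symmetric case is right, by the way - if all the units really did share the same weights, they would all compute the same output. Here is a minimal NumPy sketch of just the first layer (toy shapes and made-up weight values, not the course's actual code) showing that:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 examples, 3 input features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Give every unit the SAME weight vector and bias.
W = np.full((3, 25), 0.5)            # 3 inputs -> 25 units, all weights identical
b = np.zeros(25)

A = sigmoid(X @ W + b)               # activations of the 25-unit layer

# Every column (i.e. every unit) produces exactly the same output
# for every example - the units are indistinguishable.
print(np.allclose(A, A[:, [0]]))     # True
```

And since the outputs are identical, the gradients flowing back to each unit are identical too, so they would stay locked together during training.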

All the weights are randomly initialized, so they start out with different values, which puts them on different learning paths.

End result: they do not converge to the same values, as you assumed they would. Rather, they end up with different values, and in the process they will have learned different features.

This random initialization of the weights is what breaks their symmetry. Check out the notebook in this article, and you can see for yourself.
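You can also see the symmetry breaking in a few lines of NumPy. This is just a sketch - a hypothetical 2-unit hidden layer trained on a toy XOR-style dataset with plain gradient descent on squared error, not the course's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny XOR-style dataset: 2 features, 1 target.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

def train(W1, W2, steps=500, lr=1.0):
    """A few steps of plain gradient descent on squared error."""
    for _ in range(steps):
        A1 = sigmoid(X @ W1)                  # hidden layer (2 units)
        A2 = sigmoid(A1 @ W2)                 # output unit
        dZ2 = (A2 - y) * A2 * (1 - A2)        # backprop through the output
        dW2 = A1.T @ dZ2
        dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)    # backprop through the hidden layer
        dW1 = X.T @ dZ1
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1

# Symmetric start: both hidden units share identical weights...
W1_sym = train(np.full((2, 2), 0.3), np.full((2, 1), 0.3))
# ...and they stay identical no matter how long we train.
print(np.allclose(W1_sym[:, 0], W1_sym[:, 1]))   # True

# Random start: the units follow different paths and end up different.
W1_rnd = train(rng.normal(size=(2, 2)), rng.normal(size=(2, 1)))
print(np.allclose(W1_rnd[:, 0], W1_rnd[:, 1]))   # False
```

With the symmetric start, both units receive identical gradients at every step, so gradient descent can never pull them apart; random initialization is what gives them different paths to follow.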

It took a while to understand why different initial values would result in different final weights. The key seems to be that the gradient for each weight depends on its current value. As such, if each neuron has different initial weights, the rate at which their weights are updated via gradient descent will also differ, and this might cause a particular neuron to approach a different local minimum than other neurons.

I think my assumption that all neurons would converge to the same weights was based on the idea that there’s some single minimum that can be reached regardless of the “direction of descent” (i.e. the classic “soup bowl” of a two-feature scenario).

If this is the case, then I would still assume all neurons would converge to the same weights, but I guess in reality, especially with more features, such a “single minimum” is extremely unlikely.

Or would the neurons in a 2 feature “soup bowl” scenario still end up with different weights somehow?

It seems you are picturing each weight moving on its own during gradient descent, but remember that the weights are also “interacting” with each other through the cost function. If one weight manages to move to a value that dramatically reduces the cost, the burden on the rest of the weights becomes much smaller. Indeed, a weight might not move at all if any move would only increase the cost. The weights have no tendency to come back together; they are simply directed by the cost function (via its derivative) to move wherever the cost is reduced.
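This interaction shows up even in a convex “soup bowl” setting. Here’s a toy sketch (a hypothetical model I made up, not from the course) where two weights only affect the prediction through their sum, so the squared-error cost is convex but has a whole valley of minima - where each weight ends up depends entirely on where it started:

```python
import numpy as np

# Two "units" whose outputs are simply added: yhat = (w1 + w2) * x.
# Any (w1, w2) with the same sum gives the same cost, so there is
# a flat valley of equally good minima rather than a single point.
x = np.array([1., 2., 3.])
y = 2.0 * x                              # true relationship: slope 2

def descend(w, lr=0.02, steps=2000):
    for _ in range(steps):
        err = (w[0] + w[1]) * x - y
        g = np.array([err @ x, err @ x]) # both weights get the same gradient
        w = w - lr * g
    return w

w_a = descend(np.array([0.0, 1.0]))
w_b = descend(np.array([3.0, -2.0]))

print(w_a, w_b)                  # different weight pairs...
print(w_a.sum(), w_b.sum())      # ...but both sums converge to ~2
```

Once the sum reaches 2, the gradient is zero and neither weight has any reason to move further - the cost function is satisfied, and the individual weights stay wherever their starting points and shared updates left them.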