Thanks @shanup !
This thread is definitely a duplicate of the thread you linked in your answer.
It took a while to understand why different initial values would result in different weights. The key seems to be that the gradient descent update for each weight depends on the current weights, and therefore on the initial weights. As such, if the neurons have differing initial weights, their weights are updated differently at every step, and a particular neuron might end up approaching a different local minimum than the other neurons.
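To check that intuition for myself, here's a minimal sketch (my own toy example, not from the assignment code) with a tiny two-hidden-neuron network trained by plain gradient descent. When both neurons start with identical weights they receive identical gradients and stay identical forever; when they start with different weights, their update trajectories diverge and they settle on different weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 samples, 2 features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]      # some arbitrary target

def train(W1, w2, lr=0.1, steps=500):
    """Tiny network: y_hat = w2 @ tanh(W1 @ x), trained on squared error."""
    for _ in range(steps):
        H = np.tanh(X @ W1.T)            # (100, 2) hidden activations
        err = H @ w2 - y                 # prediction error, shape (100,)
        grad_w2 = H.T @ err / len(y)     # gradient w.r.t. output weights
        grad_H = np.outer(err, w2) * (1 - H**2)
        grad_W1 = grad_H.T @ X / len(y)  # gradient w.r.t. hidden weights
        w2 = w2 - lr * grad_w2
        W1 = W1 - lr * grad_W1
    return W1

# identical initial weights -> the two hidden neurons stay identical
W_same = train(np.full((2, 2), 0.5), np.full(2, 0.5))
# different initial weights -> the two hidden neurons end up different
W_diff = train(rng.normal(size=(2, 2)), rng.normal(size=2))

print("identical init:\n", W_same)   # both rows equal
print("different init:\n", W_diff)   # rows differ
```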
I think my assumption that all neurons would converge to the same weights was really resting on a second assumption: that there's some single minimum that gets reached regardless of the "direction of descent" (i.e. the classic "soup bowl" of a two-feature scenario).
If that were the case, I would still expect all neurons to converge to the same weights, but I guess in reality, especially with more features and parameters, such a single minimum is extremely unlikely.
Or would the neurons in a 2-feature "soup bowl" scenario still end up with different weights somehow?
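For what it's worth, here's a small sketch (again my own, purely illustrative) of the "soup bowl" intuition on a loss that really is convex: plain linear regression with squared error. Different starting weights do converge to the same minimum. It has no hidden neurons, so it only illustrates the single-minimum part of the picture, not the symmetry question itself:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                  # 2 features
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)    # noisy linear target

def descend(w, lr=0.1, steps=2000):
    """Gradient descent on mean squared error (a convex 'soup bowl')."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# both starting points land on (approximately) the same weights, near [2, -3]
print(descend(np.zeros(2)))
print(descend(np.array([50.0, -50.0])))
```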