Say we are dealing with a logistic cost that is a convex function, I am confused how Neural Networks in each layer though initialized with different parameters, how come each neuron does not come up with the same output after gradient descent? I don’t understand how the neural network doesn’t collapse, and I feel like I am missing some intuition of how the neural networks learn patterns by themselves.
Because of the non-linear activation function in the hidden layers, the NN cost function is not convex.
Given different initial values, each weight will follow its own trajectory to a value that minimizes the cost.
It seems like magic, and I don’t have a mathematical proof. But it works.
Your intuition is right under special circumstances where the gradients wrt each weight in a layer will be identical if all the weights are initialized the same constant. This Stack Overflow discussion gives you some more intuition.