I’m wondering how randomly initializing the weights in a neural net prevents the different nodes in a single layer from eventually converging to the same value. Since each of the nodes within a single layer takes the exact same input values, would they not all have the same optimum? Even if they don’t reach the optimum simultaneously, would they not all eventually converge to the same optimum?
Hi @seantolino and welcome to the DL Specialization.
You may have gotten your subject line backwards: random initialization doesn’t prevent convergence, it is a (necessary) ingredient of convergence to a minimum cost (hopefully a global minimum at that).
The initial parameters need to “break symmetry” between different units. That is, if two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they do not, the gradient descent algorithm will always update both units in the same way. In a sense, the units would be redundant. The algorithm needs to explore the parameter space for learning to occur.
It’s not the most pleasant of chores, but one can convince oneself of this fact with paper and pencil on a shallow neural network with a single hidden layer.
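If paper and pencil isn’t appealing, here is a small numpy sketch that makes the same point. It is my own toy setup (not code from the course): a one-hidden-layer network with 2 tanh hidden units and a sigmoid output. If the two hidden units start with identical weights, the two rows of the gradient dW1 are identical, so every gradient step keeps them identical; with random initialization the rows differ and the units can specialize.

```python
import numpy as np

def hidden_gradients(W1, b1, W2, b2, X, y):
    # Forward pass: tanh hidden layer, sigmoid output.
    Z1 = W1 @ X + b1                 # (2, m)
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2                # (1, m)
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    m = X.shape[1]
    # Backward pass (cross-entropy loss) for the hidden-layer weights only.
    dZ2 = A2 - y                     # (1, m)
    dA1 = W2.T @ dZ2                 # (2, m)
    dZ1 = dA1 * (1 - A1 ** 2)        # tanh'(Z1)
    dW1 = (dZ1 @ X.T) / m            # (2, n_x) -- one row per hidden unit
    return dW1

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                         # 3 features, 5 examples
y = rng.integers(0, 2, size=(1, 5)).astype(float)
b1 = np.zeros((2, 1)); b2 = np.zeros((1, 1))

# Symmetric start: both hidden units have the same incoming and outgoing weights.
W1 = np.full((2, 3), 0.1)
W2 = np.full((1, 2), 0.1)
dW1 = hidden_gradients(W1, b1, W2, b2, X, y)
print(np.allclose(dW1[0], dW1[1]))                  # True: identical updates forever

# Random start: the two gradient rows differ, so the units diverge and specialize.
W1 = rng.normal(scale=0.01, size=(2, 3))
W2 = rng.normal(scale=0.01, size=(1, 2))
dW1 = hidden_gradients(W1, b1, W2, b2, X, y)
print(np.allclose(dW1[0], dW1[1]))                  # False
```

Since both rows of W1 stay equal under symmetric initialization, the two units compute the same function after every update, which is exactly the redundancy described above.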