In the context of linear regression, the recommendation is to initialize the weight vector (w) to zeros, but I’m struggling to understand the specific purpose behind this choice.

Initializing the weights to zeros doesn’t seem to offer a clear advantage in the first round of computations. When the zero-initialized weight vector is multiplied by the input vector (x), the result is all zeros, which makes me question how effective this is. I’m having trouble stating the problem precisely, but it feels like this blocks the calculation. After all, why effectively zero out all the input values?

Could you please provide further clarification on why initializing the weights to zeros is beneficial in the context of linear regression?

For basic linear regression (i.e., one neuron), there is no clear advantage or disadvantage to initializing the weights to zero. So we often just initialize them to zero, since it is the simplest thing to do.

For more complicated models with more neurons, we usually use random initialization to break symmetry (otherwise, neurons in the same layer could receive identical gradients during backprop and would never learn different things). Note that this symmetry problem does not exist for basic linear regression, so random initialization is not necessary there.
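To make the symmetry point concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration, not something from the course material: the tiny network (one tanh hidden layer with two units, no biases), the synthetic data, and the helper name `hidden_weight_grad`. It compares the gradient columns for the two hidden units when all weights start at the same constant versus when they start at random values.

```python
import numpy as np

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 examples, 3 features
y = rng.normal(size=50)

def hidden_weight_grad(W1, w2):
    """Gradient of J = 1/(2m) * sum((y_hat - y)^2) w.r.t. W1,
    for the hypothetical network y_hat = tanh(X @ W1) @ w2."""
    m = X.shape[0]
    H = np.tanh(X @ W1)                     # hidden activations, shape (m, n_hidden)
    err = H @ w2 - y                        # prediction error, shape (m,)
    dH = np.outer(err, w2) * (1.0 - H**2)   # backprop through tanh
    return X.T @ dH / m                     # one column per hidden unit

# Every weight set to the same constant: both hidden units are interchangeable
g_const = hidden_weight_grad(np.full((3, 2), 0.5), np.full(2, 0.5))

# Small random weights: the two hidden units start out different
g_rand = hidden_weight_grad(0.5 * rng.normal(size=(3, 2)), 0.5 * rng.normal(size=2))

print("constant init, identical gradients:", np.allclose(g_const[:, 0], g_const[:, 1]))
print("random init, identical gradients:  ", np.allclose(g_rand[:, 0], g_rand[:, 1]))
```

With constant initialization the two columns match, so the two hidden units would receive the same update at every step and stay identical. Linear regression has only one "neuron", so there is nothing to keep apart.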

Some things to clarify:

You can technically initialize the weights to small random numbers too, and the results should be more or less the same as initializing them to 0.

Even though the first matrix multiplication between the weights and X will be zero (if the weights are 0), the gradients of your loss will generally not be 0, because the prediction error (0 minus y) is nonzero whenever the targets are. So gradient descent still works.

The input vectors (X) are usually not all zeros (although they may contain some zeros). These are the input features, and we would generally expect them to be something meaningful. Zero-initializing the weights does not change them.

Note that if you already have good weights (for example, from a pretrained model), then it could very well be better to keep training (or fine-tuning) on existing weights rather than re-initializing the weights to 0 and starting from scratch. We just initialize to 0 if we don’t have better numbers to start with.
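The points above can be checked numerically. The following is a rough sketch rather than a reference implementation: the synthetic data (y = 3*x1 - 2*x2 + 1 plus noise), the learning rate, the step count, and the helper names `gradients` and `fit` are all assumptions chosen for illustration. It shows that the first prediction is all zeros, that the gradient at zero weights is nonetheless nonzero, and that zero and small-random initialization end up at essentially the same weights.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 3*x1 - 2*x2 + 1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.01 * rng.normal(size=200)

def gradients(X, y, w, b):
    """Gradients of J = 1/(2m) * sum((X @ w + b - y)^2) w.r.t. w and b."""
    m = X.shape[0]
    err = X @ w + b - y            # prediction error for each example
    return X.T @ err / m, err.mean()

def fit(X, y, w, b, lr=0.1, steps=500):
    """Plain batch gradient descent starting from the given w and b."""
    for _ in range(steps):
        dw, db = gradients(X, y, w, b)
        w = w - lr * dw
        b = b - lr * db
    return w, b

w0 = np.zeros(2)
first_prediction = X @ w0                  # all zeros, as the question observes
dw0, db0 = gradients(X, y, w0, 0.0)        # but the gradient is not zero

w_zero, b_zero = fit(X, y, w0, 0.0)
w_rand, b_rand = fit(X, y, 0.01 * rng.normal(size=2), 0.0)

print("first prediction all zero:", np.all(first_prediction == 0))
print("gradient at zero weights: ", dw0, db0)
print("learned from zeros:       ", w_zero, b_zero)
print("learned from small random:", w_rand, b_rand)
```

Because the squared-error cost for linear regression is convex, both starting points should end up at the same solution given enough steps. That is why zeros are as good a default as any here.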

It looks like you already found this thread, but I’ll add the link for anyone else who finds the current thread. It gives the math that hackyon mentions to show that zero initialization does not prevent learning in the logistic regression case. Based on that example, you can do the analogous derivation for linear regression. The cost function is different, of course. But once we get to real neural networks, we’ll need symmetry breaking, as hackyon mentioned, and that is also shown there.

The zero-weight initialization doesn’t hurt the learning process for linear or logistic regression. This is because the gradients are based on the product of ‘x’ and the error between f_wb and y, and that error is generally nonzero even when the weights are all zero.
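For linear regression, that statement can be written out as follows. This is a sketch that assumes the usual squared-error cost with a 1/(2m) factor and m training examples; the thread does not give the exact convention, so treat the constants as an assumption.

$$
f_{wb}(x) = w \cdot x + b, \qquad J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{wb}(x^{(i)}) - y^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{wb}(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{wb}(x^{(i)}) - y^{(i)} \right)
$$

At w = 0, b = 0 every prediction is f_wb(x) = 0, so

$$
\left. \frac{\partial J}{\partial w_j} \right|_{w=0,\, b=0} = -\frac{1}{m} \sum_{i=1}^{m} y^{(i)} x_j^{(i)}, \qquad \left. \frac{\partial J}{\partial b} \right|_{w=0,\, b=0} = -\frac{1}{m} \sum_{i=1}^{m} y^{(i)}
$$

These depend only on the data, not on the weights, so they are nonzero unless the targets happen to be uncorrelated with feature j (or average to zero, for b). The first gradient step therefore moves the weights away from zero.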