A long time ago I read somewhere that random initializers are preferred for the weights because they help the network converge faster than zero or one initializers.
Why do you think this is the case? I can't recall the link; maybe it was machinelearningmastery or some other blog.
Initializing the weights with zeros or ones means every weight starts at the same constant value. The algorithm can then take longer to converge, and it may fail to converge at all or settle in a local minimum instead of the global minimum. Random initializers give the weights a spread of low and high starting values, which gives the algorithm a better chance of converging and of reaching the global minimum.
Can you show this mathematically?
Nice question @tbhaxor
Here is my point of view & I STAND TO BE CORRECTED IF WRONG
I think that if we start with zeros or ones we will affect our ERROR FUNCTION, which is very important in neural network learning.
Here is a high-level summary of training a neural network:
1. Doing a feedforward operation (the process a neural network uses to turn the input into an output).
2. Comparing the output of the model with the desired output.
3. Calculating the error.
4. Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
5. Using this to update the weights and get a better model.
6. Continuing this until we have a model that is good.
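The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration for a single linear neuron; the toy data, the learning rate `lr`, and the mean squared error loss are assumptions made for the example, not something from this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=(64, 1))
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=(64, 1))

# One linear neuron: small random weight, zero bias.
w = rng.normal(scale=0.1, size=(1, 1))
b = np.zeros(1)
lr = 0.1  # learning rate (assumed value)

losses = []
for epoch in range(200):                    # 6. repeat until good enough
    y_hat = x @ w + b                       # 1. feedforward
    diff = y_hat - y                        # 2. compare with the desired output
    loss = np.mean(diff ** 2)               # 3. calculate the error (MSE)
    grad_w = 2.0 * x.T @ diff / len(x)      # 4. backpropagate: dLoss/dw
    grad_b = 2.0 * np.mean(diff, axis=0)    #    and dLoss/db
    w -= lr * grad_w                        # 5. update the weights
    b -= lr * grad_b
    losses.append(float(loss))

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print(f"learned w={w[0, 0]:.2f}, b={b[0]:.2f}")
```

With a single neuron there is no symmetry to break, so the choice of initializer matters much less here than in a layer with several neurons.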
As you can see in step 3, training a neural network is mostly about trying to minimize an error function (the error function is inversely proportional to probability). So we 'throw in' some numbers to the model, see how it performs, then compare its output with the desired output (desired output - model output = error). So I think that if we start with zeros or ones we will hurt the training process.
There's some nice mathematics behind it, but it would take hours to work through it right here. Try looking for books on calculus for machine learning.
Anyone with a different idea is very welcome!
Backward because of the chain rule of differentiation, right?
Also, with real-world data I have seen that with random weights it is easy to reduce the loss in the first 5 to 7 epochs, but with a constant like 0 or 1 it takes more than 5 to 7 epochs to, as I say, "actually start converging". I haven't gone into the maths yet, but this is my experience.
By real-world data, I mean any data that contains some kind of error (noise).
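A rough way to check this kind of observation is to train the same small network from different starting points. The sketch below is only a toy experiment: the noisy quadratic data, the tanh network with 8 hidden neurons, the learning rate, and the helper names `make_params` and `train` are all assumptions made for illustration.

```python
import numpy as np

def make_params(kind, hidden, rng):
    """Build (W1, b1, W2, b2) for a one-hidden-layer net; biases start at zero."""
    if kind == "random":
        W1 = rng.normal(size=(1, hidden))
        W2 = rng.normal(size=(hidden, 1))
    else:
        value = 0.0 if kind == "zeros" else 1.0
        W1 = np.full((1, hidden), value)
        W2 = np.full((hidden, 1), value)
    return W1, np.zeros(hidden), W2, np.zeros(1)

def train(kind, x, y, hidden=8, lr=0.03, epochs=5000, seed=0):
    """Full-batch gradient descent on MSE; returns the loss history and W1."""
    rng = np.random.default_rng(seed)
    W1, b1, W2, b2 = make_params(kind, hidden, rng)
    n = len(x)
    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ W1 + b1)                    # hidden activations (n, hidden)
        y_hat = h @ W2 + b2                         # network output (n, 1)
        diff = y_hat - y
        losses.append(float(np.mean(diff ** 2)))
        d_out = 2.0 * diff / n                      # dLoss/dy_hat
        d_hidden = (d_out @ W2.T) * (1.0 - h ** 2)  # chain rule through tanh
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (x.T @ d_hidden)
        b1 -= lr * d_hidden.sum(axis=0)
    return losses, W1

# Hypothetical noisy data: y = x^2 plus Gaussian noise.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-1.0, 1.0, size=(200, 1))
y = x ** 2 + 0.05 * data_rng.normal(size=(200, 1))

results = {kind: train(kind, x, y) for kind in ("zeros", "ones", "random")}
for kind, (losses, W1) in results.items():
    identical = np.allclose(W1, W1[0, 0])
    print(f"{kind:>6}: loss after 5 epochs {losses[5]:.4f}, "
          f"final loss {losses[-1]:.4f}, hidden neurons identical: {identical}")
```

On this toy problem the constant starts keep all hidden neurons identical, so the network behaves like a single neuron and cannot fit the curve, while the random start can. How many epochs it takes on real data will depend on the data and the model.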
I’d like to add:
If you initialize the weights W to zero, you'll find that all the neurons in a layer produce the same output and the NN will not learn.
This is the symmetry problem, and fixing it is called Symmetry Breaking. To break this symmetry you initialize the weights with random numbers.
The mathematical argument is very simple: in a fully connected layer each neuron receives all the outputs of the previous layer, so if you apply the linear function with the same weight values in every neuron, the results will be exactly the same. The gradients are then the same for every neuron too, so the neurons stay identical after every update.
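Written out for one hidden layer, the argument looks roughly like this. The notation is an assumption made for this sketch and does not come from the thread: $g$ is the activation, $L$ the loss, $w_{ji}$ and $b_j$ the hidden weights and biases, $v_j$ and $c$ the output weights and bias.

```latex
% Fully connected hidden layer with n inputs and m neurons, then a linear output.
z_j = \sum_{i=1}^{n} w_{ji}\, x_i + b_j, \qquad
a_j = g(z_j), \qquad
\hat{y} = \sum_{j=1}^{m} v_j\, a_j + c

% Constant initialisation: w_{ji} = w, b_j = b, v_j = v for every j, so
z_j = w \sum_{i=1}^{n} x_i + b
\quad\Rightarrow\quad
a_1 = a_2 = \dots = a_m

% Chain rule for the gradients of the loss L:
\frac{\partial L}{\partial v_j} = \frac{\partial L}{\partial \hat{y}}\, a_j, \qquad
\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial \hat{y}}\, v_j\, g'(z_j)\, x_i, \qquad
\frac{\partial L}{\partial b_j} = \frac{\partial L}{\partial \hat{y}}\, v_j\, g'(z_j)

% None of the right-hand sides depends on j, so every neuron gets the same
% update and a_1 = \dots = a_m still holds after each gradient step.
% With w = v = b = 0 and g(0) = 0 (for example tanh), all three gradients are zero.
```

Random initial values make $z_j$, and therefore the gradients, differ from neuron to neuron, which is what lets the neurons learn different features.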
This is the answer I was actually looking for. I know I had heard this term before but couldn't recall it. Thank you @Juan_Olano