Why did the assignment ask for the Dense layer weight matrix to be normally distributed with a stdev of 0.1? Shouldn’t the matrix be initialized randomly and later adapted to whatever the incoming data is? Why force-fit a normal distribution?
Hello @Mayank11 ,
Thanks a lot for asking. I will do my best in this reply to explain why we initialize with a stdev of 0.1 instead of an arbitrary random initialization. I will also give some reasons why initializing with a normal distribution makes statistical sense.
The standard deviation determines the spread of the values in the normal distribution. A smaller standard deviation, such as 0.1, results in weights that are close to zero and unlikely to be large in magnitude. This can help stabilize the learning process and prevent the weights from starting out too large, which may lead to numerical instability or slow convergence.
While it is true that the weights are adapted to the incoming data during training, a random initialization drawn from a normal distribution provides a good starting point for the learning algorithm. The network can then adjust the weights based on the training data to improve its performance.
The main takeaway for initializing with a stdev of 0.1 is that it helps stabilize the learning process.
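To make the effect of the spread concrete, here is a minimal sketch that compares a stdev of 0.1 with a stdev of 1.0. It is not the assignment's code: the layer sizes, the input vector, the helper names `init_weights` and `dense_forward`, and the use of tanh as the activation are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def init_weights(n_in, n_units, stdev, rng):
    """Return an n_in x n_units matrix with i.i.d. N(0, stdev^2) entries."""
    return [[rng.gauss(0.0, stdev) for _ in range(n_units)] for _ in range(n_in)]

def dense_forward(x, w):
    """Compute x @ w for a single input vector x (no bias term)."""
    n_units = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_units)]

rng = random.Random(0)            # fixed seed so the run is repeatable
n_in, n_units = 100, 200          # hypothetical layer sizes
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]  # one made-up input vector

spread = {}
for stdev in (0.1, 1.0):
    w = init_weights(n_in, n_units, stdev, rng)
    z = dense_forward(x, w)       # pre-activations of the layer
    spread[stdev] = statistics.pstdev(z)
    # Fraction of units whose tanh output is pinned near -1 or +1.
    saturated = sum(abs(math.tanh(v)) > 0.99 for v in z) / n_units
    print(f"stdev={stdev}: pre-activation spread={spread[stdev]:.2f}, "
          f"saturated tanh units={saturated:.2f}")
```

With the larger stdev the pre-activations are spread about ten times wider, so many more units land in the flat part of tanh, where gradients are small. That is one way a large initial spread can slow learning.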
Using a normal distribution is a design choice that makes statistical sense, and here is the reasoning behind it.
The goal of weight initialization is to provide enough variance in the weights to allow for learning to take place. The normal distribution provides a good balance between having a spread of values and preventing the weights from being too large or too small, which can lead to numerical instability or slow convergence.
The normal distribution has the maximum entropy among all continuous distributions with the same mean and variance. This means that it makes the fewest assumptions about the weights beyond their mean and spread, allowing for more flexibility and adaptability to different data distributions.
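For reference, the maximum-entropy property can be written out as follows. This is a standard result for a continuous random variable on the real line; the symbols $h$, $X$, $p$, $\mu$, and $\sigma$ are notation introduced for this note, not taken from the assignment.

```latex
h(X) = -\int_{-\infty}^{\infty} p(x)\,\ln p(x)\,dx
\;\le\; \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^{2}\right),
\qquad \text{with equality if and only if } X \sim \mathcal{N}(\mu, \sigma^{2}).
```

Here $\sigma^{2}$ is the variance of $X$, so for a fixed variance no other density has higher differential entropy than the normal one.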
So, we initialize weights with a normal distribution and a stdev of 0.1 because it makes statistical sense. It is important to converge in as few training epochs as possible, and this is the main reason why we initialize weights with a normal distribution with a stdev of 0.1.
Please reply if you have any further questions.
Regards,
Can
Thanks for this answer. What I am getting is that it is statistically more probable to converge faster if the weight matrix is initialized using a normal distribution with a small stdev of 0.1.
A couple of follow-up questions:

The weight matrix for a Dense layer is the weights for multiple units stacked on top of one another. Shouldn’t each individual unit’s weights be normally distributed, rather than the entire matrix? (What I am asking is whether we should be thinking about normally distributing along an axis rather than over the full matrix.)

Should we always initialize all other layers’ weights, such as embedding, dropout, etc., with a normal distribution of stdev = 0.1?
Hello @Mayank11 ,
Thanks a lot for the follow-up questions. I will do my best to answer them.

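Each individual unit’s weights in the Dense layer are indeed normally distributed. Because every entry of the matrix is drawn independently from the same normal distribution, sampling the whole matrix at once and sampling along one axis give the same result: each unit’s weight vector, and the matrix as a whole, follow that distribution. In a Dense layer, the weight matrix holds the weights connecting the input units to the output units. Which axis corresponds to a single output unit depends on the layout: if the matrix has shape (number of inputs, number of units), each column holds the weights for one output unit and each row holds the weights leaving one input; with the transposed layout the roles are swapped. Either way, there is no need to treat one axis differently at initialization.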
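A quick way to check this is to sample a whole matrix in one go and then look at the spread of a single column and a single row. The sketch below assumes a (number of inputs, number of units) layout and made-up sizes; the variable names are hypothetical and not from the assignment.

```python
import random
import statistics

rng = random.Random(0)                 # fixed seed so the run is repeatable
n_in, n_units, stdev = 400, 400, 0.1   # hypothetical layer sizes

# Assumed layout: rows index inputs, columns index output units.
w = [[rng.gauss(0.0, stdev) for _ in range(n_units)] for _ in range(n_in)]

all_entries = [v for row in w for v in row]
first_unit = [row[0] for row in w]     # one column: weights feeding output unit 0
first_input = w[0]                     # one row: weights leaving input 0

matrix_std = statistics.pstdev(all_entries)
unit_std = statistics.pstdev(first_unit)
input_std = statistics.pstdev(first_input)

print(f"whole matrix stdev: {matrix_std:.3f}")
print(f"one unit (column):  {unit_std:.3f}")
print(f"one input (row):    {input_std:.3f}")
```

All three values come out close to 0.1, up to sampling noise, which is why the assignment can simply ask for the full matrix to be drawn from one normal distribution.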

It is not necessary to always initialize all other layers’ weights with a normal distribution and a standard deviation of 0.1. Note that a dropout layer has no trainable weights, so there is nothing to initialize there; the question only applies to layers with parameters, such as embedding layers. The choice of weight initialization method can depend on the specific layer and the requirements of the neural network architecture. Different layers may benefit from different initialization strategies, for example ones that scale the spread with the layer’s input and output sizes. It is important to consider the characteristics of each layer and experiment with different initialization methods to find the most suitable approach for your specific neural network.
I hope my replies answer your questions.
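As one example of a different strategy, the sketch below compares the fixed-stdev normal initializer with Glorot (Xavier) uniform initialization, which draws from a uniform range whose width depends on the layer's input and output sizes. This is only an illustration of a widely used alternative, not something the assignment prescribes; the function names and layer sizes are hypothetical.

```python
import math
import random
import statistics

def normal_init(n_in, n_out, rng, stdev=0.1):
    """Fixed-spread initializer: i.i.d. N(0, stdev^2) entries."""
    return [[rng.gauss(0.0, stdev) for _ in range(n_out)] for _ in range(n_in)]

def glorot_uniform_init(n_in, n_out, rng):
    """Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

def flat_std(w):
    """Standard deviation over all entries of a matrix."""
    return statistics.pstdev([v for row in w for v in row])

rng = random.Random(0)
results = {}
for n_in, n_out in ((50, 50), (200, 200)):   # made-up layer sizes
    results[(n_in, n_out)] = {
        "normal": flat_std(normal_init(n_in, n_out, rng)),
        "glorot": flat_std(glorot_uniform_init(n_in, n_out, rng)),
    }
    r = results[(n_in, n_out)]
    print(f"{n_in}x{n_out}: normal stdev={r['normal']:.3f}, "
          f"glorot stdev={r['glorot']:.3f}")
```

The fixed initializer keeps a stdev near 0.1 regardless of layer size, while the Glorot spread shrinks as the layer gets wider (its stdev is sqrt(2 / (n_in + n_out))). Which one works better is something to try out on the network at hand.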
Regards,
Can