After going through the first week of MLS Course 2, I now have a good idea of how to use a basic neural network, but I don’t understand how the units are trained, or why we really need multiple units in a layer.

In the coffee roasting example, we used 2 layers. The first layer had 3 units and the second layer had 1 unit, both using sigmoid for the activation function.

I believe that before the model is trained, each unit’s weights and bias (W, b) are initialized randomly. If each unit in the first layer gets trained on the same data, wouldn’t they all end up with the same weights at the end? Or am I missing something? Is each unit in a layer being trained on different data?

The hidden layers take the outputs of the units in the previous layer (the input layer, for example) and distill those features into a set of more abstract characteristics.

For example, in a NN that processes images, the input may be all of the pixels of an image, and the hidden layer may represent just particular shapes that have been learned from the raw pixels.

Setting the initial hidden-unit weights to small random values helps ensure that each hidden unit starts from a different position in the solution space and follows a different gradient path (because the NN cost function is not convex; it has many local minima).

Mathematically it is difficult to prove, and is beyond the scope of this course.
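It is easy to see empirically, though. Here is a minimal NumPy sketch (my own toy example, not from the course): a tiny 2-3-1 sigmoid network trained with plain gradient descent. When all three hidden units start with identical weights, they receive identical gradients at every step and so stay identical forever; small random initial weights break that symmetry.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, b1, W2, b2, X, y, lr=0.5, steps=500):
    """Plain gradient descent on a 2-3-1 sigmoid network with logistic loss."""
    for _ in range(steps):
        A1 = sigmoid(X @ W1 + b1)            # hidden activations, shape (m, 3)
        A2 = sigmoid(A1 @ W2 + b2)           # output, shape (m, 1)
        dZ2 = A2 - y                         # gradient of logistic loss w.r.t. Z2
        dW2 = A1.T @ dZ2 / len(X)
        dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)   # backprop through the hidden layer
        dW1 = X.T @ dZ1 / len(X)
        W2 -= lr * dW2; b2 -= lr * dZ2.mean(0)
        W1 -= lr * dW1; b1 -= lr * dZ1.mean(0)
    return W1

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, :1] * X[:, 1:] > 0).astype(float)  # a toy nonlinear (XOR-like) target

# Case 1: all three hidden units start with the SAME weights.
same = train(np.ones((2, 3)) * 0.5, np.zeros(3),
             np.ones((3, 1)) * 0.5, np.zeros(1), X, y)
print(np.allclose(same[:, 0], same[:, 1]))   # True: the units never diverge

# Case 2: small random initial weights break the symmetry.
rand = train(rng.normal(scale=0.1, size=(2, 3)), np.zeros(3),
             rng.normal(scale=0.1, size=(3, 1)), np.zeros(1), X, y)
print(np.allclose(rand[:, 0], rand[:, 1]))   # False: the units learn different features
```

In Case 1, every hidden unit computes the same activation, so it receives the same gradient and takes the same update step; randomizing the starting point is what lets the units specialize.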

Any following layers (the output layer, for example) would combine the shapes found by the hidden layer into a representation of some object (like a cat, a dog, a number, or a letter).

To be clear, I want to check my conclusions. Please let me know if either of these conclusions is wrong or needs clarification:

1. Each unit in a layer is trained on the same data.

2. The reason each unit in a layer does not end up with the same weights after training is that each settles into a different local minimum, due to the initial weights being set randomly.

This leads me to another question:
3. In some situations, might a layer with 3 units end up with all the same weights after training, even though their weights were set randomly to begin with? I would imagine this could happen if there were only one minimum.

I don’t understand why it would be very unlikely. In a very simple situation, using a layer with multiple units, wouldn’t each unit end up with the same weights after training? If there is only one local minimum, then they would all reach the same weights. No?

Having only one minimum means there is only one set of weight values that corresponds to the minimum cost. However, within that set, the individual weight values can still differ from one another.
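A tiny sketch of that point (my own illustration, not from the course): a linear least-squares problem has exactly one minimum, yet the weight values at that minimum are not equal to each other.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])  # noiseless target; the true weights differ

# The unique minimizer of ||Xw - y||^2 (one minimum, found in closed form)
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(w, 6))  # ≈ [ 2. -3.]: a single minimum, two different weight values
```

So "one minimum" constrains the *set* of weights the optimizer reaches, not whether the weights within that set match each other.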

The reason we end up with different weights is exactly because we initialize them differently at the beginning. Please refer to this post for a simple explanation; within it there is another link to some maths and an example explaining why the weights become diversified.