I see that we should initially set some values for the weights, and they can't all be 0, so we use some random function to do it. The question is: is it possible that the random function gives us weights such that our algorithm will not reach convergence?
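For context, here is a minimal NumPy sketch of what such a random initialization can look like (the layer sizes and the 0.01 scale are just assumptions for illustration, not the course's exact scheme). The random draw only sets the starting point; gradient descent updates the weights from there.

```python
import numpy as np

# Rough illustration of random weight initialization.
# The sizes and the 0.01 scale are assumptions, not the course's exact scheme.
rng = np.random.default_rng(seed=42)

n_inputs, n_units = 2, 3                              # assumed layer sizes
W = rng.normal(0.0, 0.01, size=(n_units, n_inputs))   # small random weights
b = np.zeros(n_units)                                  # biases can start at zero

print(W)
```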
In a Dense network, each unit in a layer is connected to all of the units in the adjacent layers.
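As a rough illustration (the 5-input, 3-unit sizes below are made up for the example), the number of weights in a Dense layer is the product of the two layer sizes, plus one bias per unit:

```python
import numpy as np

# Illustrative only: with 5 inputs and 3 units, every input connects to every
# unit, so there are 5 * 3 weights plus 3 biases.
n_in, n_out = 5, 3
W = np.zeros((n_out, n_in))     # one row of weights per unit in the layer
b = np.zeros(n_out)             # one bias per unit

print(W.size + b.size)          # 18 trainable parameters
```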
I see, but the formula confuses me, because our L2 has only 1 unit (the j index), and then we should have only 1 weight. Do I understand correctly that the number of connections (and weights) differs from one layer type to another, and that I will learn this later?
This is a better representation (omitting the bias units for clarity):
There are two input features.
There are three hidden layer units.
There is one output layer unit.
Each unit in A1 and A2 also has a bias value (not shown here; see the sketch below).
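Here is a minimal NumPy sketch of that same 2-3-1 shape (the sigmoid activation and the random values are assumptions for illustration); it just shows how the weight shapes fall out of the unit counts:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer 1: 2 input features -> 3 hidden units, so W1 is 3x2 and b1 has 3 entries.
W1 = rng.normal(size=(3, 2))
b1 = np.zeros(3)

# Layer 2: 3 hidden units -> 1 output unit, so W2 is 1x3 and b2 has 1 entry.
W2 = rng.normal(size=(1, 3))
b2 = np.zeros(1)

x = np.array([0.5, -1.2])        # one example with two input features
a1 = sigmoid(W1 @ x + b1)        # activations of the 3 hidden units (A1)
a2 = sigmoid(W2 @ a1 + b2)       # activation of the single output unit (A2)

print(a1.shape, a2.shape)        # (3,) and (1,)
```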
j is just a variable addressing a particular unit in a layer. If you look at the enlarged portion of the diagram you posted, you can see that layer 3 has 3 units. Each circle within a layer is a unit.
So for layer 2, there are 5 units. If we were to take layer 2 unit 4, j would be 4 and l would be 2.
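A small sketch of that indexing, assuming layer 2's 5 units are each fed by 4 values from layer 1 (the 4 is made up for illustration; the 5 units come from the example above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sizes: layer 2 has 5 units, each fed by 4 inputs from layer 1.
W2 = rng.normal(size=(5, 4))   # row j-1 holds the weights of unit j in layer 2
b2 = np.zeros(5)               # one bias per unit

l, j = 2, 4                    # layer l = 2, unit j = 4, as in the example above
w_l_j = W2[j - 1]              # the weight vector for that particular unit
print(f"layer {l}, unit {j} weights:", w_l_j)   # 4 weights, one per incoming connection
```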