I see that we have to set initial values for the weights, and they can't all be 0, so we use some random function to do it. The question is: is it possible that the random function gives us weights such that our algorithm never reaches convergence?
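To make the question concrete, here is a minimal sketch of the usual initialization, assuming NumPy and hypothetical layer sizes (5 inputs feeding 3 units). The point of the small random values is to break symmetry: if all weights were 0, every unit in a layer would compute the same thing and receive the same gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 5 inputs feeding a layer of 3 units.
n_in, n_out = 5, 3

# Small random values break the symmetry between units;
# all-zero weights would make every unit compute (and learn) the same thing.
W = rng.normal(loc=0.0, scale=0.01, size=(n_out, n_in))

print(W.shape)  # (3, 5)
```

Any such draw gives a usable starting point; what differs between draws is which local minimum gradient descent wanders toward and how fast, not whether the weights are "legal".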

In a Dense network, each unit in a layer is connected to all of the units in the adjacent layers.
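That connectivity fixes the number of weights between two layers: one weight per (unit, previous-unit) pair. A tiny sketch, using hypothetical sizes of 5 and 3 units for two adjacent layers:

```python
# In a dense (fully connected) layer, every unit connects to every
# unit in the previous layer, so there is one weight per pair.
n_prev, n_curr = 5, 3          # hypothetical sizes of two adjacent layers
num_weights = n_prev * n_curr  # 5 * 3 = 15 connections, hence 15 weights
print(num_weights)  # 15
```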

I see, but the formula confuses me: our L2 has only 1 unit (the j index), so we should have only 1 weight. Am I understanding correctly that the number of connections (and weights) differs from one layer type to another, and that I will learn this later?

j is just a variable addressing a particular unit in a layer. If you look at the enlarged portion of the diagram you posted, you can see that layer 3 has 3 units. Each circle within a layer is a unit.

So for layer 2, there are 5 units. If we were to take layer 2 unit 4, j would be 4 and l would be 2.
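The indexing above can be sketched in code. This is a hypothetical example (the layer sizes are made up: 4 units in layer 1, 5 in layer 2, 3 in layer 3) showing one common convention where W[l][j, k] is the weight from unit k in layer l−1 into unit j in layer l:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical layer sizes: layer 1 has 4 units, layer 2 has 5, layer 3 has 3.
sizes = {1: 4, 2: 5, 3: 3}

# W[l][j, k] = weight from unit k in layer l-1 into unit j in layer l.
W = {l: rng.normal(size=(sizes[l], sizes[l - 1])) for l in (2, 3)}

# "Layer 2, unit 4" (l = 2, j = 4, 1-indexed) owns one row of W[2]:
# its incoming weights, one per unit in layer 1.
incoming = W[2][4 - 1]
print(incoming.shape)  # (4,)
```

So j picks a row of the layer's weight matrix, and the row length is set by how many units feed into it, which is why the count of weights changes from layer to layer.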