Please correct me if I’m wrong. From what I understand in the lecture, each hidden layer would separately go through its own gradient descent process to reach a local minimum (or the global one if possible), and each layer can have multiple units, which give different values in the activation passed to the next layer.

My question: how can a hidden layer whose units all use the same equation and the same inputs give you multiple different values as a result? Does it mean that the initial values of w,b mainly decide what kind of values you will get from the hidden layer? If so, how can we pick the initial values of w,b? If not, what actually makes multiple identical equations give different results with the same input?

There is one gradient descent process that updates all of the weights in the NN.

If the initial weight values are set randomly, then each unit will begin adjusting its weights from different initial conditions.

This allows each unit to learn its own distinct set of weights.

The concept is called “symmetry breaking”.
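A small experiment can make symmetry breaking concrete. The sketch below is only an illustration under assumed settings: a hypothetical helper `train_hidden_weights`, a toy dataset, a tanh hidden layer with 3 units, a sigmoid output, and plain gradient descent. It trains the same network twice, once starting from identical weights for every unit and once from random weights, and then checks whether the hidden units ended up with the same weights.

```python
import numpy as np

def train_hidden_weights(W1, W2, steps=100, lr=0.5):
    """Train a tiny 2-layer net (tanh hidden, sigmoid output); return the hidden weights."""
    data_rng = np.random.default_rng(0)
    X = data_rng.normal(size=(2, 50))            # toy data: 2 features, 50 examples
    Y = (X[0:1] * X[1:2] > 0).astype(float)      # toy XOR-like labels, shape (1, 50)
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        # forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # backpropagation for the cross-entropy cost
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.mean(axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.mean(axis=1, keepdims=True)
        # one gradient descent step on all parameters at once
        W1 = W1 - lr * dW1
        b1 = b1 - lr * db1
        W2 = W2 - lr * dW2
        b2 = b2 - lr * db2
    return W1

# Case 1: every unit starts from the same weights.
W_same = train_hidden_weights(np.full((3, 2), 0.5), np.full((1, 3), 0.5))

# Case 2: every unit starts from different random weights.
init_rng = np.random.default_rng(1)
W_rand = train_hidden_weights(init_rng.normal(size=(3, 2)) * 0.5,
                              init_rng.normal(size=(1, 3)) * 0.5)

# Compare every row (one row per hidden unit) against the first row.
rows_identical_same = np.allclose(W_same, W_same[0])
rows_identical_rand = np.allclose(W_rand, W_rand[0])
print("identical init  -> units still identical:", rows_identical_same)
print("random init     -> units still identical:", rows_identical_rand)
```

With identical starting weights, every hidden unit receives the same gradient at every step, so the units never diverge from each other. Random starting values are what let them learn different things.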

The NN cost function is not convex, so there is more than one solution that will give a local minimum cost.

So we aren’t guaranteed to get the global minimum - but we should find a minimum that is “good enough”.

It may take several tries with different initial conditions.
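Here is a minimal sketch of what "several tries" could look like in practice. Everything in it is an assumption made for illustration: the XOR toy data, the hypothetical helper `final_cost`, the network size, the learning rate, and the number of seeds. Each seed gives a different random starting point, and we simply keep the run with the lowest final cost.

```python
import numpy as np

# Toy XOR data: 2 features, 4 examples.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])

def forward(W1, b1, W2, b2):
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2

def final_cost(seed, n_hidden=2, steps=2000, lr=1.0):
    """Train from a seed-dependent random start and return the final cross-entropy cost."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(n_hidden, 2))
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.normal(size=(1, n_hidden))
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1, A2 = forward(W1, b1, W2, b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)   # uses W2 before it is updated
        W2 -= lr * dZ2 @ A1.T / m
        b2 -= lr * dZ2.mean(axis=1, keepdims=True)
        W1 -= lr * dZ1 @ X.T / m
        b1 -= lr * dZ1.mean(axis=1, keepdims=True)
    _, A2 = forward(W1, b1, W2, b2)
    A2 = np.clip(A2, 1e-12, 1.0 - 1e-12)       # avoid log(0)
    return float(-np.mean(Y * np.log(A2) + (1.0 - Y) * np.log(1.0 - A2)))

# Try several random starting points and keep the best one.
costs = {seed: final_cost(seed) for seed in range(5)}
best_seed = min(costs, key=costs.get)
for seed, cost in costs.items():
    print(f"seed {seed}: final cost {cost:.4f}")
print("best seed:", best_seed)
```

Different seeds may settle at different cost values, which is the non-convexity showing up in practice.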

Thank you for the response. This makes a lot of sense now. Is there any method for manually picking the initial weights? And how do you know how many units and layers you need to get a “good enough” result?

Hi @Thong_Nguyen ,

I would just like to add a note to the great answer given by @TMosh:

As you know, each layer goes first through a linear transformation (W*a+b) and then a non-linear transformation.

You said “same identical equations” and this is true: the equation per se is identical for every unit: W*a+b.

And then you said [same identical] inputs. The input a is indeed shared by all the units of a layer, but the parameters are not: in the linear equation, each neuron of the layer has its own row of the W matrix, and those rows generally hold different values.

It all starts with the forward propagation: in the very first propagation, the W of each layer are initialized randomly (there are several algorithms to initialize W).

So we actually start already with different values for each unit in each layer.
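As a quick illustration, here is a minimal NumPy sketch of the forward step for a single layer. The layer sizes, the tanh activation, the 0.01 scaling and the variable names are all assumptions chosen for the example. Every unit sees the same input vector, but each unit has its own row of W, so the outputs differ.

```python
import numpy as np

rng = np.random.default_rng(42)

a_prev = np.array([[1.0], [2.0], [3.0]])   # the same input goes to every unit
W = rng.normal(size=(4, 3)) * 0.01         # 4 units, one row of weights per unit
b = np.zeros((4, 1))                       # biases can safely start at zero

z = W @ a_prev + b                         # linear part: the same equation for all units
a = np.tanh(z)                             # non-linear part

print("weights per unit:\n", W)
print("activation per unit:\n", a)
```

Printing `a` shows four different activations even though all four units evaluated the same formula on the same input.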

Once the forward prop ends, we calculate the cost function, and start the backprop.

The backprop goes layer by layer and, for each layer, computes the gradients that are used to update the W matrix and the b bias. Because the units started from different values, the updated W parameters continue to be different for each unit.

Another way to gain intuition on this:

What would happen if we started with all W parameters set to, say, 0? In that case the formula W*a+b would give the same result for every unit in every step of the forward prop:

**W a+b = 0a+b = b**

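Every unit in a layer would then compute the same output and, during backprop, receive the same gradient, so the units would stay identical to one another no matter how long we train. With some setups (for example tanh or ReLU hidden units and zero biases) the gradients for W are exactly zero, so the weights would not learn at all. In that case, your premise would be true. And that’s why we always initialize W with random values.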
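The zero-initialization case can be checked numerically with one forward and one backward pass. This is a sketch under stated assumptions: a tanh hidden layer, a sigmoid output, a cross-entropy cost, biases also set to zero, and made-up toy data. A different activation or non-zero biases would change the details.

```python
import numpy as np

# Made-up toy data: 2 features, 3 examples.
X = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7]])
Y = np.array([[1.0, 0.0, 1.0]])
m = X.shape[1]

# Everything starts at zero (3 hidden units, 1 output unit).
W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))

# Forward propagation: W*a+b collapses to b, which is 0 here.
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

# Backpropagation for the cross-entropy cost.
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = dZ2.mean(axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
dW1 = dZ1 @ X.T / m
db1 = dZ1.mean(axis=1, keepdims=True)

print("dW1:\n", dW1)
print("dW2:\n", dW2)
print("db1:\n", db1)
print("db2:\n", db2)
```

Under these assumptions the gradients of both weight matrices come out as zero, and only the output bias gets a non-zero gradient, so gradient descent has nothing to apply to W.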

I hope this enhances your intuition on the subject.

Juan

Experimentation is the usual method, especially for the number of units and number of layers.

You want a system that’s just complicated enough to work well, but that doesn’t take too long for training or prediction.

Good enough means that you get sufficiently good results when you check the performance using a test set.
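A rough sketch of that kind of experiment is below. The toy dataset (points inside or outside a circle), the 300/100 train/test split, the candidate hidden-layer sizes, the hypothetical helper `train_and_score`, and the training settings are all assumptions for illustration. The idea is only the loop itself: train one network per candidate size and compare accuracy on data that was not used for training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_and_score(n_hidden, X_tr, Y_tr, X_te, Y_te, steps=500, lr=0.5, seed=0):
    """Train a 2-layer net with n_hidden units and return its test-set accuracy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(n_hidden, X_tr.shape[0])) * 0.5
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.normal(size=(1, n_hidden)) * 0.5
    b2 = np.zeros((1, 1))
    m = X_tr.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X_tr + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y_tr
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)   # uses W2 before it is updated
        W2 -= lr * dZ2 @ A1.T / m
        b2 -= lr * dZ2.mean(axis=1, keepdims=True)
        W1 -= lr * dZ1 @ X_tr.T / m
        b1 -= lr * dZ1.mean(axis=1, keepdims=True)
    predictions = sigmoid(W2 @ np.tanh(W1 @ X_te + b1) + b2) > 0.5
    return float((predictions == Y_te).mean())

# Toy data: label is 1 when the point lies inside the unit circle.
data_rng = np.random.default_rng(123)
X = data_rng.normal(size=(2, 400))
Y = ((X ** 2).sum(axis=0, keepdims=True) < 1.0).astype(float)
X_train, Y_train = X[:, :300], Y[:, :300]
X_test, Y_test = X[:, 300:], Y[:, 300:]

# Try a few hidden-layer sizes and compare them on the test set.
scores = {n: train_and_score(n, X_train, Y_train, X_test, Y_test) for n in (1, 2, 4, 8)}
for n, acc in scores.items():
    print(f"{n} hidden units: test accuracy {acc:.2f}")
```

You would then pick the smallest size whose test accuracy is good enough for your purpose.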