Hi everyone,
So I am a little bit confused about how the calculations within the layers work. First of all, when we pass our matrix X into the first layer and it gets distributed to each unit, how come we are getting different parameters (i.e. w’s and b’s) if it’s the same dataset that all three units share? Secondly, how are the parameters w and b being calculated if we just have X’s and the targets are not present? I thought the model needed the Y’s to do its computation. Would really appreciate some help with understanding this. Thanks so much.
The trick is in how the gradients of the w matrices are learned. The method is “backpropagation”, and it works from the output (where we have the ‘y’ labels), backward through the hidden layer (where we don’t have labels).
The process is complicated and isn’t covered in this course, but you can find explanations online quite easily. The video series from “3Blue1Brown” on YouTube is quite good.
Also if your drawing is of a neural network, you’re missing the depiction of the weight matrices, and of the hidden layer.
There is a weight matrix that connects each pair of adjacent layers.
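For instance, here is a minimal NumPy sketch (the layer sizes are just made up for illustration) with one weight matrix and bias vector per pair of adjacent layers:

```python
import numpy as np

# Made-up sizes: 4 inputs, a hidden layer of 3 units, and 1 output unit
n_x, n_h, n_y = 4, 3, 1

# One weight matrix (and bias) connects each pair of adjacent layers
W1 = np.random.randn(n_h, n_x) * 0.01   # input layer -> hidden layer
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h) * 0.01   # hidden layer -> output layer
b2 = np.zeros((n_y, 1))

print(W1.shape, W2.shape)               # (3, 4) (1, 3)
```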
Okay. I’ll check out the YouTube series. Thanks so much :). For the drawing, aren’t the weight matrices the “a[1]” in the screenshot, which shows the output from layer one?
This drawing is backwards from the normal presentation. Here the weights are the rectangles, and the layers are the arrows. Usually it’s shown the other way around (the units are the boxes and the weights are the arrows).
In addition to what @TMosh clearly explained, I’d like to add a couple of thoughts regarding your questions:
Regarding your first question: “how come we are getting different parameters (i.e. w’s and b’s) if it’s the same dataset that all three units share?”
You’ll learn that, at each node, you compute the following operation: W.T*X + b, which is a linear function, and then you apply an activation function, like the sigmoid.
From this linear equation we have that, as you very well said, X goes to all units of the layer, so they all receive the same values of X; the difference is in W and b. Both W and b are initialized with random values, and then, as the NN is trained, they are updated by the ‘backward propagation’ or ‘backprop’ process, which you will soon learn about. Backprop is a series of calculations that runs from the end of the NN back to the beginning, and at each step it updates the W’s and b’s of each layer. That is why each unit (and each layer) ends up with different W’s and b’s.
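To make that concrete, here is a tiny NumPy sketch (the feature values and layer size are made up) showing that the same X gives different outputs at each unit simply because each unit starts with its own randomly initialized weights and bias:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(1)

X = np.array([[0.5], [1.0], [2.0]])   # one example with 3 features, shape (3, 1)

# A layer of 3 units: row i of W holds the randomly initialized weights of unit i
W = np.random.randn(3, 3) * 0.01
b = np.zeros((3, 1))

Z = W @ X + b        # same X, but three different linear combinations
A = sigmoid(Z)       # three different activations from the same input
print(A)
```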
Regarding your second question: “how are the parameters w and b being calculated if we just have X’s and the targets are not present? I thought the model needed the Y’s to do its computation.”
The W and b parameters are initialized with random values. Then, once the NN is being trained, there are many iterations of forward and backward ‘propagations’. It is in the backward propagation, where the labels Y come into play, that W and b get updated, and this is actually the magic of the NN. This is the process in which the NN learns.
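Here is a minimal sketch of that loop for a single sigmoid unit (the data is made up, and a real NN repeats this for every layer): the forward pass uses the current w and b, the backward pass uses the labels Y to compute gradients, and the update step is where w and b actually change.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(0)

# Toy data (made up): 4 features, 5 examples, binary labels Y
X = np.random.randn(4, 5)
Y = np.array([[0, 1, 1, 0, 1]])

# Random initialization, as described above
w = np.random.randn(4, 1) * 0.01
b = 0.0
lr = 0.1
m = X.shape[1]

for i in range(1000):
    # Forward propagation: predictions from the current w and b
    A = sigmoid(w.T @ X + b)
    # Backward propagation: the gradients need the labels Y
    dZ = A - Y
    dw = (X @ dZ.T) / m
    db = np.sum(dZ) / m
    # Gradient-descent update: this is where w and b are learned
    w -= lr * dw
    b -= lr * db
```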
So my big hint is: when you get to forward and backward propagation, make sure you understand those two processes perfectly, because that is where most of the magic happens.
Hope this sheds some more light on your questions!
Juan
This makes a lot of sense now. Thanks so much for explaining