This explanation is throwing me off and seems inconsistent with the material presented up to this point. Andrew describes z^{[1](1)} as being calculated as W^{[1]} x^{(1)} + b^{[1]}. First of all, this seems incomplete. W is a matrix with one row for each node in a hidden layer and one column for each weight, and these weights correspond to the nodes of the previous layer. This means that z^{[1](1)} should instead be calculated as w^{[1](1)} x^{(1)} + w^{[1](2)} x^{(2)} + … + b^{[1]}. Andrew does not use x^{(2)} or x^{(3)} or any other x values to calculate z^{[1](1)}, yet he uses x^{(2)} to calculate z^{[1](2)}. This seems wrong. Any help understanding this would be greatly appreciated.
Welcome to the community!
First of all, I’m afraid that you may be mixing up the layer number and the hidden unit number.
If we go back to the 2nd video of this week, it defines the notation as follows.
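Briefly, that notation is: a superscript in square brackets, [l], is the layer number; a superscript in parentheses, (i), is the i-th training sample; and a subscript is the unit (node) within a layer. So a^{[1](i)}_2, for example, is the output of the 2nd hidden unit in layer 1 for the i-th sample, and m is the number of training samples.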
So, in the 5th video, Andrew is talking about a set of hidden units, a^{[1]}, not an individual hidden unit like a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, or a^{[1]}_4.
Then, let’s revisit this equation, which is for the i-th sample, x^{(i)}, in the first hidden layer.
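Written out (using \sigma for the hidden-layer activation here, as in the video; in general it can be any activation g), it is:

z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \qquad a^{[1](i)} = \sigma(z^{[1](i)})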
As you see, the weight matrix W^{[1]} also stacks the weights for each hidden unit, like this:
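With the example network from the videos (4 hidden units, 3 input features), it looks roughly like:

W^{[1]} = \begin{bmatrix} w^{[1]T}_1 \\ w^{[1]T}_2 \\ w^{[1]T}_3 \\ w^{[1]T}_4 \end{bmatrix}

i.e. a (4 x 3) matrix whose j-th row, w^{[1]T}_j, holds the weights of the j-th hidden unit.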
The weights for each hidden unit form a row vector in this case. (Sometimes Andrew uses W^{T} and sometimes uses W, but to simplify, let’s assume row vectors.)
Then, z^{[1](1)} for the first hidden layer with the first sample x^{(1)} can be written as:
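Something like:

z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]} = \begin{bmatrix} w^{[1]T}_1 x^{(1)} + b^{[1]}_1 \\ w^{[1]T}_2 x^{(1)} + b^{[1]}_2 \\ w^{[1]T}_3 x^{(1)} + b^{[1]}_3 \\ w^{[1]T}_4 x^{(1)} + b^{[1]}_4 \end{bmatrix}

Note that each row uses the whole x^{(1)}, i.e. all of its features.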
Then, putting all m samples in here…
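Roughly (showing just the first 3 samples as columns):

Z^{[1]} = W^{[1]} \begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} \end{bmatrix} + b^{[1]} = \begin{bmatrix} W^{[1]} x^{(1)} + b^{[1]} & W^{[1]} x^{(2)} + b^{[1]} & W^{[1]} x^{(3)} + b^{[1]} \end{bmatrix}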
So, if you focus on the output from the first unit in the first hidden layer, it is:
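That is, for the i-th sample,

z^{[1](i)}_1 = w^{[1]}_{1,1} x^{(i)}_1 + w^{[1]}_{1,2} x^{(i)}_2 + w^{[1]}_{1,3} x^{(i)}_3 + b^{[1]}_1

i.e. the dot product of the first row of W^{[1]} with all the features of x^{(i)}, plus the first bias.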
I suppose this is what you wanted to see.
So, I encourage you to revisit past videos. All are well-explained by Andrew.
Hope this helps.
Thank you, this is very helpful. One more question: I noticed that there are bias terms coming from each x, all contributing to the final value Z. I thought that each node only has one bias term, as it has been presented as a single column assigned to a layer. Is this bias column for a layer contributing to the Z for that layer, or is it used in the calculation of the Z terms in the next layer, as you showed in your response?
Okay, so the bias term contributes to the layer it is assigned to. I was just confused because you showed the bias term with each x. Does the bias term contribute to a Z value only once, or does it contribute once for every x?
Sorry for the confusion. I should have continued writing up to x^{(m)} rather than stopping at x^{(3)}, which is easy to confuse with x_3. I will update tomorrow and add some description of X, which stacks x^{(1)}, x^{(2)}, …, x^{(m)}, for clarification for other learners who may visit here.
To answer your question, the bias term is added to every x^{(i)} after the “dot product” with W, like this:
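Roughly:

Z^{[1]} = \begin{bmatrix} W^{[1]} x^{(1)} + b^{[1]} & W^{[1]} x^{(2)} + b^{[1]} & \cdots & W^{[1]} x^{(m)} + b^{[1]} \end{bmatrix}

Each column gets the same b^{[1]} added exactly once, so the bias contributes once per sample, not once per feature.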
There is also the vectorized equation in Andrew’s chart:
Z^{[1]} = W^{[1]}X + b^{[1]}
b^{[1]} is a single column vector and is “broadcast” across W^{[1]}X, which has m columns.
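If it helps, here is a small numpy sketch (toy sizes and random numbers, just to illustrate the broadcasting) showing that b^{[1]} is added once per sample column:

```python
import numpy as np

np.random.seed(0)

n_x, n_h, m = 3, 4, 5            # input features, hidden units, samples (toy sizes)
W1 = np.random.randn(n_h, n_x)   # (4, 3)
b1 = np.random.randn(n_h, 1)     # (4, 1) -- one bias per hidden unit
X = np.random.randn(n_x, m)      # (3, 5) -- samples stacked as columns

Z1 = np.dot(W1, X) + b1          # b1 is broadcast across the m columns

# Same result computed one sample (column) at a time:
Z1_loop = np.zeros((n_h, m))
for i in range(m):
    Z1_loop[:, i:i+1] = np.dot(W1, X[:, i:i+1]) + b1   # b1 added once per sample

print(np.allclose(Z1, Z1_loop))  # True
```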
Here is Andrew’s sketch.
Here is an updated version. A few additions: 1) clarify the structures of x^{(i)}, b^{[l]}, and X, and 2) extend the coverage to x^{(m)}, not just 3 samples, which could be confused with the number of features (x_3).
First of all, the overview of the network and the notation is as follows.
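As I understand the example network used in these videos: there are n_x = 3 input features (x_1, x_2, x_3), one hidden layer with 4 units, and one output unit. A superscript [l] is the layer number, a superscript (i) is the i-th training sample, a subscript is the unit within a layer, and m is the number of training samples.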
In the 5th video, Andrew is talking about a set of hidden units, a^{[1]}, not an individual hidden unit like a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, or a^{[1]}_4.
Then, let’s revisit this equation, which is for the i-th sample, x^{(i)}, in the first hidden layer.
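That is, for each sample:

z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \qquad a^{[1](i)} = \sigma(z^{[1](i)})

(\sigma here just stands for the hidden-layer activation.)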
As you see, the weight matrix W^{[1]} also stacks the weights for each hidden unit, like this:
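Assuming the 4 hidden units and 3 input features above, it is roughly:

W^{[1]} = \begin{bmatrix} w^{[1]T}_1 \\ w^{[1]T}_2 \\ w^{[1]T}_3 \\ w^{[1]T}_4 \end{bmatrix} = \begin{bmatrix} w^{[1]}_{1,1} & w^{[1]}_{1,2} & w^{[1]}_{1,3} \\ w^{[1]}_{2,1} & w^{[1]}_{2,2} & w^{[1]}_{2,3} \\ w^{[1]}_{3,1} & w^{[1]}_{3,2} & w^{[1]}_{3,3} \\ w^{[1]}_{4,1} & w^{[1]}_{4,2} & w^{[1]}_{4,3} \end{bmatrix} \quad (4 \times 3)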
The weights for each hidden unit form a row vector in this case. (Sometimes Andrew uses W^{T} and sometimes uses W, but to simplify, let’s assume row vectors.)
The bias for the first hidden layer and the first sample x^{(1)} are described as follows:
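Both are column vectors, with one bias value per hidden unit and one feature value per input feature:

b^{[1]} = \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} \quad (4 \times 1), \qquad x^{(1)} = \begin{bmatrix} x^{(1)}_1 \\ x^{(1)}_2 \\ x^{(1)}_3 \end{bmatrix} \quad (3 \times 1)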
Then, z^{[1](1)} for the first hidden layer with the first sample x^{(1)} can be written as:
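Roughly:

z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]} = \begin{bmatrix} w^{[1]}_{1,1} x^{(1)}_1 + w^{[1]}_{1,2} x^{(1)}_2 + w^{[1]}_{1,3} x^{(1)}_3 + b^{[1]}_1 \\ w^{[1]}_{2,1} x^{(1)}_1 + w^{[1]}_{2,2} x^{(1)}_2 + w^{[1]}_{2,3} x^{(1)}_3 + b^{[1]}_2 \\ w^{[1]}_{3,1} x^{(1)}_1 + w^{[1]}_{3,2} x^{(1)}_2 + w^{[1]}_{3,3} x^{(1)}_3 + b^{[1]}_3 \\ w^{[1]}_{4,1} x^{(1)}_1 + w^{[1]}_{4,2} x^{(1)}_2 + w^{[1]}_{4,3} x^{(1)}_3 + b^{[1]}_4 \end{bmatrix}

Every row (hidden unit) uses all three features of x^{(1)}.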
Then, putting all m samples in here…
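Stacking the m samples as columns gives, roughly:

Z^{[1]} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix} = W^{[1]} \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} + b^{[1]} \qquad (1)

where b^{[1]} is added to every column (once per sample).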
So, if you focus on the output from the first unit in the first hidden layer, it is:
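That is, the first row of Z^{[1]}: for every sample i = 1, \ldots, m,

z^{[1](i)}_1 = w^{[1]}_{1,1} x^{(i)}_1 + w^{[1]}_{1,2} x^{(i)}_2 + w^{[1]}_{1,3} x^{(i)}_3 + b^{[1]}_1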
If we write X in Andrew’s format, it is:
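Roughly, X stacks the samples as columns and the features as rows:

X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(m)}_1 \\ x^{(1)}_2 & x^{(2)}_2 & \cdots & x^{(m)}_2 \\ x^{(1)}_3 & x^{(2)}_3 & \cdots & x^{(m)}_3 \end{bmatrix} \quad (n_x \times m)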
The vectorized equation in Andrew’s chart, Z^{[1]} = W^{[1]}X + b^{[1]}, is exactly the same as equation (1) above.