I was wondering if anyone could explain to me why the vector sizes are what they are on this slide – I think there is a simple thing I am missing.

For example, why is our W vector 4 x 3? Is it because there are 3 features with 4 hidden layers? But then why is x simply 3 x 1? In my mind, wouldn't you want them both to be 4 x 3?

And then b is 4 x 1… Why isn't it 4 x 3? I know this is probably something obvious I should understand by now, but it's just a little unclear to me. Thank you!

For starters, w was a vector for logistic regression, but with a neural network, the weights are a matrix. The particular network layer that Prof Ng is showing us there is the first "hidden" layer, which takes 3 x 1 vectors as input and has 4 output neurons. That is why the shape of W needs to be 4 x 3: the mathematical formula that we are implementing for the first step of the layer is this linear expression:

z = W \cdot x + b

If you work out the dimensions on that dot product, W is 4 x 3 and x is 3 x 1, so the result will be 4 x 1. Note that what we are showing there is the processing for a single input vector x. For efficiency, we will soon see that we can "batch" multiple x values together as the columns of a matrix X, which will be 3 x m, where m is the number of input samples we have. Then the matrix multiply takes care of all the x values in one operation:

Z = W \cdot X + b
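A small NumPy sketch of those two formulas, just to check the shapes. The dimensions (3 inputs, 4 neurons) come from the slide; the actual values here are random placeholders, and `m = 5` is an arbitrary batch size chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix: 4 neurons x 3 input features
b = rng.standard_normal((4, 1))   # one bias value per output neuron

# Single sample: x is a 3 x 1 column vector.
x = rng.standard_normal((3, 1))
z = W @ x + b                     # (4,3) @ (3,1) -> (4,1), then add b (4,1)
print(z.shape)                    # (4, 1)

# Batched: m samples stacked as the columns of X.
m = 5
X = rng.standard_normal((3, m))
Z = W @ X + b                     # (4,3) @ (3,m) -> (4,m); b broadcasts
print(Z.shape)                    # (4, 5)
```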

Note the convention that Prof Ng uses is that if the object is a vector, the variable name is usually lower case. If it is capitalized, as in X, then it’s a matrix with multiple samples.

b is the “bias term”. The output z is 4 x 1 and b is added to that with a different value for each output neuron, which is why b is 4 x 1.

Note that in the case where we handle multiple samples at once, we have:

Z = W \cdot X + b

The dimensions on the dot product are now 4 x 3 dot 3 x m, so the result will be 4 x m. Then we add b, which is 4 x 1, and it adds the same value across each row to get the final Z. So you could view this from a pure math perspective as duplicating b (a column vector) to form a 4 x m matrix before doing the addition. Since this is a very common operation, numpy gives us the concept of "broadcasting", which means expanding one operand to match the required shape for an "elementwise" operation. Here's a thread which talks about broadcasting and shows some examples.
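Here is a minimal check that broadcasting the 4 x 1 bias really is equivalent to explicitly duplicating its column m times, as described above (the values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
Z_lin = rng.standard_normal((4, m))   # stands in for the result of W @ X
b = rng.standard_normal((4, 1))       # bias column vector

# Broadcasting: numpy treats b as if its column were copied m times.
Z = Z_lin + b

# The "pure math" view: duplicate b into a 4 x m matrix first.
Z_explicit = Z_lin + np.tile(b, (1, m))

print(np.allclose(Z, Z_explicit))     # True
```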


Here’s another thread from a while ago that goes through how we get from w being n_x x 1 in the Logistic Regression case to the way the W weight matrices work here and why we no longer need the transpose for real Neural Networks.
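To sketch that transition in code (my own illustration, not from the thread): in Logistic Regression each weight vector w is n_x x 1 and we compute w.T @ x. If each of the 4 neurons in our layer has its own such w, stacking the four transposed vectors as the rows of W gives exactly the 4 x 3 matrix, so the transpose is already "baked in":

```python
import numpy as np

rng = np.random.default_rng(2)
n_x = 3
x = rng.standard_normal((n_x, 1))

# Four per-neuron weight vectors, each n_x x 1, as in Logistic Regression.
ws = [rng.standard_normal((n_x, 1)) for _ in range(4)]

# Compute each neuron's output separately with the w.T @ x formula.
z_per_neuron = np.array([(w.T @ x).item() for w in ws]).reshape(4, 1)

# Stack the transposed vectors as rows to form W, shape (4, n_x).
W = np.vstack([w.T for w in ws])
z = W @ x                              # (4, 1), no explicit transpose needed

print(np.allclose(z, z_per_neuron))    # True
```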
