Vectorization implementation question

On the uploaded slide (it's from the vectorization lectures), matrix A has to be transposed to calculate Z. While I understand the math behind it, I wonder whether A shouldn't be in the transposed shape from the beginning.

Based on the previous lectures, each unit in the layer has two weights (W.shape[1]), so the training set should have two features (columns). If I'm right, then shouldn't matrix A have 3 rows and 2 columns from the beginning?

import numpy as np

A = np.array([
    [1, 2],
    [-1, -2],
    [0.1, 0.2]
])   # i.e. shape (3, 2)?

What am I missing here?

Hi @Michal_Kurkowski ,

I'd like to attempt an explanation of your question with a hypothetical exercise, to help build intuition on this matter.

Disclaimer: I will skip and oversimplify many steps here.

You already learned about Forward Propagation in week 1 of this course. This is one of the 2 fundamental components of the learning process in Neural Networks. You saw that, for each layer, you have to follow 2 steps:

  1. Calculate a linear function Z = w @ A.T + b
  2. Calculate a non-linear function A' = f(Z)

So the linear function w @ A.T + b will happen over and over again, from layer 1 to layer L (the last layer).
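To make those two steps concrete, here is a minimal NumPy sketch for a single layer, following the Z = w @ A.T + b form used here; all sizes (5 samples, 3 features, 4 units) and the choice of sigmoid are just illustrative assumptions:

import numpy as np

def f(z):                       # a non-linear activation; sigmoid chosen for illustration
    return 1 / (1 + np.exp(-z))

A = np.random.randn(5, 3)       # hypothetical: 5 samples as rows, 3 features as columns
w = np.random.randn(4, 3)       # 4 units, each with 3 weights
b = np.zeros((4, 1))

Z = w @ A.T + b                 # step 1: linear function, shape (4, 5)
A_next = f(Z)                   # step 2: non-linear function, shape (4, 5)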

Another important concept to keep in mind: in forward prop, when we feed the data to layer 1, our goal is to feed all the features of a data sample to all the units of this first layer, so that each unit learns and decides what to represent from all the features. And this goal of "feeding all features to all units" is repeated layer after layer.

OK, let's get started with this hypothetical exercise:

In layer 1 you will do, as described above, Z = w @ A.T + b. In this case of layer 1, A is actually the training dataset X, where we have all the samples, each with its features. We will say that X = A0.

OK? So, let's move on:

Ignoring the convention and definitions on how X should be formatted, let's say we do as you say: start with X "already transposed"… that is, with the features as rows and the samples as columns.

In this hypothetical case, we could go straight to Z1 = w1 @ A0 + b1, avoiding the transpose operation on A0, right? Now we have Z1, and we apply the non-linear function to get A1. By the way, in this first step we reached our goal of hitting all units with all the features of X.
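In NumPy terms, this hypothetical layer-1 step could look like the sketch below (all sizes are made up: 3 features, 100 samples, 4 units in layer 1):

import numpy as np

A0 = np.random.randn(3, 100)    # X "already transposed": 3 features as rows, 100 samples as columns
w1 = np.random.randn(4, 3)      # layer 1: 4 units, 3 incoming features
b1 = np.zeros((4, 1))

Z1 = w1 @ A0 + b1               # no transpose needed in this first step, shape (4, 100)
A1 = 1 / (1 + np.exp(-Z1))      # non-linear function (sigmoid), shape (4, 100)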

Next we move to layer 2. The parameter w2 of layer 2 is of shape (layer2.number_of_units, layer1.number_of_units).

Now we feed A1, which we calculated in the previous step, to layer 2, which happens to have a different number of units than layer 1 (although it could have the same number of units too).

In layer 2 we again have to do a linear operation, Z2 = w2 @ A1 + b2 (note that I omitted the transposition), using the parameter w2 of layer 2. But wait… now A1 is not in a shape that can be multiplied with w2, because layer 2 had more units, so w2 has a different shape! What can we do? We'll need a transpose to compute Z2, so we go back to Z2 = w2 @ A1.T + b2.

If layer 2 had the same number of units as layer 1, then the matmul might have worked, BUT we would have failed in our goal of hitting all layer 2 units with all layer 1 "features"; instead, we would be feeding all layer 2 units with all layer 1 "samples", and the result would be chaotic.

Conclusion:

You can see how starting with an X "already transposed" only avoids the very first transposition: as you advance through the neural network, and actually in the very next layer, you'll need the transposition anyway, not only to make the matmul possible but also to achieve the goal of bringing "all features to all units".

Going back to "ignoring the convention and definitions on how X should be formatted": it is established (and, in my head, logically established) that X, which is the data, should have each sample as a row, and each row should contain the features as columns. I guess everything could have been defined differently from the very beginning of AI times, but it seems logical to have samples as rows and features as columns.
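For example, with the same numbers you used for A at the top of the thread, the conventional layout would be:

import numpy as np

# One sample per row, one feature per column
X = np.array([[1.0, 2.0],
              [-1.0, -2.0],
              [0.1, 0.2]])      # shape (3, 2): 3 samples, 2 features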

What do you think of all this?

Juan

Hi @Juan_Olano, thanks!

I still need some time to figure a few things out. I've recalculated everything on paper. I agree that "X, which is the data, should have each sample as a row, and each row should contain the features as columns". However, I still have some doubts.

Let's say that:

  • X (the data) has 100 samples with 3 features. It should be written as a 100x3 matrix (100 rows, 3 columns).
  • If we want to do vectorized calculations on this dataset for a layer with 4 units, then W should be a 3x4 matrix.
  • Intuition tells me the correct way to calculate Z is: Z = X @ W + B (Z is a 100x4 matrix); see the sketch below.
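Here is a quick NumPy shape check of what I mean (random placeholder values):

import numpy as np

X = np.random.randn(100, 3)     # 100 samples, 3 features
W = np.random.randn(3, 4)       # layer with 4 units, one column of weights per unit
B = np.zeros((1, 4))

Z = X @ W + B                   # (100, 3) @ (3, 4) -> (100, 4), one row per sample
print(Z.shape)                  # (100, 4)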

I don't understand why we should transpose the product of the above equation for the next layer. If we redo the same computations in the next layer with g(Z), everything looks OK to me.

My confusion also comes from slide 66 of week 1 (uploaded below), where we do the calculations as described above.

The difference comes from the way A (or X) is defined between those slides.

Thanks for the reply. Actually, you have some things to review:

If the input is 100x3 and the layer has 4 units, then W is (4x3), where the 4 rows are one per unit and the 3 columns are one per incoming feature, not (3x4) as you indicate.

Another item to review:

Here the order of the operands is wrong. The formula is Z = W @ X.T + b, where the operation between W and X.T is a dot product. The dot product is not commutative, so in your case you have X @ W, while it should be W @ X.T.
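As a shape check under the convention I am describing (placeholder values, with W of shape (4, 3) as above):

import numpy as np

X = np.random.randn(100, 3)     # 100 samples, 3 features
W = np.random.randn(4, 3)       # 4 units as rows, 3 incoming features as columns
b = np.zeros((4, 1))

Z = W @ X.T + b                 # (4, 3) @ (3, 100) -> (4, 100)
# With this W, X @ W would not even run: (100, 3) @ (4, 3) has mismatched inner dimensions.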

Finally, it is important to reiterate that the operation between W and X is a dot product and not an element-wise multiplication.

Please check these 3 concepts and let me know what you think.

Also, I'd like to suggest revisiting this video:

Course 2 - Week 1 - Forward prop video

Thanks,

Juan

Hi,

Thanks for the response :wink:

I'm confused here. That's how it was presented in the lectures: the weights for each unit were written in columns.

It's also how the lab "C2_W1_Lab02_CoffeeRoasting_TF" shows the weights for a layer with 3 units:

(...)
Dense(3, activation='sigmoid', name = 'layer1'),
W1, b1 = model.get_layer("layer1").get_weights()
W1(2, 3):
 [[ 0.08 -0.3   0.18]
 [-0.56 -0.15  0.89]] 

The equation Z = X @ W + B also comes from the lecture "How neural networks are implemented efficiently":

def dense(A_in, W, B):
    # A_in = X (from the slide uploaded in the previous post)
    Z = np.matmul(A_in, W) + B   # (m, n_in) @ (n_in, n_units) -> (m, n_units)
    A_out = g(Z)                 # g is the activation function (sigmoid in the lab)
    return A_out
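For instance, a hypothetical call with the CoffeeRoasting shapes above (W1 is (2, 3), so A_in has 2 features), assuming g is the sigmoid used in the lab:

import numpy as np

def g(z):                                  # sigmoid, as used in the lab
    return 1 / (1 + np.exp(-z))

A_in = np.array([[200.0, 13.9]])           # a made-up sample with 2 features
W1 = np.array([[0.08, -0.3, 0.18],
               [-0.56, -0.15, 0.89]])      # the (2, 3) weights printed above
B1 = np.zeros((1, 3))

A_out = dense(A_in, W1, B1)                # shape (1, 3): one activation per unit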

I’m going to revisit all the things you suggested! Thanks

@Michal_Kurkowski ,

I am reviewing the videos of the week you mention. I also find it confusing, and I am asking someone else to look into this. The shape of W seems wrong in the slides and, as per your comment, possibly in the lab too.

Regarding this, this seems OK. The 3 indicates that the layer has 3 units.

This would also be wrong. The equation is Z = W @ X.T + b, and the operation between W and X is a dot product, which cannot be commuted.

As shared, I am reviewing this with a Super Mentor and I'll get back to you shortly. In the meantime, if you review the link I sent you, you'll be able to see what I am telling you reflected in that video.

Thanks,

Juan

Michal, I have asked @rmwkwok to help us here. As soon as he can, he will surely enlighten us. Thank you for your patience.

Juan

Hello @Michal_Kurkowski!

I watched the optional video on vectorization, and Andrew didn't say that A comes from a previous lecture, so your assumption that A represents a dataset is not quite supported. This slide is only about matrix multiplication, and the A is just any A. Therefore, the slide has no problem.

Now let's look back: if we consider W to be the weights of the first layer of a NN, then that layer has 4 neurons and is expecting an input of 2 features. That makes the shape of W (2, 4). So, if I were to create an input dataset of 3 samples, it would have a shape of (3, 2).
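In NumPy terms (with random placeholder values), that would be:

import numpy as np

W = np.random.randn(2, 4)       # first layer: expects 2 input features, has 4 neurons
A_in = np.random.randn(3, 2)    # a dataset of 3 samples, each with 2 features

Z = np.matmul(A_in, W)          # (3, 2) @ (2, 4) -> (3, 4): one row per sample, one column per neuron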

@Michal_Kurkowski, I think if we do not regard the A on the slide as a dataset, then all doubts would be cleared. Agree? I want to make sure this point is clear first.

I have also read @Juan_Olano’s reply. I think the key here is that, for the W matrix of the l-th layer,

  • it has the shape of (# neurons in layer l-1, # neurons in layer l) in the Machine Learning Specialization, however,
  • it has the shape of (# neurons in layer l, # neurons in layer l-1) in the Deep Learning Specialization.

So I think Juan's reply is based on the DLS notation. However, @Michal_Kurkowski, please stick with the MLS notation for the time you are here, and if you take the DLS after finishing the MLS, you may want to be aware of this difference.
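To see the difference side by side, here is a small sketch (a hypothetical layer with 3 inputs, 4 neurons, and 100 samples):

import numpy as np

m, n_in, n_out = 100, 3, 4

# MLS convention: W has shape (n_in, n_out), data keeps samples as rows
X = np.random.randn(m, n_in)
W_mls = np.random.randn(n_in, n_out)
Z_mls = X @ W_mls                    # (100, 3) @ (3, 4) -> (100, 4)

# DLS convention: W has shape (n_out, n_in), data is used with features as rows
W_dls = np.random.randn(n_out, n_in)
Z_dls = W_dls @ X.T                  # (4, 3) @ (3, 100) -> (4, 100)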

Cheers,
Raymond


Hi,

Thanks for the explanation. Everything is clear now :wink: