Constant confusion about the transpose


In a layer with m units and n input values, we have a matrix w with dimensions (m, n). For example, with 3 input values x and 4 units, the matrix w for this layer has 4 rows and 3 columns.

Now, when we multiply this matrix by the vector of x values, we need to transpose the matrix w so that the dimensions match the x vector. (Strictly speaking, we are transposing the individual elements w_i, one per unit i, which are vectors. But I think you know what I mean.)

My first problem is that it seems to be inconsistent how this is organised. IIRC, in the MLS it was just the other way round. Instead of transposing w, we transposed the x vector.

However, it would be more readable and understandable if we could avoid the transpose operation altogether. So why don’t we organise our w matrix the other way round right from the start? Am I overlooking something?

Best regards

Hello @Matthias_Kleine,

Let's check the DLS notation first. If you look at the Standard notations for Deep Learning.pdf, downloadable in this post, you will find that X and W are defined as:

$X \in \mathbb{R}^{n_x \times m}$, where $n_x$ is the input size and $m$ is the number of examples.
$W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$, where $n^{[l]}$ is the number of units in layer $l$.

So when we multiply them together, we only need $W^{[1]}X$, without any transpose.
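To make the shapes concrete, here is a minimal NumPy sketch of that product. The sizes (3 inputs, 4 units, 5 examples) and the variable names `W1`, `X`, `b1` are only assumptions for illustration, not taken from the course code.

```python
import numpy as np

n_x, m = 3, 5   # 3 input features, 5 training examples (made-up sizes)
n_1 = 4         # 4 units in layer 1

rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))      # one example per column
W1 = rng.standard_normal((n_1, n_x))   # one unit per row
b1 = np.zeros((n_1, 1))

# (4, 3) @ (3, 5) -> (4, 5): the inner dimensions already match,
# so no transpose is needed.
Z1 = W1 @ X + b1
print(Z1.shape)

# Row i of W1 is the weight vector of unit i, so entry (i, j) of Z1
# is the dot product of that row with example j (b1 is zero here).
i, j = 2, 1
assert np.isclose(Z1[i, j], W1[i] @ X[:, j])
```

With samples stored as rows instead, the same numbers come out of `X.T @ W1.T`, just transposed, so both layouts are equivalent as long as one is used consistently.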

You can also see this in the video below



What probably confuses me is that when I use Pandas, the individual “data points” are usually organized as rows, and the features are the columns.

For example, in the Titanic data set, each person is a row, and attributes like “sex”, “age” and so on are the columns.

But Andrew organises this just the other way round for the training samples:


Is there any special reason for this?

(The screenshot that you gave above must be from one of the later videos, which I haven’t watched yet … could you add the video link?)

Best regards

Hello @Matthias_Kleine,

Here is the link. It’s in Course 1 Week 4.

Yes, it’s very common to have rows for samples and columns for features. To adapt that kind of data to the DLS, my suggestion is to transpose your X once, right before the DLS-related code starts. Since I wasn’t part of the discussion that decided on the notation, I can’t explain the reason for it. However, it is a valid notation and, more importantly, it is the default notation used consistently in the DLS.
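As a rough sketch of that one-time transpose: the array below stands in for rows-as-samples data (for a pandas DataFrame `df`, `df.to_numpy()` returns an array in that layout). The values and the names `X_rows`, `y` are made up for illustration.

```python
import numpy as np

# Made-up data: 3 passengers (rows) x 2 features (columns: age, fare)
X_rows = np.array([[22.0, 7.25],
                   [38.0, 71.28],
                   [26.0, 7.92]])
y = np.array([0, 1, 1])   # one label per passenger

# Transpose once so that each column is one example, as in the DLS notation.
X = X_rows.T              # shape (n_x, m) = (2, 3)
Y = y.reshape(1, -1)      # shape (1, m) = (1, 3)
print(X.shape, Y.shape)
```

Everything after this point can then follow the course formulas unchanged.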


PS: Sounds like you are already playing with some data. Have fun with that :wink:

As Raymond says, all these decisions are just that: decisions. You can make them in different ways and, of course, lots of consequences flow from those decisions. Prof Ng is the boss here, so he gets to make the decisions and we just have to pay attention and understand the way he is defining everything.

When you finish the classes and start to do things on your own, then you can make your own decisions. But note that you will probably also be using packages and frameworks like Pandas, TensorFlow, PyTorch etc., so you have to understand the definitions of their various APIs. When you get to dealing with images, then in addition to the “samples” dimension, you also have to deal with the position of the “channels” dimension. TF, for example, supports either “channels first” or “channels last” mode. Not sure whether PyTorch also allows the flexibility. If you stick around through DLS C4 to learn about ConvNets (highly recommended!), you’ll see that there Prof Ng switches to using “samples first” and “channels last” orientation, whereas he chooses “samples last” here in C1 and C2 when we are dealing with Feed Forward Networks.
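To illustrate what the two image orientations look like as array shapes, here is a small NumPy-only sketch. The batch size and image size are made up, and this only shows the axis reordering, not any framework's own API.

```python
import numpy as np

# "Samples first, channels last": (samples, height, width, channels)
batch_channels_last = np.zeros((8, 32, 32, 3))

# Move the channels axis to position 1 to get "channels first":
# (samples, channels, height, width)
batch_channels_first = np.transpose(batch_channels_last, (0, 3, 1, 2))
print(batch_channels_first.shape)

# The inverse permutation restores the original layout.
restored = np.transpose(batch_channels_first, (0, 2, 3, 1))
assert restored.shape == batch_channels_last.shape
```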