Why does Andrew always transpose before matrix multiplication?

Hello everyone, did I misunderstand something, or does Andrew always transpose before carrying out the matrix multiplication? In the video “matrix multiplication” he does it also for the multiplication of two 2x2 matrices.

I find it very confusing, since you then don’t get the same result as if you hadn’t transposed the two 2x2 matrices. I get the same result as an online matrix calculator, so I’m quite confused now.

Hello @Rickard, the rule of thumb to remember about matrix multiplication is this: we can only multiply a matrix A of size (a, b) with another matrix of size (b, c). Here a, b, and c can be any numbers, but the constraint is that the same b appears in both sizes, as shown. When you are not sure whether you need to transpose, write down the sizes of both matrices and check whether they have a matching b in the right places; if not, you will need to transpose one or both matrices.
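To make the size check concrete, here is a minimal sketch in plain Python. The helper name `product_shape` is made up for illustration; it only looks at the sizes and never touches the matrix entries.

```python
def product_shape(left, right):
    """Return the size of (left matrix)(right matrix), or None if the sizes do not line up.

    left and right are (rows, columns) tuples, e.g. (a, b) and (b, c).
    """
    rows_left, cols_left = left
    rows_right, cols_right = right
    if cols_left != rows_right:  # the shared inner size "b" must match
        return None
    return (rows_left, cols_right)


print(product_shape((3, 1), (3, 4)))  # None: 1 != 3, so something must be transposed
print(product_shape((1, 3), (3, 4)))  # (1, 4)
```

If the function returns `None`, that is the signal to try transposing one of the two matrices and check again.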

For example, suppose your feature vector a for one sample is a column matrix of size (n, 1), and your neural network layer’s weight matrix W is (n, k), representing k neurons. To multiply a with W, you need to transpose a to get a^T, which has the size (1, n); then you can multiply it with W, which gives a^TW. However, if you want to switch their positions, you will need to transpose W instead, so that you compute W^Ta, because then their sizes are (k, n) and (n, 1) respectively, which again have that matching n in the middle.
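Here is a small sketch of both orderings, assuming n = 3 features and k = 2 neurons with made-up numbers. The helpers `transpose` and `matmul` are written by hand on nested lists just for this illustration; in practice you would use a library such as NumPy.

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def matmul(X, Y):
    """Multiply X (a, b) by Y (b, c); each entry is a row of X dotted with a column of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


a = [[1], [2], [3]]           # size (n, 1) = (3, 1): one sample as a column
W = [[1, 4], [2, 5], [3, 6]]  # size (n, k) = (3, 2): one column per neuron

row_form = matmul(transpose(a), W)  # a^T W: (1, 3) times (3, 2) gives (1, 2)
col_form = matmul(transpose(W), a)  # W^T a: (2, 3) times (3, 1) gives (2, 1)

print(row_form)  # [[14, 32]]
print(col_form)  # [[14], [32]]
```

Both orderings contain the same two numbers; one result is simply the transpose of the other.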

The rule of thumb should sweep away all uncertainty about whether 2 matrices can be multiplied together.

If you have a counterexample that confuses you, please feel free to share it here.



Hey, thank you for the answer! So the first matrix should have the same number of columns as the second has rows. It’s very logical when I sketch it with pen and paper. :slight_smile:

However, it’s this example that’s confusing me:

It’s from the video called matrix multiplication. Both matrices are 2x2 matrices, so why does Andrew transpose the first matrix?

I see.

The rule that M_{(a,b)}N_{(b,c)} works is purely mathematical. (Note that I have put the sizes as subscripts of the matrices.) Indeed, the rule is not sufficient on its own when it comes to square matrices, because there the sizes match whether or not you transpose.

In ML, when we multiply the sample matrix and the weight matrix, we need to figure out which of the numbers in the sizes represent the number of samples m, the number of features n, and the number of neurons k.

The sample matrix A holds m samples and n features. In the slide, inferring that one rectangle is one sample, it should be A_{(n, m)}, that is, m columns of samples and n rows of features. The weight matrix holds k neurons and n features and, again based on one rectangle being one neuron, it should be W_{(n, k)}. Now we can go back to the rule: if we do A_{(n,m)}W_{(n,k)}, this won’t work because the two middle sizes do not match (in meaning); however, (A^T)_{(m,n)}W_{(n,k)} works because the two middle sizes match both in meaning and in value.
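For the square case, here is a sketch with made-up 2x2 numbers (not the ones from the slide), assuming the layout described above: each column of A is one sample and each column of W is one neuron. The `transpose` and `matmul` helpers are hand-written for illustration only.

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def matmul(X, Y):
    """Multiply X (a, b) by Y (b, c); each entry is a row of X dotted with a column of Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


A = [[1, 2], [3, 4]]  # size (n, m) = (2, 2): columns are the samples (1, 3) and (2, 4)
W = [[5, 6], [7, 8]]  # size (n, k) = (2, 2): columns are the neurons (5, 7) and (6, 8)

with_transpose = matmul(transpose(A), W)  # (m, n) times (n, k): samples dotted with neurons
without_transpose = matmul(A, W)          # sizes also match, but rows of A are not samples

# Sample 1 dotted with neuron 1, computed directly from the columns.
sample_1 = [row[0] for row in A]
neuron_1 = [row[0] for row in W]
direct = sum(s * w for s, w in zip(sample_1, neuron_1))

print(with_transpose)     # [[26, 30], [38, 44]]
print(without_transpose)  # [[19, 22], [43, 50]]
print(direct)             # 26
```

Both products are valid as pure mathematics, which is why an online calculator happily computes the one without the transpose. Only the transposed version puts "sample dotted with neuron" in each entry, as the top-left 26 shows.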