Matrix multiplication on coffee roast optional lab

When I was doing this lab I noticed that TensorFlow stores the weight matrix W as (number of input features, number of units in the layer). So in this lab the W1 matrix is a (2, 3) matrix.
The thing that confuses me is that in order to get an output vector from layer 1, you need to end up with a 1 x 3 or 3 x 1 vector. If we take into account that the formula for a single neuron in a layer is w . x + b, there is no possible value of x that gives a 3 x 1 or 1 x 3 vector as a result, because an m x n matrix multiplied by an n x p matrix gives an m x p result, so m would be 2 here.
I guess this is TensorFlow notation and the matrix multiplication is done in a different way, but it would be nice to have that one sorted out.


You may have to use a transposition.

This is not unusual in machine learning, because there is no universal standard for how the data is oriented in the X matrix.
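To make the transposition point concrete, here's a small NumPy sketch using the lab's (2, 3) W1 shape (the input values are made up). With a sample laid out as a (1, 2) row vector, x @ W1 works out, whereas W1 @ x would not:

```python
import numpy as np

# Shapes matching the coffee-roasting lab: 2 input features, 3 units.
# The weight and bias values here are placeholders, not the lab's.
W1 = np.zeros((2, 3))          # TensorFlow stores the kernel as (n_in, n_units)
b1 = np.zeros((1, 3))
x = np.array([[200.0, 13.9]])  # one sample as a (1, 2) row vector

# TensorFlow computes x @ W1 (+ b1), not W1 @ x, so the shapes line up:
a1 = x @ W1 + b1
print(a1.shape)                # (1, 3): one output per unit
```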

I came here to ask the same question, so I’m glad it’s been asked already. In fact, I found that the same question was also asked in 2022: Weight matrix dimension in TensorFlow. That has part of the answer:

The key message is that there’s apparently a performance and computational accuracy benefit to storing W in transposed form and then doing the transposed matrix dot product.

I’m just as novice as anyone else here, but let me attempt to offer some thoughts about why that might be the case.

Firstly, from the point of view of the lectures, given a single input x with M features, and a single dense output layer with K outputs (i.e. K neurons in this single layer), ignoring the bias and activation function, we have the weights matrix W and the dot product as follows:

I’ve highlighted one row in the W matrix, the single column in the vector, and the resultant output cell in the activation vector. One important feature as I see it is that the highlighted row and column represent the input features to this layer. They must have the same size as each other. Additionally, the orientation of that highlighted row and column are always this way - because that’s how matrix dot products work. For example, you can’t just put x before W and flip the orientation.
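As a quick sanity check of that lecture-style orientation, here's a NumPy sketch with K = 3 neurons and M = 2 features (arbitrary example values): W must come before x, and each output element is one row of W dotted with x.

```python
import numpy as np

K, M = 3, 2               # K neurons in the layer, M input features
W = np.ones((K, M))       # lecture convention: one row of weights per neuron
x = np.ones(M)            # a single sample with M features

a = W @ x                 # each a[k] is the dot product of row w_k with x
print(a.shape)            # (3,): one output per neuron
```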

When we do a prediction, say, using only a single data sample, then we just have that list of features and nothing else. But, importantly, those features are all equally part of the one data sample.

Now consider a real data set. By convention of data science, rows are the samples and columns are the features for each sample:

As mentioned in the lecture, TensorFlow was designed for large data sets. In particular, it is usually run in “batch mode”, calculating the results across multiple samples in parallel. If we take the same a = W \cdot x + b pattern and lay that out, we get a dot product between two matrices, and an activation matrix as the result: A = W \cdot X^T + b, as follows:

Notice that we’ve got very wide X and A matrices, for the N samples, which might be in the thousands or more.
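A shape-only sketch of that batched form (small made-up sizes in place of thousands of samples): with X holding samples as rows, the A = W \cdot X^T + b layout forces the transpose and produces the wide, samples-as-columns result.

```python
import numpy as np

N, M, K = 5, 2, 3                 # N samples, M features, K neurons
X = np.random.rand(N, M)          # data-science convention: rows are samples
W = np.random.rand(K, M)
b = np.random.rand(K, 1)          # one bias per neuron, broadcast over samples

A = W @ X.T + b                   # (K, M) @ (M, N) -> (K, N): a wide result
print(A.shape)                    # (3, 5): samples end up as columns
```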

There are two problems that I see with that:

Problem #1: Have to transpose original dataset X before use:

  • Matrices are internally stored as a flattened 1D array, with each row concatenated.
  • To transpose a large matrix, you have to construct a brand new 1D array (think: memory consumption), and then run through the source array in an awkward to-and-fro order, copying values over to form the newly shaped rows.
  • For small matrices this is fine, but for large datasets it’s a problem.
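NumPy makes the storage point above easy to see (a simplified illustration of the general row-major layout, not of TensorFlow's internals): a transpose is initially just a stride trick, but as soon as a kernel needs the transposed rows to be contiguous, a full reshuffled copy of the underlying 1D array has to be built.

```python
import numpy as np

X = np.arange(6).reshape(2, 3)   # stored row-major as [0, 1, 2, 3, 4, 5]

Xt_view = X.T                    # NumPy just swaps strides: no copy yet
print(Xt_view.flags['C_CONTIGUOUS'])   # False: rows are no longer contiguous

# To get contiguous rows (what a matmul kernel typically wants),
# the whole array must be copied into a reshuffled layout:
Xt_copy = np.ascontiguousarray(Xt_view)
print(Xt_copy.ravel())           # [0, 3, 1, 4, 2, 5]
```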

Problem #2: Batching splits on the wrong axis

  • For large data sets, the input rows are typically split up into batches of, say, 32 rows at a time. That enables us to maximise the use of the GPU’s internal capacity, but also to handle data sets that are larger than would fit in the GPU memory.
  • That batching obviously works on the rows in the data representation.
  • If the X matrix has been transposed, then we’ll be batching across the rows in the transposed matrix, which means that we’ll be splitting it out at 32 features per batch, while still representing all N samples, e.g. a (32 x 10000) matrix. Never mind the fact that this won’t fit in the GPU; we’ve basically corrupted our data. The alternative would be to batch against the original X and then copy it one batch at a time, but that’s still just a mess of extra computations.
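This is easy to demonstrate with slicing (example sizes chosen to mirror the (32 x 10000) case above): batching the untransposed X yields whole samples, while batching the transposed X yields slabs of features that still span every sample.

```python
import numpy as np

N, M, batch_size = 10_000, 64, 32
X = np.zeros((N, M))               # rows are samples, columns are features

# Batching the untransposed X slices off whole samples, as intended:
good_batch = X[:batch_size]        # (32, 64): 32 complete samples

# Batching after transposing slices along the feature axis instead:
bad_batch = X.T[:batch_size]       # (32, 10000): 32 features per "batch",
print(bad_batch.shape)             # each still spanning all 10,000 samples
```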

The solution:

  • Keep X as is, and transpose W instead.
  • If we always operate on W in the transposed form, then we never need to transpose it back and forward, so we just eliminate that problem altogether.

Now, we basically transpose the entire equation, resulting in A = X \cdot W^T, and we always operate on the weights in their transposed format. This gives the following:

To highlight how much better this is, consider how we can split out the data when there’s too much to fit into the GPU in a single go. Firstly, we can easily batch the data set 32 rows at a time. Secondly, because each neuron in a layer is independent of the others, we can calculate them in parallel - or in batches. So if we’ve got too many neurons in the layer, we can split them out into batches too. Thus we can split this operation as follows, without causing problems:

(I’m not sure if splitting out the neurons in blocks like that is actually done though)


Wonderful analysis, @malcolm.lett!

We can define any W or X as we like, as long as the maths works out. TensorFlow defines it this way, so there is no transpose at all:



  • m: number of samples
  • n, n’: number of input/output features

m, the number of samples, is always the leading dimension.
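In other words, with TensorFlow's (n, n’) kernel layout, the forward pass is just A = X @ W + b with m leading everywhere. A NumPy sketch with arbitrary small sizes:

```python
import numpy as np

m, n, n_out = 4, 2, 3            # m samples, n input features, n' units
X = np.random.rand(m, n)         # m is the leading dimension of the data
W = np.random.rand(n, n_out)     # TensorFlow kernel layout: (n, n')
b = np.random.rand(n_out)        # one bias per unit, broadcast over samples

A = X @ W + b                    # (m, n) @ (n, n') -> (m, n'): no transpose
print(A.shape)                   # m leads in the output too
```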