I came here to ask the same question, so I’m glad it’s been asked already. In fact, I found that the same question was also asked in 2022: Weight matrix dimension in TensorFlow. That has part of the answer:
The key message is that there’s apparently a performance and computational accuracy benefit to storing W in transposed form and then doing the transposed matrix dot product.
I’m just as novice as anyone else here, but let me attempt to offer some thoughts about why that might be the case.
Firstly, from the point of view of the lectures, given a single input x with M features, and a single dense output layer with K outputs (i.e. K neurons in this single layer), ignoring the bias and activation function, we have the weights matrix W and the dot product as follows:
I’ve highlighted one row in the W matrix, the single column of the x vector, and the resultant output cell in the activation vector. One important feature, as I see it, is that the highlighted row and column both represent the input features to this layer, so they must have the same length. Additionally, the orientation of that highlighted row and column is always this way, because that’s how matrix dot products work. For example, you can’t just put x before W and flip the orientation.
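To make those shapes concrete, here is a minimal NumPy sketch of the single-sample case (the sizes M = 4 and K = 3 are just made up for illustration):

```python
import numpy as np

M, K = 4, 3                 # M input features, K neurons in this single layer
W = np.random.randn(K, M)   # lecture-style weights: one row per neuron
x = np.random.randn(M)      # a single sample's M feature values

a = W @ x                   # (K, M) @ (M,) -> (K,): one output per neuron
print(a.shape)              # (3,)
```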
When we do a prediction, say, using only a single data sample, then we just have that list of features and nothing else. But, importantly, those features are all equally part of the one data sample.
Now consider a real data set. By convention of data science, rows are the samples and columns are the features for each sample:
As mentioned in the lecture, TensorFlow was designed for large data sets. In particular, it is usually run in “batch mode”, calculating the results across multiple samples in parallel. If we take the same a = W \cdot x + b pattern and lay that out, we get a dot product between two matrices, and an activation matrix as the result: A = W \cdot X^T + b, as follows:
Notice that we’ve got very wide X^T and A matrices, with one column per sample for all N samples, which might be in the thousands or more.
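As a rough NumPy sketch of that batched form (same made-up sizes as above, plus N samples), note that X has to be transposed before the product can even be taken:

```python
import numpy as np

N, M, K = 10_000, 4, 3        # N samples, M features, K neurons
X = np.random.randn(N, M)     # dataset in the usual rows-are-samples layout
W = np.random.randn(K, M)
b = np.random.randn(K, 1)

A = W @ X.T + b               # (K, M) @ (M, N) -> (K, N); bias broadcast across the N columns
print(A.shape)                # (3, 10000) -- one column per sample, very wide
```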
There are two problems that I see with that:
Problem #1: We have to transpose the original dataset X before use:
- Matrices are internally stored as a flattened 1D array, with the rows concatenated one after another (row-major order).
- To transpose a large matrix, you have to construct a brand new 1D array (think: memory consumption), and then step through the source array in a strided, back-and-forth order, copying values across to form the newly shaped rows (see the small NumPy illustration after this list).
- For small matrices this is fine, but for large datasets it’s a problem.
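To make the memory point concrete, here is a small NumPy illustration. `X.T` on its own is only a strided view, but the moment the transposed data is needed in contiguous (row-major) form for a downstream kernel, a full copy gets made:

```python
import numpy as np

X = np.random.randn(10_000, 4)      # stored row-major: each sample's features sit together in memory

X_T = X.T                           # no copy yet, just swapped strides
print(X_T.flags['C_CONTIGUOUS'])    # False -- each "row" of X.T hops through memory

X_T_packed = np.ascontiguousarray(X.T)   # the actual transpose: a brand new buffer, filled via strided reads
print(X_T_packed.flags['C_CONTIGUOUS'])  # True, but we've just duplicated the whole dataset
```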
Problem #2: Batching splits on the wrong axis
- For large data sets, the input rows are typically split up into batches of, say, 32 rows at a time. That lets us make full use of the GPU’s internal capacity, and also handle data sets that are larger than would fit in GPU memory.
- That batching obviously works on the rows in the data representation.
- If the X matrix has been transposed, then we’ll be batching across the rows of the transposed matrix, which means splitting it into 32 features per batch while still carrying all N samples, e.g. a (32 × 10000) matrix. Never mind the fact that this won’t fit on the GPU; we’ve basically corrupted our data, because no batch contains a complete sample. The alternative would be to batch against the original X and then copy each batch one at a time, but that’s still just a mess of extra computation (see the sketch after this list).
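Here is a quick NumPy sketch of that wrong-axis batching, again with made-up sizes:

```python
import numpy as np

N, M = 10_000, 64                    # N samples, M features
X = np.random.randn(N, M)

batch = X[:32]                       # batching the original X: 32 complete samples, shape (32, 64)

X_T = X.T                            # transposed layout: (64, 10000)
bad_batch = X_T[:32]                 # "32 rows" is now 32 features spread across all 10,000 samples
print(batch.shape, bad_batch.shape)  # (32, 64) (32, 10000)
```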
The solution:
- Keep X as is, and transpose W instead.
- If we always operate on W in the transposed form, then we never need to transpose it back and forth, so we eliminate that problem altogether.
Now, we basically transpose the entire equation, resulting in A = X \cdot W^T + b (with the bias broadcast across the rows, so A ends up with one row per sample and one column per neuron), and we always operate on the weights in their transposed format. This gives the following:
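In NumPy terms (same made-up sizes as before), keeping a W_T of shape (M, K) around instead of W:

```python
import numpy as np

N, M, K = 10_000, 4, 3
X = np.random.randn(N, M)       # untouched: rows are still samples
W_T = np.random.randn(M, K)     # the weights, stored in transposed form once and for all
b = np.random.randn(K)

A = X @ W_T + b                 # (N, M) @ (M, K) -> (N, K); bias broadcasts across the rows
print(A.shape)                  # (10000, 3): rows are samples again, columns are neurons
```

As far as I can tell, this matches what Keras actually does: a `Dense` layer’s `kernel` is stored with shape `(input_dim, units)`, i.e. the already-transposed W, and the forward pass is effectively `inputs @ kernel + bias`.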
To highlight how much better this is, consider how we can split out the data when there’s too much to fit into the GPU in a single go. Firstly, we can easily batch the data set 32 rows at a time. Secondly, because each neuron in a layer is independent of the others, we can calculate them in parallel - or in batches. So if we’ve got too many neurons in the layer, we can split them out into batches too. Thus we can split this operation as follows, without causing problems:
(I’m not sure if splitting out the neurons in blocks like that is actually done though)
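For what it’s worth, here is a toy NumPy sketch of that splitting idea: batch the rows of X, and within each batch compute blocks of neurons independently, then stitch the pieces back together. (Whether real frameworks tile the neuron axis exactly like this, I don’t know; the sizes are made up.)

```python
import numpy as np

N, M, K = 10_000, 64, 128
X = np.random.randn(N, M)          # samples stay in their natural row layout
W_T = np.random.randn(M, K)        # weights kept permanently in transposed form
b = np.random.randn(K)

batch_size, block_size = 32, 16
A = np.empty((N, K))

for i in range(0, N, batch_size):              # split the samples 32 rows at a time
    X_batch = X[i:i + batch_size]              # (32, M): whole samples, nothing corrupted
    for j in range(0, K, block_size):          # independently, split the neurons into blocks
        W_block = W_T[:, j:j + block_size]     # (M, 16): each column is one neuron's weights
        A[i:i + batch_size, j:j + block_size] = X_batch @ W_block + b[j:j + block_size]

# Same result as doing it all in one go:
assert np.allclose(A, X @ W_T + b)
print(A.shape)                     # (10000, 128)
```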