When I was doing this lab I noticed that TensorFlow expresses each weight matrix W with shape (number of features in input, number of units in the layer). So in this lab the W1 matrix is a (2,3) matrix.

The thing that confuses me is that in order to get an output vector from layer 1, you need to calculate a vector of 1 x 3 or 3 x 1. If we take into account that the formula for a single neuron in a layer is w . x + b, there is no possible shape of x that gives a 3 x 1 or 1 x 3 vector as a result, because an m x n matrix multiplied by an n x p matrix has an m x p result. So m would be 2 in this case.

I guess that it is a TensorFlow notation, and that the matrix multiplication is done in a different way. But it would be nice to have that one sorted out.

You may have to use a transposition.

This is not unusual in machine learning, because there is no universal standard for how the data is oriented in the X matrix.
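To make the shapes concrete, here is a minimal NumPy sketch. NumPy stands in for TensorFlow, and the values of `W1`, `b1` and `x` are made up; only the (2,3) shape of `W1` comes from the lab. It assumes the single sample is treated as a 1 x 2 row vector placed on the left of the product, which is what makes m equal to 1 rather than 2.

```python
import numpy as np

# Hypothetical values; only the shapes matter here.
W1 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])   # (2 input features, 3 units), as in the lab
b1 = np.array([0.1, 0.2, 0.3])     # one bias per unit, shape (3,)
x = np.array([[10.0, 20.0]])       # one sample as a row vector, shape (1, 2)

# Sample on the left: (1, 2) @ (2, 3) -> (1, 3)
z_row = x @ W1 + b1

# Lecture-style column form needs the transposes: (3, 2) @ (2, 1) -> (3, 1)
z_col = W1.T @ x.T + b1.reshape(-1, 1)

# Both forms hold the same three numbers, just oriented differently.
agree = np.allclose(z_row.T, z_col)
print(z_row.shape, z_col.shape, agree)
```

Either orientation works; the transposition just has to be applied consistently to W, x and the result.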

I came here to ask the same question, so I'm glad it's been asked already. In fact, I found that the same question was also asked in 2022: Weight matrix dimension in TensorFlow. That has part of the answer:

The key message is that there's apparently a performance and computational accuracy benefit to storing W in transposed form and then doing the transposed matrix dot product.

I'm just as much a novice as anyone else here, but let me attempt to offer some thoughts about why that might be the case.

Firstly, from the point of view of the lectures, given a single input x with M features, and a single dense output layer with K outputs (i.e. K neurons in this single layer), ignoring the bias and activation function, we have the weights matrix W and the dot product as follows:

I've highlighted one row in the W matrix, the single column in the vector, and the resultant output cell in the activation vector. One important feature as I see it is that the highlighted row and column represent the input features to this layer. They must have the same size as each other. Additionally, the orientation of that highlighted row and column is always this way - because that's how matrix dot products work. For example, you can't just put x before W and flip the orientation.
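A small NumPy sketch of that lecture-style product, for anyone who wants to check the shapes. The sizes `M = 4` and `K = 3` and the values in `W` are assumptions chosen only for illustration.

```python
import numpy as np

M, K = 4, 3                                       # hypothetical: 4 features, 3 neurons
W = np.arange(K * M, dtype=float).reshape(K, M)   # lecture convention: one row per neuron
x = np.ones((M, 1))                               # single sample as a column vector

a = W @ x                                         # (K, M) @ (M, 1) -> (K, 1)

# Swapping the order without transposing anything is rejected:
try:
    x @ W                                         # (M, 1) @ (K, M): inner sizes 1 and K differ
    swapped_ok = True
except ValueError:
    swapped_ok = False

print(a.ravel(), swapped_ok)
```

Each entry of `a` is the dot product of one row of `W` with the column `x`, which is why the row length and the column length must both equal M.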

When we do a prediction, say, using only a single data sample, then we just have that list of features and nothing else. But, importantly, those features are all equally part of the one data sample.

Now consider a real data set. By convention of data science, rows are the samples and columns are the features for each sample:

As mentioned in the lecture, TensorFlow was designed for large data sets. In particular, it is usually run in "batch mode", calculating the results across multiple samples in parallel. If we take the same a = W \cdot x + b pattern and lay that out, we get a dot product between two matrices, and an activation matrix as the result: A = W \cdot X^T + b, as follows:

Notice that we've got very *wide* X and A matrices, for the N samples, which might be in the thousands or more.

There are two problems that I see with that:

Problem #1: Have to transpose original dataset X before use:

- Matrices are internally stored as a flattened 1D array, with each row concatenated.
- To transpose a large matrix, you have to construct a brand new 1D array (think: memory consumption), and then run through the source array in a weird to-and-fro order, copying over values to form newly shaped rows.
- For small matrices this is fine, but for large datasets it's a problem.
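The storage point can be seen on a tiny array. This is a sketch in NumPy, used here only as a stand-in; it does not show how TensorFlow itself lays out or transposes tensors. NumPy's `.T` is a lazy view that moves no data, so the copy described above only happens once the transposed matrix is materialised in row-major order, which `np.ascontiguousarray` forces.

```python
import numpy as np

X = np.arange(6).reshape(2, 3)        # 2 rows, 3 columns, stored row by row
flat = X.ravel().tolist()             # [0, 1, 2, 3, 4, 5]

Xt_view = X.T                         # a view: same buffer, different strides
shares = np.shares_memory(X, Xt_view)

Xt_copy = np.ascontiguousarray(X.T)   # row-major storage of X.T needs a new buffer
copied = not np.shares_memory(X, Xt_copy)
flat_t = Xt_copy.ravel().tolist()     # [0, 3, 1, 4, 2, 5]

print(flat, flat_t, shares, copied)
```

The order `[0, 3, 1, 4, 2, 5]` is the to-and-fro walk through the source array: each new row takes one element from every old row.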

Problem #2: Batching splits on the wrong axis

- For large data sets, the input rows are typically split up into batches of, say, 32 rows at a time. That enables us to maximise the use of the GPU's internal capacity, but also to handle data sets that are larger than would fit in the GPU memory.
- That batching obviously works on the *rows* in the data representation.
- If the X matrix has been transposed, then we'll be batching across the rows in the transposed matrix, which means that we'll be splitting it out at 32 features per batch while still representing all N samples, e.g. a (32x10000) matrix. Never mind the fact that this won't fit in the GPU, because we've basically corrupted our data. The alternative would be to batch against the original X, and then copy one batch at a time, but it's still just a mess of extra computation.
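A quick shape check of that batching problem, sketched in NumPy. The sizes are assumptions (10000 samples, 50 features, batches of 32), picked to match the (32x10000) example above.

```python
import numpy as np

N, M, batch = 10000, 50, 32   # hypothetical: 10000 samples, 50 features
X = np.zeros((N, M))          # rows are samples, columns are features

good = X[:batch]              # first 32 samples, every feature
bad = X.T[:batch]             # first 32 features, every sample

print(good.shape, bad.shape)
```

Slicing the first 32 rows of the original X gives a usable mini-batch of shape (32, 50). The same slice on the transposed matrix gives (32, 10000), which is a subset of features rather than a subset of samples.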

The solution:

- Keep X as is, and transpose W instead.
- If we always operate on W in the transposed form, then we never need to transpose it back and forward, so we just eliminate that problem altogether.

Now, we basically transpose the entire equation, resulting in A = X \cdot W^T, and we always operate on the weights in their transposed format. This gives the following:
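As a sanity check, here is a minimal NumPy sketch showing that the two layouts give the same numbers, since (W X^T)^T = X W^T. The sizes, the random data and the name `W_stored` are assumptions for illustration; the bias is ignored, as in the earlier discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 4, 3                 # hypothetical: 5 samples, 4 features, 3 neurons
X = rng.normal(size=(N, M))       # samples as rows
W = rng.normal(size=(K, M))       # lecture orientation: one row per neuron
W_stored = W.T.copy()             # kept permanently in transposed form, shape (M, K)

A_lecture = W @ X.T               # (K, N): wide, one column per sample
A_rows = X @ W_stored             # (N, K): one row per sample, no transpose of X

match = np.allclose(A_rows, A_lecture.T)
print(A_rows.shape, match)
```

The transpose of W happens once, when the weights are created, instead of on every pass over the data.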

To highlight how much better this is, consider how we can split out the data when there's too much to fit into the GPU in a single go. Firstly, we can easily batch the data set 32 rows at a time. Secondly, because each neuron in a layer is independent of the others, we can calculate them in parallel - or in batches. So if we've got too many neurons in the layer, we can split them out into batches too. Thus we can split this operation as follows, without causing problems:
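The splitting idea can be sketched as a tiled product in NumPy. The helper `blocked_matmul` and its block sizes are hypothetical, written only to show that row batches and neuron blocks can be computed independently and still reproduce the full result; it is not a claim about what TensorFlow or a GPU actually does internally.

```python
import numpy as np

def blocked_matmul(X, Wt, row_batch=32, unit_block=16):
    """Compute X @ Wt tile by tile: row batches of samples, column blocks of neurons."""
    N, K = X.shape[0], Wt.shape[1]
    A = np.empty((N, K))
    for r in range(0, N, row_batch):          # one batch of samples at a time
        for c in range(0, K, unit_block):     # one block of neurons at a time
            A[r:r + row_batch, c:c + unit_block] = (
                X[r:r + row_batch] @ Wt[:, c:c + unit_block]
            )
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # 100 samples, 8 features
Wt = rng.normal(size=(8, 40))                 # transposed weights: 8 features, 40 neurons
A = blocked_matmul(X, Wt)
matches_full = np.allclose(A, X @ Wt)
print(A.shape, matches_full)
```

Each tile only needs its own rows of X and its own columns of the transposed weights, so no tile depends on another.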

(I'm not sure if splitting out the neurons in blocks like that is actually done though)

Wonderful analysis, @malcolm.lett!

We can define any W or X as we like, as long as the maths works out. TensorFlow defines it this way, so there will be no transpose at all:

where

- m: number of samples
- n, n': number of input/output features

m, the number of samples, is always the leading dimension.
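A minimal NumPy sketch of that shape convention, chained through two layers. The sizes and values are made up, the second layer `W2` is hypothetical, and the bias is assumed to be one value per output unit, broadcast across the m rows.

```python
import numpy as np

m, n, n_out = 5, 2, 3             # hypothetical: 5 samples, 2 inputs, 3 units
X = np.ones((m, n))               # samples along the leading dimension
W = np.full((n, n_out), 0.5)      # (input features, units), the shape seen in the lab
b = np.zeros(n_out)               # broadcast across the m rows

Z1 = X @ W + b                    # (m, n) @ (n, n') -> (m, n')
W2 = np.ones((n_out, 1))          # a second layer with a single unit
Z2 = Z1 @ W2                      # (m, n') @ (n', 1) -> (m, 1)

print(Z1.shape, Z2.shape)
```

Because m stays in front at every step, the output of one layer can be fed straight into the next without any transposes.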

Cheers,

Raymond