Possible Discrepancy in Markdown: Assignment

Hey Guys,
In the markdown for the section “3.2 Dense Class”, it is mentioned

The number of rows in the weight matrix should equal the number of columns in the variable x. Since x may have 2 dimensions if it represents a single training example (row, col), or three dimensions (batch_size, row, col), get the last dimension from the tuple that holds the dimensions of x.

If we are implementing a Dense layer, how could x have 3 dimensions? Aren’t we supposed to flatten our samples before passing them into a Dense layer? So, if we are not following the convention of reshaping a single sample to include the batch size (like the one followed in TensorFlow), shouldn’t the input’s dimensions be restricted to 1 or 2?

I found this thread suggesting the same, so is it a typo here?

Cheers,
Elemento

Hi @Elemento

Correction, with the Dense layer in mind:
It should perform a batch matrix-matrix product: the weight matrix is broadcast over the leading (batch) dimension and multiplied with the last two dimensions stored in Tensor1. For example, if Tensor1 is of shape (b×n×m) and W is of shape (m×p), then the output should be a tensor of shape (b×n×p).
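
A minimal numpy sketch of that shape rule (the shapes here are made up purely for illustration):

import numpy as np

x = np.ones((4, 3, 5))   # Tensor1: (b, n, m) = (4, 3, 5)
W = np.ones((5, 2))      # weights: (m, p) = (5, 2)

y = np.dot(x, W)         # W is applied to the last dimension of x
print(y.shape)           # (4, 3, 2), i.e. (b, n, p)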

In NLP, the first dimension is often the batch, the second is the sequence (of words or subwords) and the third is the feature dimension (the “embedding”). Usually the concept of “padding” is needed for the sequence dimension so that the calculations are possible.
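
For instance, a minimal sketch of padding, assuming made-up token ids and 0 as the padding id:

import numpy as np

sequences = [[12, 5, 9], [7, 3], [42, 8, 1, 6]]   # token ids of different lengths
max_len = max(len(s) for s in sequences)

# pad with 0s so the whole batch becomes one rectangular array
padded = np.array([s + [0] * (max_len - len(s)) for s in sequences])
print(padded.shape)   # (3, 4) -> (batch_size, sequence_length)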

Cheers

Hey @arvyzukai,
I am a little confused. We are supposed to perform the operation xW. Here, Tensor2 is the weight matrix of the dense layer, which, as per TensorFlow’s or PyTorch’s implementation, is always a 2D matrix. Even in the markdown, it has been mentioned that

The second parameter is a tuple for the desired shape of the weights (num_rows, num_cols)

So, how can Tensor2 have 3 dimensions, and not 2? Now comes the second query. In a dense layer, each of the neurons is supposed to be influenced by each of the inputs, which necessitates that each neuron’s weight vector takes a dot product with the entire sample. How can that happen in the matrix multiplication that you are proposing?

In fact, all the text pre-processing schemes that we have discussed so far aim to encode sentences into a vector and not into a matrix, for instance, by averaging the word vectors or by concatenating them. Just to make sure, I am not including sequential models in this discussion, just the standard neural networks that we have discussed in Week 1 so far.

Cheers,
Elemento

Hey @Elemento

We are supposed to perform the operation xW.

We are performing this operation on the last two dimensions (row, col). In other words, when we do matrix multiplication of 3D tensors, we do multiple multiplications of 2D matrices, i.e. dot products between their row/column vectors.

I’m rusty on TensorFlow, but I’m pretty sure this is its default behaviour. PyTorch uses torch.bmm — PyTorch 2.1 documentation.

In this course, Trax uses JAX’s jax.numpy.dot (jnp.dot), which uses this behaviour by default.

You can try the calculations yourself with numpy.matmul:

import numpy as np

A = np.random.randint(0, 10, size=(3, 3, 2))   # a batch of 3 matrices, each (3, 2)
B = np.random.randint(0, 10, size=(3, 2, 4))   # a batch of 3 matrices, each (2, 4)

C = np.matmul(A, B)  # batched matrix product, results in shape (3, 3, 4)

In fact, all the text pre-processing schemes that we have discussed so far aim to encode sentences into a vector and not into a matrix, for instance, by averaging the word vectors or by concatenating them.

Actually, concatenating would not reduce the dimensionality (it would produce a very long vector, or you could chop that vector into a matrix or something else, but it would not lose information). On the other hand, averaging does reduce the dimensionality of the output and it loses information: you cannot go back to the original after you have averaged.
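
A small numpy illustration of that difference (the embedding values are random, only the shapes matter):

import numpy as np

embeddings = np.random.randn(4, 3)        # 4 word vectors of dimension 3

averaged = embeddings.mean(axis=0)        # shape (3,): dimensionality reduced, information lost
concatenated = embeddings.reshape(-1)     # shape (12,): just a different layout, nothing lost
print(averaged.shape, concatenated.shape)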

In NLP, and in other domains, working with 3D tensors is very common, and this operation is used throughout (later in) this course and everywhere else, so it is important to fully grasp what is going on.

I don’t think I need to get into the details just now if (I assume) you haven’t got to Course 4 on Attention models yet. For now, you can think of it as a similarity-score calculation (cosine similarity if the inputs are normalized) for batches of data (as I mentioned, the common input in NLP is (batch_size, sequence_length, feature_size)).
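
If it helps, here is a rough numpy sketch of that kind of batched similarity-score calculation (shapes made up, not the assignment’s code):

import numpy as np

x = np.random.randn(2, 5, 8)   # (batch_size, sequence_length, feature_size)

# normalize the feature vectors so the dot products become cosine similarities
x = x / np.linalg.norm(x, axis=-1, keepdims=True)

# batched matrix product: (2, 5, 8) @ (2, 8, 5) -> (2, 5, 5) similarity scores
scores = np.matmul(x, np.transpose(x, (0, 2, 1)))
print(scores.shape)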

Feel free to ask questions if you want me to elaborate.

Cheers

Hey @arvyzukai,
My query doesn’t revolve around how a batch matrix multiplication works. It revolves around whether the dense layer that we are implementing in this assignment is the same as the one that we use in a standard neural network or not.

For the Dense layer of a standard neural network, we always flatten the inputs, so that the input has only 2 dimensions, (batch-size, features), but are you saying that we can also feed 3-dimensional inputs (batch_size, feature1, feature2) to a dense layer? If so, then why, in image classification problems with standard neural networks, do we flatten our images and then feed them to the dense layers? Why not feed them as is?

Additionally, if we can feed 3-dimensional inputs to dense layers, then, will the weight matrix be 2-dimensional (as in a standard neural network), or will the weight matrix be 3-dimensional? If 3-dimensional, then what do the different dimensions represent? For the 2-dimensional weight matrix, the first dimension represented the number of input nodes and the second dimension the number of output nodes, so what representation is followed by a 3-dimensional weight matrix?

I completely agree with the above point, and that’s why we are using Sequence Models, aren’t we, so that we don’t have to concatenate or average the word embeddings of the input words?

Cheers,
Elemento

Hey @Elemento

My query doesn’t revolve around how a batch matrix multiplication works. It revolves around whether the dense layer that we are implementing in this assignment is the same as the one that we use in a standard neural network or not.

Yes - it should be able to handle 3D input. And it is not a typo - this layer has to know the “column” dimension. If you have implemented “Exercise 04” of C3_W1 correctly, I encourage you to try the cell right after it with:

z = np.array([
    [
        [2.0, 7.0, 25.0],
        [3.0, 5.0, 60.0]
    ],
    [
        [5.0, 8.0, 23.0],
        [1.0, 4.0, 10.0]
    ],
]) # input array of shape (2, 2, 3)

and see the result yourself.
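
And here is a minimal sketch of what the forward pass does with that input, assuming a made-up 2D weight matrix of shape (3, 2) (the assignment initializes the weights randomly):

import jax.numpy as jnp

z = jnp.array([[[2.0, 7.0, 25.0],
                [3.0, 5.0, 60.0]],
               [[5.0, 8.0, 23.0],
                [1.0, 4.0, 10.0]]])   # same input array, shape (2, 2, 3)

W = jnp.ones((3, 2))                  # made-up 2D weights: (input_features, n_units)

out = jnp.dot(z, W)                   # W acts only on the last dimension of z
print(out.shape)                      # (2, 2, 2): the batch dimension passes through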

For the Dense layer of a standard neural network, we always flatten the inputs, so that the input has only 2 dimensions, (batch-size, features), but are you saying that we can also feed 3-dimensional inputs (batch_size, feature1, feature2) to a dense layer?

I’m not sure what you mean by a “standard neural network”, but most probably you’re referring to fully connected feed-forward neural networks (CNNs, RNNs and others are not in this category). If this is the case, then the inputs are already flat: the features are usually a vector ([height, weight, distance, yes/no, etc.]) and a mini-batch of these data is a 2D tensor (matrix) of shape (batch-size, features). Maybe I’m stating the obvious, but again, depending on your needs, you have the ability to feed 3-dimensional inputs, but usually that is not (batch_size, feature1, feature2).

If so, then why in “image classification problems with standard neural networks”, we flatten our images and then feed them to the dense layers? Why not feed them as is?

With the exception of toy greyscale datasets like MNIST, we rarely classify images with fully connected feed-forward networks. How about RGB images? Do you flatten them too?
But if you include CNNs as a standard neural network, then I don’t agree regarding flattening: in CNNs we do not flatten images.

Additionally, if we can feed 3-dimensional inputs to dense layers, then, will the weight matrix be 2-dimensional (as in a standard neural network), or will the weight matrix be 3-dimensional?

The W weight matrix will be 2-dimensional (and the bias b 1-dimensional). If you are wondering what the point of this is, think vectorization / parallel computing (you don’t have to wait for other batch entries to be calculated; you can distribute each sentence to a different worker).
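
For instance, a quick PyTorch check (made-up sizes of 3 input features and 5 units):

import torch

layer = torch.nn.Linear(in_features=3, out_features=5)
print(layer.weight.shape)   # torch.Size([5, 3]): the weights stay 2-dimensional
print(layer.bias.shape)     # torch.Size([5]):    the bias stays 1-dimensional

x = torch.randn(2, 4, 3)    # (batch_size, sequence_length, features)
print(layer(x).shape)       # torch.Size([2, 4, 5])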

I completely agree with the above point, and that’s why we are using Sequence Models, aren’t we, so that we don’t have to concatenate or average the word embeddings of the input words?

Again - I would argue that we usually “concatenate” the outputs. (But it’s not a must; you can also average, take the last output, or other options if it helps with your metrics.) For example, each output of an RNN is stacked onto the matrix (a form of concatenation), but sometimes you can take the last state of the RNN and that would be your sequence representation.

Cheers

Hey @arvyzukai,
By standard neural networks, I am referring to fully-connected feed-forward neural networks only. Sorry for the confusion. If you don’t mind, can you please update your answer so that we don’t consider the possibility of CNNs, RNNs, etc.?

Cheers,
Elemento

Hey @Elemento

OK, for fully-connected feed-forward neural networks we use Dense/Linear layers, which make use of broadcasting (I think the best documentation on that is torch.matmul — PyTorch 1.13 documentation).

For example, in PyTorch, TensorFlow, and Trax:
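
Here is a rough sketch of what I mean, with made-up sizes (3 input features, 5 units); the Trax part follows my recollection of its init API, so treat it as an approximation:

import numpy as np
import torch
import tensorflow as tf
import trax.layers as tl
from trax import shapes

x = np.random.randn(2, 4, 3).astype(np.float32)   # (batch_size, sequence_length, features)

# PyTorch: weight is (5, 3), applied to the last dimension of x
print(torch.nn.Linear(3, 5)(torch.from_numpy(x)).shape)   # torch.Size([2, 4, 5])

# TensorFlow: kernel is (3, 5), same broadcasting over the leading dimensions
print(tf.keras.layers.Dense(5)(x).shape)                  # (2, 4, 5)

# Trax: tl.Dense behaves the same way (it needs an input signature to initialize)
dense = tl.Dense(5)
dense.init(shapes.signature(x))
print(dense(x).shape)                                     # (2, 4, 5)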

Later in the course, when we work with all the sequence outputs, we make use of batch matrix-matrix products.

Cheers

Hey @arvyzukai,
Thanks for your time. I always thought that we could only feed 2D inputs to Dense layers, since that was the way it was taught in MLS, DLS, etc. I did try the code in TensorFlow myself, and it worked as you depicted, but I still doubted it, since the Dense layer is usually not explained with 3D inputs.

Thanks a lot for all your time.

Cheers,
Elemento
