The weight matrix from the mean layer to the dense layer changes every batch

In every batch, the output of the mean layer is batch_size * max_length, but max_length changes every batch: it can be 14, or 11, or something else. So the weight matrix always changes.

My question is: how does the model know the weight matrix shape, and also the input shape of the dense layer? If it always changes, how can we learn the weight matrix, since it is not fixed?

There should normally be some padding involved so that all inputs are of the same length: each input is either truncated to that length or extended with a special token to reach it.

Yes, but in the code the padding is added per batch, and every batch has a different max length.

The batch length does not matter; it can vary. If you use, for example, batch gradient descent, the batches taken from the training set will not always be equal in size.

Hmm, maybe you misunderstood my point, sorry about that. I mean that the max sentence length in every batch is different, not the batch length.

When a batch enters the model, all sentences in it must be the same length.

Yes, they must be the same length. But Trax will do that automatically for us, right?

Hi @trungsanglong25

No, Trax would not do that (and neither would TensorFlow or PyTorch). That is why we implement the data_generator function, which prepares the inputs the way we want them.
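
For example, a data generator that pads every sentence in a batch to that batch's own max length could look roughly like this (just a sketch of the idea, not the actual assignment code; the pad id of 0 and the exact signature are assumptions):

import numpy as np

def data_generator(sentences, labels, batch_size, pad_id=0):
    # sentences: list of lists of token ids; labels: list of ints
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        batch_labels = labels[start:start + batch_size]
        max_len = max(len(s) for s in batch)                # max length *within this batch*
        padded = [s + [pad_id] * (max_len - len(s)) for s in batch]
        yield np.array(padded), np.array(batch_labels)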

Regarding your previous point:

That is not really true. The output of the mean layer is (batch_size x embedding_dim), and the weight matrix of the following dense_output_layer never changes (it is always embedding_dim x output_dim, or 256x2 in this assignment).

Here is an example with batch size = 1:

# example sentence:
'The accuracy of the model is very unique'

### preprocess strips this sentence
### __UNK__, model, uniqu

# inputs
DeviceArray([[   2, 2378, 2939]], dtype=int32)

# Embedding_9088_256 layer outputs.shape
(1, 3, 256)

# Mean layer outputs.shape
(1, 256)

# Dense_2 layer outputs.shape
(1, 2)

# an example of outputs:
DeviceArray([[ 0.548059 , -0.5545566]], dtype=float32)

Cheers

Wow, thanks for your answer, I think I understand now. And I think max_length can only affect the Dense layer if our model doesn’t have the Mean layer. With the example above, it would be:

# example sentence:
'The accuracy of the model is very unique'

### preprocess strips this sentence
### __UNK__, model, uniqu

# inputs
DeviceArray([[   2, 2378, 2939]], dtype=int32)

# Embedding_9088_256 layer outputs.shape
(1, 3, 256)

# Dense_2 layer outputs.shape
(1, 2)

# an example of outputs:
DeviceArray([[ 0.548059 , -0.5545566]], dtype=float32)

and with another example:

# example sentence:
'The accuracy of the model is very unique'

### preprocess strips this sentence
### __UNK__, model, uniqu

# inputs
DeviceArray([[   2, 2378, 2939, 4, 5]], dtype=int32)

# Embedding_9088_256 layer outputs.shape
(1, 5, 256)

# Dense_2 layer outputs.shape
(1, 2)

# an example of outputs:
DeviceArray([[ 0.548059 , -0.5545566]], dtype=float32)

Now max_length has a big effect on the Dense layer.

So in conclusion, the model needs an aggregation layer (like mean, sum, …) or a Flatten layer after the embedding layer, so that the shape coming out of the embedding is the same for every example. Is my thinking correct?

Below is my rebuilt model in TensorFlow; does it have the same idea as the model in that course’s week 1?

model = tf.keras.Sequential(
    [tf.keras.layers.Embedding(vocab_size, embedding_dim),
     tf.keras.layers.GlobalAveragePooling1D(),
     tf.keras.layers.Dense(units=2)]
)

Layer (type)                Output Shape              Param #
===========================================================================
embedding_2 (Embedding)     (None, None, 256)         2326784

global_average_pooling1d_2  (None, 256)               0
(GlobalAveragePooling1D)

dense_2 (Dense)             (None, 2)                 514
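
As a quick sanity check (just a sketch; the token ids are the ones from the example earlier in the thread, and model is the Sequential model defined above), feeding inputs of two different lengths gives the same output shape:

import tensorflow as tf

short_input = tf.constant([[2, 2378, 2939]])        # length 3
long_input  = tf.constant([[2, 2378, 2939, 4, 5]])  # length 5

print(model(short_input).shape)  # (1, 2)
print(model(long_input).shape)   # (1, 2) -- GlobalAveragePooling1D removes the length axis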

Hi @trungsanglong25

Not really.

  • In this assignment, max_length affects only the shape of the batch matrix. In our previous example, when there was only one sentence in the batch, the input batch was of shape (1 x 3); if, for example, we had more sentences, the first dimension would change, and a different max length would change the last dimension. If there were 8 sentences with a max length of 3, the input batch shape would have been (8 x 3), with padding where necessary; if there were 8 sentences with a max length of 12, the input batch would be (8 x 12), etc.
  • If there were no Mean layer, the output would have been (1 x 3 x 2) instead of the (1 x 2) we get with the Mean layer. In other words, max_length has nothing to do with the Dense layer; the Dense layer only changes the last dimension (the embedding dimension in this case).

The role of the Mean layer in this case is just to average the embeddings of all the words in each sentence (when we correctly specify the axis). This is not a sophisticated way of doing things, but for learning purposes it is enough: we somehow need to represent the sentence, so the approach here is simply to average the word embeddings (a bag-of-words approach) and try to guess the sentiment from that average embedding.
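
To make the averaging concrete, here is a small sketch with random numbers standing in for the embeddings (the shapes follow the 8-sentence, max length 12 example above):

import numpy as np

batch_size, max_len, embedding_dim, output_dim = 8, 12, 256, 2
embeddings = np.random.rand(batch_size, max_len, embedding_dim)   # Embedding layer output

sentence_vectors = embeddings.mean(axis=1)        # average over the words of each sentence
print(sentence_vectors.shape)                     # (8, 256) -- max_len is gone

dense_weights = np.random.rand(embedding_dim, output_dim)         # always 256 x 2
print((sentence_vectors @ dense_weights).shape)   # (8, 2)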

I hope that clears things up for you.

Cheers
