I’m looking at the provided docstring of the Transformer’s call method that we are implementing as part of Week 4 Assignment 1 Exercise 8.
"""
Forward pass for the entire Transformer
Arguments:
input_sentence – Tensor of shape (batch_size, input_seq_len, embedding_dim)
An array of the indexes of the words in the input sentence
output_sentence – Tensor of shape (batch_size, target_seq_len, embedding_dim)
…
"""
Is it correct for the input/output tensor shapes to include the embedding dimension? Since the encoder/decoder each have an embedding layer that takes the source/target vocabulary size as input and has embedding_dim hidden units, I expected the input shape here to be (batch_size, input_seq_len, source_vocabulary_size), i.e. the input would consist of one-hot encoded vectors for each word in a sample, meaning the model learns the embeddings on its own. Am I correct in my assumptions, or am I missing something?
…
While describing the problem I also realised that the encoder and decoder use separate embedding layers. Doesn’t that introduce different representations of words in the encoder and the decoder?
This has already been reported and fixed. Please confirm if refreshing your workspace doesn’t get you the new version of the notebook, where the docstring reads as follows:
Forward pass for the entire Transformer
Arguments:
input_sentence -- Tensor of shape (batch_size, input_seq_len)
An array of the indexes of the words in the input sentence
output_sentence -- Tensor of shape (batch_size, target_seq_len)
An array of the indexes of the words in the output sentence
As with a NN, you can fine-tune the embedding layer or train it from scratch. Do see the transfer learning videos in Course 4 to jog your memory on that topic.
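For instance, the difference between reusing pretrained vectors and learning them from scratch might look roughly like this (a sketch only; `pretrained_matrix` is a hypothetical array of precomputed embeddings, not something provided in the assignment):

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMBEDDING_DIM = 8000, 128  # made-up sizes for illustration
pretrained_matrix = np.random.rand(VOCAB_SIZE, EMBEDDING_DIM)  # stand-in for real pretrained vectors

# Option 1: start from pretrained embeddings and keep them frozen
# (set trainable=True instead to fine-tune them during training)
frozen_embedding = tf.keras.layers.Embedding(
    VOCAB_SIZE, EMBEDDING_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=False,
)

# Option 2: train the embedding layer from scratch
learned_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
```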
In TensorFlow, you don’t have to encode words as one-hot vectors. Calling an embedding layer with integers in the range [0, VOCAB_SIZE) will do the job. Take a look at this example.
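Something along these lines (a minimal sketch with made-up sizes):

```python
import tensorflow as tf

VOCAB_SIZE, EMBEDDING_DIM = 8000, 128  # made-up sizes for illustration
embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

# A batch of 2 sentences, each given as word indexes in [0, VOCAB_SIZE)
sentences = tf.constant([[12, 57, 3, 0],
                         [845, 7, 22, 9]])  # shape (batch_size=2, seq_len=4)

vectors = embedding(sentences)
print(vectors.shape)  # (2, 4, 128): one embedding vector per word index
```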
Let’s consider an English-to-French translation system.
The embedding layer in the encoder is used to represent English words and the embedding layer in the decoder is used to represent French words. So, embedding layers aren’t shared across the encoder / decoder.
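A rough sketch of that setup (vocabulary sizes are made up):

```python
import tensorflow as tf

# Hypothetical vocabulary sizes for illustration
ENGLISH_VOCAB_SIZE = 8000   # source language (encoder side)
FRENCH_VOCAB_SIZE = 10000   # target language (decoder side)
EMBEDDING_DIM = 128

# Each side owns its own embedding table, so English and French
# word indexes map into separately learned representations
encoder_embedding = tf.keras.layers.Embedding(ENGLISH_VOCAB_SIZE, EMBEDDING_DIM)
decoder_embedding = tf.keras.layers.Embedding(FRENCH_VOCAB_SIZE, EMBEDDING_DIM)

english_ids = tf.constant([[5, 42, 7]])     # (batch_size=1, input_seq_len=3)
french_ids = tf.constant([[9, 3, 81, 2]])   # (batch_size=1, target_seq_len=4)

print(encoder_embedding(english_ids).shape)  # (1, 3, 128)
print(decoder_embedding(french_ids).shape)   # (1, 4, 128)
```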
Thank you for the response. I completed the course ~2 months ago and had the notebook saved, which is why I couldn’t see the updated version. Thank you for the explanation too. Very helpful.