W4 Assignment 1 Exercise 8: Are the input dimensions of our transformer model correct?

I’m looking at the provided API docs of the transformer’s call method that we are implementing as part of Week 4 Assignment 1 Exercise 8.

        """
        Forward pass for the entire Transformer
        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len, embedding_dim)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len, embedding_dim)
        """

Are the input/output tensor shapes correct in having the embedding dimension? Since we have an embedding layer in the encoder/decoder that takes the source/target vocabulary size as input and has embedding_dim hidden units, I expected the shape of the input here to be (batch_size, input_seq_len, source_vocabulary_size), with the input consisting of one-hot encoded vectors for each word in a sample, meaning that the model learns the embeddings on its own. Am I correct in my assumptions, or am I missing something?


While describing the problem I also realised that the encoder and decoder are using separate embedding layers. Doesn’t that introduce different representations of words in the encoder and decoder?

This has already been reported and fixed. Please confirm whether refreshing your workspace gets you the new version of the notebook, where the doc reads as follows:

        Forward pass for the entire Transformer
        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len)
                              An array of the indexes of the words in the output sentence

As with any NN, you can fine-tune the embedding layer or train it from scratch. Do see the transfer learning videos in Course 4 to jog your memory on that topic.

In TensorFlow, you don’t have to encode words as one-hot vectors. Calling an embedding layer with integers in the range [0, VOCAB_SIZE) will do the job. Take a look at this example.
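Here is a minimal sketch along those lines (the vocabulary size, embedding dimension, and sample indexes below are illustrative):

    import tensorflow as tf

    VOCAB_SIZE = 10000    # illustrative vocabulary size
    EMBEDDING_DIM = 64    # illustrative embedding dimension

    embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)

    # A batch of 2 sentences, each given as 3 word indexes in [0, VOCAB_SIZE)
    sentences = tf.constant([[5, 42, 7], [99, 3, 0]])    # shape (2, 3)

    # No one-hot encoding needed: the layer looks up each index directly
    embedded = embedding(sentences)
    print(embedded.shape)    # (2, 3, 64) -- embedding_dim appears only at the output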

Let’s consider an English-to-French translation system. The embedding layer in the encoder is used to represent English words, and the embedding layer in the decoder is used to represent French words. So, embedding layers aren’t shared across the encoder/decoder.
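A minimal sketch of that separation (the vocabulary sizes and variable names below are illustrative):

    import tensorflow as tf

    english_vocab_size = 8500    # illustrative source vocabulary size
    french_vocab_size = 8000     # illustrative target vocabulary size
    embedding_dim = 128

    # Each side owns its own embedding table over its own vocabulary,
    # so English and French words get independent learned representations.
    encoder_embedding = tf.keras.layers.Embedding(english_vocab_size, embedding_dim)
    decoder_embedding = tf.keras.layers.Embedding(french_vocab_size, embedding_dim)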


Thank you for the response. I completed the course ~2 months ago and had the notebook saved, which is why I couldn’t see the updated version. Thank you for the explanation too. Very helpful.