C4W2 programming assignment: small question about Encoder layer input x dimension

Below is the Encoder code:

class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer
    and using positional encoding to then pass the output through a stack of
    encoder layers.
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.embedding_dim)

        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps)
                           for _ in range(self.num_layers)]

        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder

        Arguments:
            x (tf.Tensor): Tensor of shape (batch_size, seq_len, embedding_dim)
            training (bool): Boolean, set to true to activate
                        the training mode for dropout layers
            mask (tf.Tensor): Boolean mask to ensure that the padding is not
                        treated as part of the input

        Returns:
            x (tf.Tensor): Tensor of shape (batch_size, seq_len, embedding_dim)
        """
        seq_len = tf.shape(x)[1]

        # Pass input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)
        # Scale embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # Add the position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]
        # Pass the encoded embedding through a dropout layer (use `training=training`)
        x = self.dropout(x, training=training)
        # Pass the output through the stack of encoding layers
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, embedding_dim)

Regarding the comments on the call method:

Arguments:
    x (tf.Tensor): Tensor of shape (batch_size, seq_len, embedding_dim)

Why does the input x to the call method have embedding_dim as its last dimension? It has not passed through the embedding layer yet…?
Thank you.
DS


Interesting question! I tried following the Embedding docs, but could not see why the embedding dim would appear in the input. I would like to invite other NLP mentors to answer this. Perhaps the embedding dim is taken into consideration to handle the masking for that input by the Embedding layer? @arvyzukai


Hello @Dennis_Sinitsky
If the question is related to this:

# Pass input through the Embedding layer
x = self.embedding(x)  # (batch_size, input_seq_len, embedding_dim)

then notice that this x comes from calling self.embedding, defined as
self.embedding = tf.keras.layers.Embedding(input_vocab_size, self.embedding_dim)

and
self.embedding_dim = embedding_dim, set in __init__ along with the number of layers, number of heads, and fully connected dim.

Regards
DP

Hi Deepti,
yes, I realize what you are saying. However, the docstring header

Arguments:
    x (tf.Tensor): Tensor of shape (batch_size, seq_len, embedding_dim)

refers to x as the input to the method, i.e. for x = self.embedding(x) it refers to the x on the right side of the equation, not the left side. I think that comment was a small typo.
DS

x = self.embedding(x) passes the input x through the embedding layer using self.embedding, producing the tf.Tensor of shape (batch_size, seq_len, embedding_dim).

The x on the right side is the argument, while the x on the left side is the result, reassigned for the step mentioned.

In Python, arguments are values that are passed to functions, methods, or classes when they are called. Arguments can provide additional information to the function, method, or class, or change the way it behaves. There are two types of arguments in Python: positional arguments and keyword arguments.
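For example (a generic illustration, not from the assignment):

def scale(x, factor=1.0):
    # x is passed positionally; factor can be passed as a keyword argument
    return x * factor

print(scale(3))              # positional argument only -> 3.0
print(scale(3, factor=2.0))  # keyword argument changes the behavior -> 6.0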

Regards
DP

Hi Deepti,
I think I get your point. However, for x = self.embedding(x), let's rewrite it as x1 = self.embedding(x). Then x does not have the embedding dimension in its shape, but x1 will have it. And it is x that is the argument of the call method of the Encoder class, not x1, which is an intermediate variable.
Am I understanding this wrong?
Dennis

Hi @Dennis_Sinitsky

You are correct that the docstring is incorrect (a typo).
The input x for the call method has shape (batch_size, seq_len); the embedding dimension appears only after the embedding layer.
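To see the shape change concretely, here is a minimal sketch (TensorFlow 2.x; the vocabulary size and embedding dimension below are made-up illustration values, not the assignment's):

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=350, output_dim=4)  # hypothetical vocab size and embedding dim
tokens = tf.constant([[2, 3, 1, 3, 0, 0, 0]])  # token IDs, shape (batch_size, seq_len) = (1, 7)
embedded = embedding(tokens)                   # shape (batch_size, seq_len, embedding_dim) = (1, 7, 4)
print(tokens.shape)    # (1, 7)
print(embedded.shape)  # (1, 7, 4)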
I will submit it for correction :+1:

Cheers


That is right, the docstring is incorrect. It can be confirmed by adding a print statement in the Transformer call function, print(f"Transformer input_sentence: {input_sentence.shape}"), which gives Transformer input_sentence: (1, 7) while testing the Transformer.


I believe the padding should have been applied before the input is fed to the Transformer, so the input should probably be (1, 150) or similar, but I'm not sure which part you're testing. In theory, the Encoder should be receiving the padded tensor.

Another minor issue with the Encoder docstring is that the output is also defined as (batch_size, seq_len, embedding_dim), while in the DecoderLayer, the encoder output is defined as (batch_size, input_seq_len, fully_connected_dim). These could be the same size (or not, depending on implementation), but the variable naming, I believe, could cause some confusion to learners… So I don't know… Does anyone find it confusing, or is it fine?

P.S. The Decoder and Transformer inputs have similar docstring issues (input x should not have fully_connected_dim in the brackets).


Yes, the input to the encoder comes after padding is applied. I was testing under the Transformer class, in the "test your function" section, which has the input sentence_a = np.array([[2, 3, 1, 3, 0, 0, 0]]), including padding.
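For reference, here is a small sketch of how the padding positions in that toy input can be turned into a mask (this mirrors the idea behind the assignment's create_padding_mask; the exact shape and convention in the notebook may differ):

import numpy as np
import tensorflow as tf

sentence_a = np.array([[2, 3, 1, 3, 0, 0, 0]])  # toy test input; trailing zeros are padding
# 1 for real tokens, 0 for padding positions
mask = tf.cast(tf.math.not_equal(sentence_a, 0), tf.float32)
print(mask)  # tf.Tensor([[1. 1. 1. 1. 0. 0. 0.]], shape=(1, 7), dtype=float32)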

Regarding the docstrings and comments: those referring to the Encoder output and to the encoder output in the DecoderLayer (and similarly in the Decoder) should use the same nomenclature. This can indeed cause confusion.

You are right about the Decoder docstring too.


:slight_smile: Oh, I see… Then the strange choice of padded length (7) is what confused me :slight_smile: (but looking at the toy test case it kind of makes sense)


The argument has the typo, not the code-line comment you mention, since that comment includes embedding_dim for the step where the input passes through the embedding layer.

Correct argument:
x – Tensor of shape (batch_size, input_seq_len) (including embedding_dim here is the mistake)

Regards
DP


That is true, but as jayant mentioned, it is not entirely obvious that the code comment is about the output of the embedding layer and not the input; if one were to look only at the docstring, it could definitely mislead learners into thinking the code comment is talking about the input.

Also, the return x statement has the same code comment, but if we look at the Decoder’s docstring, the embedding_dim is changed to fully_connected_dim.

In any case, I submitted all the issues with the docstrings for fixing. Hopefully they will be clearer soon.
