In labs 1 and 2, where we were tokenizing our own datasets, we passed the following parameters to the Embedding layer when defining the model:
tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length)
However, when using pre-tokenized subwords, we are only passing 2 parameters into the Embedding layer:
tf.keras.layers.Embedding(tokenizer_subwords.vocab_size, embedding_dim)
Also, the output shape of the Embedding layer is different; it has two `None` dimensions:
embedding (Embedding) (None, None, 64) 523840
Can someone explain what’s happening and why there is this difference between the two versions?
Thanks!
The output shape of an embedding layer stands for (BATCH_SIZE, MAX_LENGTH_OF_INPUT_SENTENCE, EMBEDDING_DIM_PER_TOKEN). When we skip the `input_length` parameter, TensorFlow can’t infer the maximum length of the input sentences, hence the `None` in the 2nd dimension.
The exercises in general contain a tokenizer followed by a padding call, which encodes words / subwords into integers and pads / truncates each row to the provided length. The advantage of providing `input_length` is that the model summary is clear. The drawback of specifying the `input_length` parameter is that the data must always have the same shape. For instance, specifying `input_length` as 120 means that all training / testing data fed to the embedding layer must be of the form (BATCH_SIZE, 120). This means that shorter sentences are needlessly padded all the way to the maximum length. On the other hand, if you don’t provide the `input_length` parameter, you can pad each batch only to the maximum sentence length within that batch, which is more efficient.
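As a rough illustration of that trade-off (the toy sequences and batch size below are made up), fixed-length padding versus per-batch padding could look like this:

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy sequences of different lengths.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]

# Fixed-length padding: every row is padded/truncated to 120 so it matches
# an Embedding layer defined with input_length=120.
fixed = pad_sequences(sequences, maxlen=120, padding='post')
print(fixed.shape)          # (3, 120)

# Per-batch padding: each batch is padded only to its own longest sequence,
# which the Embedding layer accepts when no input_length is given.
dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))
for batch in dataset.padded_batch(2, padded_shapes=[None]):
    print(batch.shape)      # (2, 3), then (1, 5)
```

With per-batch padding, each batch carries only as much padding as its own longest sentence requires, instead of every row being stretched to 120.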