Why do we need to include the padding token as a possible output in a text generation model?

See the example here:

# Define the total words. You add 1 for the index 0 which is just the padding token.
total_words = len(tokenizer.word_index) + 1

And this as well:

# Ignore if index is 0 because that is just the padding.
if predicted != 0:

@nitulkukadia that's a good question!
Padding is a preprocessing step applied to all the input sequences. Usually we pad each sequence to the length of the longest sequence in the input. This padding is represented with zeros, so we need to reserve 0 as the padding token when creating our dictionary. This is similar to reserving special tokens like [UNK] (unknown), [SOS] (start of sequence), or [EOS] (end of sequence) in the dictionary, depending on our needs.
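
For illustration, here is a minimal sketch (with a toy corpus, not the course dataset) of how the Keras Tokenizer assigns word indices starting at 1 while pad_sequences fills shorter sequences with 0:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["deep learning is fun", "learning text generation is fun too"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
print(tokenizer.word_index)      # word indices start at 1; 0 is never assigned to a word

sequences = tokenizer.texts_to_sequences(corpus)
padded = pad_sequences(sequences)   # shorter sequences are pre-padded with 0s
print(padded)

total_words = len(tokenizer.word_index) + 1   # +1 reserves index 0 for the padding token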

Hi @jyadav202, thanks for your prompt response.

With respect to the dictionary, we need an entry for padding since we are passing padded sequences along with the sentences.
But why do we need it in the Dense layer, which predicts the probability of the next word? There is no need to predict the probability of the padding token, right?

Sample code from the git repository:
# Build the model
model = Sequential([
    Embedding(total_words, 64, input_length=max_sequence_len-1),
    Bidirectional(LSTM(20)),
    Dense(total_words, activation='softmax')
])

I am thinking we should change the Dense layer to total_words - 1:

# Build the model
model = Sequential([
    Embedding(total_words, 64, input_length=max_sequence_len-1),
    Bidirectional(LSTM(20)),
    Dense(total_words - 1, activation='softmax')
])

Hi @nitulkukadia! Thanks for your question.

A more technical reason for not using total_words - 1 is:
The Dense layer needs to have as many units as there are categories (the last dimension of ys.shape); otherwise the model won't fit the data. Since this is a multiclass problem, we created a one-hot array (ys) and compiled the model with categorical_crossentropy to compute the loss, so the model output and the labels must have the same shape.
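
As a quick sketch of that shape constraint (assuming the labels are one-hot encoded with to_categorical, which is an assumption on my part; the vocabulary size and label values below are made up):

import numpy as np
from tensorflow.keras.utils import to_categorical

total_words = 100                    # assume 99 real words plus padding at index 0
labels = np.array([5, 17, 3])        # hypothetical next-word indices for three examples

ys = to_categorical(labels, num_classes=total_words)
print(ys.shape)                      # (3, 100): the last dimension equals total_words

# categorical_crossentropy compares ys with the softmax output element-wise,
# so a Dense(total_words - 1) layer would output shape (3, 99) and raise a shape mismatch.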

Because of the above reason, the padding token will also be assigned a probability during next-word prediction. To avoid appending it to our seed text, we check if predicted != 0 and skip it.
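
For context, here is a rough sketch of such a generation loop. It assumes tokenizer, model, and max_sequence_len are already defined as in the notebook; the variable names are my guesses and may differ from the course code:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

seed_text = "help me obi wan"
next_words = 10

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    probabilities = model.predict(token_list, verbose=0)
    predicted = np.argmax(probabilities, axis=-1)[0]

    # Index 0 is reserved for padding, so skip it instead of appending a word
    if predicted != 0:
        seed_text += " " + tokenizer.index_word[predicted]

print(seed_text)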

This assignment is a primitive way of doing next-word prediction, as you can see that the predictions are not great. For better results, we need other mechanisms, such as a special end-of-sequence [EOS] token.
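
As a rough illustration (not part of the assignment), one simple way to introduce such a token is to append a marker word to each training line and stop generating once it is predicted:

from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical: append an end-of-sequence marker to every training line
corpus = ["help me obi wan kenobi eos", "you are my only hope eos"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)        # 'eos' gets its own index, just like any other word
eos_index = tokenizer.word_index["eos"]

# During generation, stop once the model predicts the EOS token, e.g.:
# if predicted == eos_index:
#     break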

Thanks @jyadav202, I am good with the clarification provided.