Hey @bluetail,
Here are just my two cents; I may be wrong, so take it with a grain of salt.
They are hyperparameters. If you know the tradeoff between overfitting and underfitting, you can see that adjusting those parameters moves you along that tradeoff.
Under the hood, `max_length` can be a rough number that you think is enough. I guess you use it for `pad_sequences`. Here is the official explanation:

> maxlen: Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.

Even if you don't set it, a max length is chosen internally, and the tradeoff then happens based on that "internal" length.
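A quick illustration of that behavior (a minimal sketch with toy sequences, assuming TensorFlow/Keras is installed):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]

# Without maxlen: everything is padded to the longest sequence in the batch (here, 5).
padded_default = pad_sequences(sequences)
print(padded_default.shape)  # (3, 5)

# With an explicit max_length: longer sequences are truncated, shorter ones padded.
max_length = 4
padded_fixed = pad_sequences(sequences, maxlen=max_length)
print(padded_fixed.shape)  # (3, 4)
```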
I guess you use `vocab_size` for the `Tokenizer`'s `num_words`. Here is the official explanation:

> num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
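For example (a minimal sketch with toy texts; `vocab_size` is just an illustrative name here):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog sat"]

vocab_size = 4  # keep only the most common num_words-1 = 3 words
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(texts)

# word_index still contains every word seen in the texts...
print(tokenizer.word_index)
# ...but texts_to_sequences only emits indices below num_words.
print(tokenizer.texts_to_sequences(texts))
```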
You don't have to feed every word into the model. As mentioned, it can be a rough number. However, if you are comfortable with Python, you can count all the words used in the text and pick a number, e.g. 10000, to feed to the `Tokenizer`. Of course, this affects `word_index` and `index_word`, and again it comes back to the same tradeoff.
Hopefully this helps; I may be wrong, but maybe it gives you some hints.