How to determine max_length?

In week 2, in the notebook for the lecture video on the IMDB dataset, we set
max_length = 120

and in the Sarcasm dataset, max_length = 32.

How did we come up with max_length = 32, for example? What would be the code for that?
Likewise, how did we decide that vocab_size should be 10000?

Hey @bluetail,

Here are just my two cents; I may be wrong, but I'll leave it here.

They are hyperparameters. If you understand the tradeoff between overfitting and underfitting, you can see that adjusting these parameters shifts where you land on that tradeoff.

Under the hood of :point_up:

max_length can be a rough number that you think is large enough.

I guess you use it for pad_sequences. Here is the official explanation:

maxlen: Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.

Even if you do not set it, a maximum length will be chosen internally; however, the tradeoff will then happen based on this "internal" length.
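To make that concrete, here is a rough sketch (not the course notebook code) of how you could pick max_length from the data instead of guessing. It assumes your raw training texts are in a hypothetical list called `sentences`:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical list of raw training strings.
sentences = ["the movie was great", "terrible acting but a wonderful plot"]

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Look at the length distribution instead of guessing a number.
lengths = [len(s) for s in sequences]
print("longest sequence:", max(lengths))
print("95th percentile :", int(np.percentile(lengths, 95)))

# Choose a max_length that covers most sequences without padding
# everything out to the single longest outlier.
max_length = int(np.percentile(lengths, 95))
padded = pad_sequences(sequences, maxlen=max_length,
                       padding="post", truncating="post")
```

That is presumably why the short Sarcasm headlines get max_length = 32 while the much longer IMDB reviews get 120.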

I guess you use vocab_size for the Tokenizer's num_words. Here is the official description:

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

You don’t have to feed every word into the model. As mentioned, it can be a rough number. However, if you are comfortable with Python, you can count all the distinct words used in the text and then select a number, e.g. 10000, to feed the Tokenizer; of course, this will affect word_index / index_word, and again comes back to the tradeoff :point_up: .
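As a sketch of that idea (again, not the official notebook code): fit a throwaway Tokenizer first just to count how many distinct words the corpus actually contains, then decide on num_words. The `sentences` list is the same hypothetical one as above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a throwaway tokenizer just to count the distinct words in the corpus.
probe = Tokenizer(oov_token="<OOV>")
probe.fit_on_texts(sentences)
total_words = len(probe.word_index)
print("distinct words in corpus:", total_words)

# Keep only the most frequent words; rarer ones will map to <OOV>.
vocab_size = min(10000, total_words)
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
```

Note that, as far as I know, word_index itself still lists every word it has seen; num_words only limits which indices texts_to_sequences will actually emit.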

Hopefully this helps; I may be wrong, but it should give you some hints.

Another point.

In DL or ML, in my experience, there aren’t absolutely correct values for (hyper)parameters. There are only values that are JUST RIGHT.

Hopefully, it helps.

My thought on this, and maybe it’s another way of saying what @Chris.X says above, is that these numbers are typically obtained empirically. That is, through experiment and measurement. And, like all things engineering, there are tradeoffs. At some point, the marginally increased accuracy of a larger vocabulary and sentence length isn’t worth the added cost in memory footprint and runtime to include them. By running experiments using different thresholds you can be data driven about your selection of these parameters.

For these classes, however, the numbers are often set at a level that performs adequately enough to learn concepts while keeping the cost of elastic cloud computing at a manageable level for deeplearning.ai. You should expect to have bigger everything when you do this in a production environment.
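On the "data driven" point: before committing to full training runs, a cheap first experiment is to measure how much of your data each candidate threshold actually covers. A sketch, reusing the hypothetical `sequences` list from the earlier example:

```python
import numpy as np

# Lengths of the tokenized training sequences from the earlier sketch.
lengths = np.array([len(s) for s in sequences])

# For each candidate max_length, report what fraction of the sequences
# would fit without being truncated.
for candidate in (16, 32, 64, 120, 200):
    coverage = np.mean(lengths <= candidate)
    print(f"max_length={candidate:>4}: covers {coverage:.1%} of sequences")
```

After that, training at a few (vocab_size, max_length) settings and comparing validation accuracy tells you where the marginal gains stop being worth the extra memory and runtime.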