How to determine max_length?

In week 2, in the notebook for the lecture video on the IMDB dataset, we set
max_length = 120

and in the Sarcasm dataset, max_length = 32.

How did we come up with max_length = 32, for example? What would be the code for that?
Likewise, how did we decide that vocab_size should be 10000?

Hey @bluetail,

Here are just my two cents; I may be wrong, but I'll leave it here.

They are hyperparameters. If you understand the tradeoff between overfitting and underfitting, you can see that adjusting these parameters shifts where you land on that tradeoff.

Under the hood of :point_up:

max_length can be a rough number that you think is large enough.

I guess you use it for pad_sequences. Here is the official explanation:

maxlen: Optional Int, maximum length of all sequences. If not provided, sequences will be padded to the length of the longest individual sequence.

Even if you do not set it, a maximum length will be chosen internally; however, the tradeoff will then happen based on this "internal" length.
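To make that concrete, here is a rough sketch (not the course notebook code) of how you could pick max_length from the data instead of guessing. It assumes your raw training texts are in a hypothetical list called `sentences`:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical list of raw training strings.
sentences = ["the movie was great", "terrible acting but a wonderful plot"]

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Look at the length distribution instead of guessing a number.
lengths = [len(s) for s in sequences]
print("longest sequence:", max(lengths))
print("95th percentile :", int(np.percentile(lengths, 95)))

# Choose a max_length that covers most sequences without padding
# everything out to the single longest outlier.
max_length = int(np.percentile(lengths, 95))
padded = pad_sequences(sequences, maxlen=max_length,
                       padding="post", truncating="post")
```

That is presumably why the short Sarcasm headlines get max_length = 32 while the much longer IMDB reviews get 120.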

I guess you use vocab_size for the Tokenizer's num_words. Here is the official description:

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

You don’t have to feed every word into the model. As mentioned, it can be a rough number. However, if you are comfortable with Python, you can count all the distinct words used in the text and then select a number, e.g. 10000, to feed the Tokenizer; of course, this will affect word_index / index_word, and again comes back to the tradeoff :point_up: .
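As a sketch of that idea (again, not the official notebook code): fit a throwaway Tokenizer first just to count how many distinct words the corpus actually contains, then decide on num_words. The `sentences` list is the same hypothetical one as above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a throwaway tokenizer just to count the distinct words in the corpus.
probe = Tokenizer(oov_token="<OOV>")
probe.fit_on_texts(sentences)
total_words = len(probe.word_index)
print("distinct words in corpus:", total_words)

# Keep only the most frequent words; rarer ones will map to <OOV>.
vocab_size = min(10000, total_words)
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
```

Note that, as far as I know, word_index itself still lists every word it has seen; num_words only limits which indices texts_to_sequences will actually emit.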

Hopefully this helps; I may be wrong, but it should give you some hints.

Another point.

In DL or ML, in my experience, there aren’t absolutely correct values for (hyper)parameters. There are only values that are JUST RIGHT.

Hopefully, it helps.

My thought on this, and maybe it’s another way of saying what @Chris.X says above, is that these numbers are typically obtained empirically. That is, through experiment and measurement. And, like all things engineering, there are tradeoffs. At some point, the marginally increased accuracy of a larger vocabulary and sentence length isn’t worth the added cost in memory footprint and runtime to include them. By running experiments using different thresholds you can be data driven about your selection of these parameters.

For these classes, however, the numbers are often set at a level that performs adequately enough to learn concepts while keeping the cost of elastic cloud computing at a manageable level for deeplearning.ai. You should expect to have bigger everything when you do this in a production environment.
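On the "data driven" point: before committing to full training runs, a cheap first experiment is to measure how much of your data each candidate threshold actually covers. A sketch, reusing the hypothetical `sequences` list from the earlier example:

```python
import numpy as np

# Lengths of the tokenized training sequences from the earlier sketch.
lengths = np.array([len(s) for s in sequences])

# For each candidate max_length, report what fraction of the sequences
# would fit without being truncated.
for candidate in (16, 32, 64, 120, 200):
    coverage = np.mean(lengths <= candidate)
    print(f"max_length={candidate:>4}: covers {coverage:.1%} of sequences")
```

After that, training at a few (vocab_size, max_length) settings and comparing validation accuracy tells you where the marginal gains stop being worth the extra memory and runtime.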