Tokenizer vocab size, how do we decide?

I keep seeing the seemingly arbitrary value of 10,000 used as the vocabulary size for Tokenizer.

What should our strategy be for determining that parameter for Tokenizer?


I haven’t found an authoritative statement on how to set that parameter. I did find several papers online where people experimented with different statistical measures, and there seemed to be some consensus that the optimal vocabulary size depends somewhat on the specific NLP task at hand. Some ended up with a heuristic like this one: use the largest possible vocabulary such that at least 95% of classes have 100 or more examples in training. If you find anything decisive, please share.
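One practical variant of that idea is to pick the smallest vocabulary whose most frequent words cover some target fraction of all token occurrences in your training corpus. Here is a minimal sketch of that coverage heuristic; it uses plain whitespace tokenization and `collections.Counter` rather than the Tokenizer class itself, and the function name and the 95% target are just illustrative choices, not anything standard:

```python
from collections import Counter

def vocab_size_for_coverage(texts, coverage=0.95):
    """Return the smallest vocabulary size whose most-frequent words
    account for at least `coverage` of all token occurrences.

    Whitespace tokenization is used here for simplicity; a real
    Tokenizer typically also strips punctuation, etc."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    running = 0
    for size, (_word, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return size
    return len(counts)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(vocab_size_for_coverage(corpus, coverage=0.5))  # → 3
```

You could then pass the returned number as the vocabulary-size argument when constructing the tokenizer, and rerun the calculation whenever the training corpus changes.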