Tokenizer vocab size, how do we decide?

I keep seeing the seemingly arbitrary value of 10,000 used as the vocabulary size for Tokenizer.

What should our strategy be for determining that parameter for Tokenizer?


I haven’t found an authoritative statement on how to set that parameter. I did find several papers online where people experimented with different statistical measures, and there seemed to be some consensus that the optimal vocabulary size depends somewhat on the specific NLP task at hand. Some ended up with a heuristic like this one: use the largest possible vocabulary such that at least 95% of classes have 100 or more examples in training. If you find anything decisive, please share.
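One practical variant of that idea is to pick the smallest vocabulary whose most frequent words cover some target fraction of all token occurrences in your training corpus. Here is a minimal sketch of that coverage heuristic; it uses plain whitespace tokenization and `collections.Counter` rather than the Tokenizer class itself, and the function name and the 95% target are just illustrative choices, not anything standard:

```python
from collections import Counter

def vocab_size_for_coverage(texts, coverage=0.95):
    """Return the smallest vocabulary size whose most-frequent words
    account for at least `coverage` of all token occurrences.

    Whitespace tokenization is used here for simplicity; a real
    Tokenizer typically also strips punctuation, etc."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    running = 0
    for size, (_word, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return size
    return len(counts)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(vocab_size_for_coverage(corpus, coverage=0.5))  # → 3
```

You could then pass the returned number as the vocabulary-size argument when constructing the tokenizer, and rerun the calculation whenever the training corpus changes.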