Using pre-trained tokenizers and embedding layers

HuggingFace Question
Can I use a tokenizer trained for one model (e.g. BERT, RoBERTa, etc.) together with the pre-trained embeddings learned for GPT-2?

If yes, how can one do that?

Hi @Amit_Bhartia

In general, no, you cannot. Technically you can feed another tokenizer's IDs into the embedding layer, but those IDs would not point at the rows GPT-2 learned for them, so you would have to retrain the entire GPT-2 model on these tokens, which would not make much sense (randomly initialized embeddings would probably be as good as the pre-trained ones). Also, tokenizers are usually optimized for the models they were trained with, so using BERT tokens for GPT-2 even with full retraining would likely be sub-optimal.
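
To make the mismatch concrete, here is a minimal sketch (assuming the standard `bert-base-uncased` and `gpt2` checkpoints from the Hugging Face Hub):

```python
# Minimal sketch: BERT token IDs do not index GPT-2's embedding matrix.
from transformers import AutoTokenizer, AutoModel

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoModel.from_pretrained("gpt2")

ids = bert_tok("tokenizers are model specific", return_tensors="pt")["input_ids"]
gpt2_emb = gpt2.get_input_embeddings()

print(ids)                    # WordPiece IDs from BERT's ~30k vocabulary
print(gpt2_emb.weight.shape)  # torch.Size([50257, 768]) -- GPT-2's BPE vocabulary
# The same integer ID maps to different strings in the two vocabularies,
# so gpt2_emb(ids) would fetch rows unrelated to the original BERT tokens.
```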

Cheers

Hi @arvyzukai,

Thanks for quickly clarifying!

I was thinking of using the GPT-4 tokenizer with the BERT embeddings, since the GPT-4 embeddings are paid, hence my query.

Any ideas for that?

Best Regards,
Amit

Not a good idea, because they are completely different architectures with incompatible vocabularies.
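
For a quick check of the size mismatch alone (a minimal sketch, assuming tiktoken's `cl100k_base` encoding for the GPT-4 tokenizer and the `bert-base-uncased` checkpoint):

```python
# The GPT-4 vocabulary is roughly three times larger than BERT's, so most
# GPT-4 token IDs have no row at all in BERT's embedding matrix.
import tiktoken
from transformers import AutoModel

gpt4_enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4
bert = AutoModel.from_pretrained("bert-base-uncased")

print(gpt4_enc.n_vocab)                          # ~100k BPE tokens
print(bert.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])
# Even the IDs that do fall inside BERT's range point at unrelated strings.
```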

Why don’t you want to use the BERT tokenizer?

I wanted to try out a pre-trained subword tokenizer, and the one from GPT-4 has been vastly improved. I wanted to see how that impacts the training of my model.

It very much depends on the dataset that you have.

Some tokenizers are better than others out of the box (considering a wide range of use cases), but you would have to retrain the entire model to see whether it actually “improves” things for the model you have.

Retraining the entire BERT model with a different tokenizer would be too costly, and I would guess the improvement (if any) would not be worth it.
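
If you just want a rough sense of the difference before committing to any retraining, you can compare how aggressively each tokenizer splits a sample of your own data (a cheap sketch; the example sentences are placeholders for your corpus):

```python
# Rough comparison: tokens produced per word by each tokenizer on sample text.
from transformers import AutoTokenizer
import tiktoken

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

samples = [
    "Pre-trained subword tokenizers split rare words very differently.",
    "Domain terms like immunohistochemistry often fragment into many pieces.",
]

for text in samples:
    words = len(text.split())
    print(
        f"{words:2d} words | "
        f"BERT: {len(bert_tok.tokenize(text))} tokens | "
        f"cl100k_base: {len(gpt4_enc.encode(text))} tokens"
    )
```

A lower token count per word generally means a longer effective context and fewer lookups, but as noted above, that only pays off if the model's embeddings were trained with that tokenizer.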

Agreed. Thanks for the guidance! 🙌
