Hi team,
I have a question regarding the following slide:
I got confused about the statement “train tokenizer” in the context of post-training. Is that really a thing? Wouldn’t that alter existing token IDs and thus make the existing embeddings unusable? I know that it’s always possible to manually add new tokens to a tokenizer / LLM, but the term “training” confuses me. If it refers to a specific technique, I would appreciate it if you could provide a brief reference. Thank you!
To treat a new tag like <|THINK|> as a single unit, you must first update the tokenizer’s rules so it recognizes the string as a whole instead of splitting it into subwords or individual characters. In this context, “training” the tokenizer usually means extending it with the tag as a special token rather than retraining it from scratch: the new token is appended to the end of the vocabulary, so existing token IDs (and the embeddings attached to them) are left untouched.
Adding the tag does increase the vocabulary size, which requires you to resize the model’s embedding layer (and tied output layer) to accommodate the new slot. Since the new token starts as a blank slate, you initialize its vector (randomly, or with a heuristic such as the mean of existing embeddings) and then fine-tune the model so it learns the specific meaning behind the tag. Sharon explains it in the final slide of the video.
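To make the mechanics concrete, here is a toy sketch of what “add a special token, then resize the embeddings” does under the hood. The vocabulary, IDs, and 4-dimensional vectors are all made up for illustration; real libraries (e.g. Hugging Face, via `tokenizer.add_special_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`) wrap the same idea:

```python
import random

# Toy vocabulary standing in for a trained tokenizer (hypothetical IDs).
vocab = {"hello": 0, "world": 1, "<|eos|>": 2}
# One embedding row per token (tiny 4-dim vectors for illustration).
embeddings = [[0.1 * i] * 4 for i in range(len(vocab))]

def add_special_token(token, vocab, embeddings):
    """Append a new token at the end of the vocab and grow the
    embedding table; existing IDs and rows are untouched."""
    if token in vocab:
        return vocab[token]
    new_id = len(vocab)  # IDs are contiguous, so the next free slot
    vocab[token] = new_id
    # The new row starts as a "blank slate" (random init here); it only
    # becomes meaningful after fine-tuning.
    embeddings.append([random.gauss(0.0, 0.02) for _ in range(4)])
    return new_id

old_ids = dict(vocab)
think_id = add_special_token("<|THINK|>", vocab, embeddings)

# Existing token IDs are preserved; only one new slot was appended.
assert all(vocab[t] == i for t, i in old_ids.items())
print(think_id, len(embeddings))  # → 3 4
```

The key point for your question: because the new ID is appended at the end, nothing about the existing ID-to-embedding mapping changes; the “training” only affects the new slot (and whatever the subsequent fine-tuning updates).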