Could someone explain what embedding is and why it is needed in an intuitive way?
As a general concept:
- Embedding captures the relationships between words by representing each one as a vector of numbers, so that related words end up with similar vectors (a small sketch is below).
- Given an initial word, this lets a model make predictions about which words may follow.
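A minimal sketch of that idea, using made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the numbers here are invented purely for illustration):

```python
import numpy as np

# Toy embeddings: each word maps to a vector (values invented for illustration).
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means very similar, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words get similar vectors, so their similarity score is higher.
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # higher: related meaning
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower: less related
```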
Thanks! How is embedding related to tokenization?
In essence (not technically complete):
- Sentences are composed of tokens. Tokens are the standardized building blocks that make up a sentence. A token might be the root form of a word, and it might also include punctuation.
- Embeddings give you the relationships between tokens within a specific language (see the sketch after this list).
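To make the token/embedding split concrete, here is a toy sketch (the vocabulary, token IDs, and vector values are all invented; real tokenizers and embedding tables are far larger):

```python
import numpy as np

# Step 1: tokenization -- a sentence becomes a list of tokens from a fixed vocabulary.
# (Real tokenizers split words into subword pieces and handle punctuation; this is a toy version.)
vocab = {"the": 0, "cat": 1, "sat": 2, ".": 3}
sentence = ["the", "cat", "sat", "."]
token_ids = [vocab[token] for token in sentence]   # [0, 1, 2, 3]

# Step 2: embedding -- each token ID looks up a row in an embedding matrix,
# one row per vocabulary entry, here with 3 dimensions (invented numbers).
embedding_matrix = np.array([
    [0.1, 0.3, 0.2],   # "the"
    [0.9, 0.1, 0.0],   # "cat"
    [0.2, 0.8, 0.5],   # "sat"
    [0.0, 0.0, 0.1],   # "."
])
sentence_vectors = embedding_matrix[token_ids]     # shape: (4 tokens, 3 dimensions)
print(sentence_vectors.shape)
```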
Thanks! Very helpful!
To continue the conversation about tokens and embeddings: what happens if we have a special slang or industry term in our data that the LLM hasn't been trained on?
- Would we have to do full fine-tuning to teach the model how the new word relates to all other words, or could we somehow piggyback off the knowledge the LLM already has of its synonyms?
- And how can we do the tokenization of the new word in the first place?
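Not a full answer, but on that last point: most LLM tokenizers are subword tokenizers, so a word they have never seen as a whole is usually broken into smaller pieces that are already in the vocabulary, rather than becoming a single unknown token. A rough sketch, assuming the Hugging Face transformers library and the GPT-2 tokenizer (both are my assumptions, not something from this thread):

```python
from transformers import AutoTokenizer

# Load a pretrained subword (BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A made-up slang/industry term the tokenizer has likely never seen as a whole word.
new_word = "fintechification"

# The tokenizer falls back to smaller subword pieces that ARE in its vocabulary,
# so the word still gets token IDs (and therefore embeddings) without any retraining.
print(tokenizer.tokenize(new_word))
print(tokenizer.encode(new_word))
```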