How are numbers represented in dictionaries or embedding vectors?
For example, say I had the following analogy triad to complete:
25 → 35 :: 19 → ?
how are the words “25” and “35” represented in a dictionary? (It would be ridiculous to represent every possible number; there are infinitely many of them.) And how are their embedded representations created? There must be something there, because in our Operations on Word Vectors programming assignment, the model answered back 19 → 33 ???
If a number occurs in the training text and is included in the vocabulary, then that token maps to an integer in the word-to-index dictionary, and that integer indexes a row in the embedding layer.
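As a minimal sketch of that lookup chain (the vocabulary, dimensions, and `embed` helper below are all hypothetical, not taken from the assignment):

```python
import numpy as np

# Toy word-to-index dictionary: each token, including "25", maps to an integer.
word_to_index = {"the": 0, "25": 1, "35": 2, "19": 3}

# Embedding layer as a matrix with one row per vocabulary token.
# (4 tokens x 5 dimensions here; real models use 50-300+ dimensions.)
embedding_matrix = np.random.rand(len(word_to_index), 5)

def embed(token):
    """Look up a token's vector via its integer index."""
    return embedding_matrix[word_to_index[token]]

vec_25 = embed("25")
print(vec_25.shape)  # (5,)
```

So “25” is not handled specially at all; it is just another row in the matrix, learned from the contexts it appeared in.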
You’re right that there are too many such numbers to represent them all. One trick that’s often used in NLP is to replace numbers with a special token based on the context.
Imagine your training data contains phone numbers for many people or businesses. In that case, the actual phone numbers are often replaced with a special token like PHONE_NUMBER, which keeps the vocabulary small.
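A sketch of that normalization step with regular expressions (the `normalize_numbers` function and the phone-number format it assumes are illustrative, not from any particular library):

```python
import re

def normalize_numbers(text):
    """Collapse digit sequences into special tokens so the vocabulary
    does not grow with every distinct number seen in the corpus."""
    # Replace phone-number-like patterns first (assumed NNN-NNN-NNNN format).
    text = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "PHONE_NUMBER", text)
    # Replace any remaining standalone number with a generic token.
    text = re.sub(r"\b\d+\b", "NUMBER", text)
    return text

print(normalize_numbers("Call 555-123-4567 before 25 May"))
# Call PHONE_NUMBER before NUMBER May
```

After this preprocessing, the model only ever needs embeddings for PHONE_NUMBER and NUMBER, not for each literal value.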
Why do you suppose the word_to_vec analogy model mapped
25 → 35 :: 19 → 33 ?
The reason complete_analogy('25', '35', '19', word_to_vec_map)
returns 33
is purely the embeddings produced by GloVe.
To customize GloVe, change the corpus and/or the context window size and retrain the model.
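To see why the embeddings alone determine the answer, here is a minimal sketch of the analogy search, assuming `word_to_vec_map` is a dict mapping tokens such as "25" to their GloVe vectors (this mirrors the assignment’s `complete_analogy`, but is a rewrite, not the graded code):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Return the word w maximizing cos(e_b - e_a, e_w - e_c)."""
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue  # skip the input words themselves
        sim = cosine_similarity(e_b - e_a, e_w - e_c)
        if sim > best_sim:
            best_sim, best_word = sim, w
    return best_word
```

Whatever token “33” happened to map to in GloVe’s learned space was simply the one whose offset from “19” best matched the offset from “25” to “35”; there is no arithmetic on the numbers themselves.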