I explained a simple example of Embedding weights here.
If you look at the code, trax uses RandomNormalInitializer(1.0), which just draws the weights from a Normal distribution (the red curve).
(Side note, don’t worry if this part is unclear: you probably don’t need the details of weight initialization and weight scales right now. In short, the 1.0 is a scale factor: by default the samples from the standard Normal distribution are multiplied by 1, so they are left unchanged. For big models you might want to initialize with smaller weights, i.e. a smaller scale.)
As for the start (when the model is initialized for the first time, before training on or “seeing” any example): the embedding table is just random numbers drawn from that Normal distribution.
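To make that concrete, here is a minimal numpy sketch (my own toy code, not trax’s actual implementation, and the names vocab_size, d_feature and token_to_id are made up for the illustration) of what the freshly initialized table is: a (vocab_size, d_feature) matrix of Normal samples, where an “embedding lookup” is just picking the row for a token id.

```python
import numpy as np

vocab_size, d_feature = 5, 4           # tiny vocabulary, 4-dimensional embeddings
scale = 1.0                            # the "1.0" from RandomNormalInitializer(1.0)

rng = np.random.default_rng(0)
E = scale * rng.standard_normal((vocab_size, d_feature))   # the embedding table

token_to_id = {"I": 0, "love": 1, "learning": 2, "NLP": 3, "<unk>": 4}  # made-up ids
print(E[token_to_id["NLP"]])           # "embedding lookup" = selecting one row
```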
Now, before training the model we have to choose its architecture (in other words, make design choices): how are we going to make predictions?
If we decide that word (token) order does not matter, the approach is called Bag of Words. If we care about the words around a given word (token), the approach is called Continuous Bag of Words (the order within the context may still not matter: for example, set([“I”, “love”, “learning”]) could be the “context” and [“NLP”] the “target”).
So, if we decide this is the way to go, then yes, the Embedding table gets updated according to how well the model predicts [“NLP”] when the input is set([“love”, “I”, “learning”]) (there is a small sketch of this below). But if we had chosen another path (a different way of providing inputs, an RNN or a Transformer, etc.), the Embedding table would instead be updated according to how well that model predicts its outcomes.
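Here is a tiny self-contained sketch of that CBOW idea and of the update (again my own toy numpy code, not trax; the model and names like W_out and lr are invented for the illustration): the context embeddings are averaged, a linear layer plus softmax predicts the target, and one gradient step only changes the rows of the table that were actually used as inputs.

```python
import numpy as np

# Toy setup (all names and sizes are made up for the illustration)
vocab = ["I", "love", "learning", "NLP", "<unk>"]
token_to_id = {w: i for i, w in enumerate(vocab)}
vocab_size, d_feature, lr = len(vocab), 4, 0.1

rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_feature))        # embedding table (as above)
W_out = rng.standard_normal((vocab_size, d_feature))    # output projection to vocab logits

context_ids = [token_to_id[w] for w in ["I", "love", "learning"]]
target_id = token_to_id["NLP"]

# Forward pass: the order of the context words does not matter, we just average their rows
context_vec = E[context_ids].mean(axis=0)
logits = W_out @ context_vec
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # softmax over the vocabulary

# Backward pass (cross-entropy loss): gradient of the logits is probs - one_hot(target)
grad_logits = probs.copy()
grad_logits[target_id] -= 1.0
grad_context = W_out.T @ grad_logits                    # gradient w.r.t. the averaged context

# One SGD step: only the rows of E for "I", "love", "learning" change,
# and they change more when the prediction of "NLP" was worse.
W_out -= lr * np.outer(grad_logits, context_vec)
for idx in context_ids:
    E[idx] -= lr * grad_context / len(context_ids)
```

Whatever architecture we pick instead (RNN, Transformer, etc.), the same thing happens in spirit: the prediction error flows back as gradients into the embedding rows of the tokens that were used as inputs.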