In the lab it states that the PPO weight dimensions are ‘768’ + 1 (bias). How do we get 768? I don’t see where those dimensions are explained. Are they set automatically at some point? How did the author know to use that value?
v/r,
Denise
Hello @D_Blady,
768 is a commonly used embedding size in Transformer models such as BERT and T5 (more precisely, their base versions: BERT-base, T5-base, etc.). These models are usually released in several sizes, with smaller and larger variants offering embedding sizes such as 512, 768, 1024, and so on. In these models, the size of the hidden representations typically matches the embedding size.
Generally speaking, the bigger the embedding size, the bigger the model, both in the number of weights and in the storage it takes up. A larger embedding size does, however, allow the model to encode richer representations.
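As a rough illustration of how this scales, the token-embedding table alone holds vocab_size * d_model weights; the 32,000-token vocabulary below is a made-up figure used only for the arithmetic, not tied to any particular model:

```python
# Rough illustration: the embedding table has vocab_size * d_model weights,
# so its parameter count (and storage) grows linearly with the embedding size.
vocab_size = 32_000  # illustrative vocabulary size, not a specific model's value
for d_model in (512, 768, 1024):
    print(f"d_model={d_model}: {vocab_size * d_model:,} embedding parameters")
```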
For Flan-T5-Base, you can check the size in the “d_model” parameter of its config.json on the Hugging Face Hub. The same goes for all other models available through Hugging Face: check the corresponding config.json file.
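You can also confirm it programmatically. A quick sketch, assuming the transformers library is installed and the lab uses the "google/flan-t5-base" checkpoint:

```python
from transformers import AutoConfig

# Load the model's config.json from the Hugging Face Hub and read the hidden size
config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.d_model)  # 768 for the base variant
```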
Best
I’m confused. I thought PPO updated the encoder or decoder weights (either the attention heads or the fully connected layer). Does PPO just update the embedding layer in a transformer?
According to the lab description (towards the end of section 2.1, after the PPO code):
In this lab, PPO is applied in the context of PEFT, so only a small part of the model is updated (see the printout below and the sketch that follows it):
trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%
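A summary like the one above can be produced with a small PyTorch helper along these lines; the exact helper name and formatting used in the lab may differ, so treat this as a minimal sketch:

```python
def print_number_of_trainable_model_parameters(model):
    # Count only parameters with requires_grad=True (here, the PEFT adapter weights)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable model parameters: {trainable}")
    print(f"all model parameters: {total}")
    print(f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")
```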
A more general application of PPO would, unless stated otherwise, update the whole model, i.e. encoder and decoder, attention heads, and fully connected layers.
The reason I mentioned the embedding size earlier is that, in a transformer model, it determines the hidden state size (typically equal to d_model), the per-head attention dimension (d_head = d_model / h, where h is the number of attention heads), the fully connected layer size (often 4 * d_model), and so on. The sketch below shows where these values live in the config.
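If it helps, these related dimensions can be read directly off the Flan-T5-Base config. The attribute names follow the Hugging Face T5Config; this is just a sketch for inspection:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.d_model)                      # hidden / embedding size
print(config.num_heads)                    # number of attention heads h
print(config.d_model // config.num_heads)  # per-head dimension d_head
print(config.d_ff)                         # feed-forward (fully connected) layer size
```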
Thank you. The embedding size determining the hidden state size was my missing piece of the puzzle.