In the lab it states that the PPO weight dimensions are ‘768’ + 1 (bias). How do we get 768? I don’t see where those dimensions are explained. Are they set automatically at some point? How did the author know to use that value?
v/r,
Denise
Hello @D_Blady,
768 is a commonly used embedding size in Transformer models such as BERT and T5 (more precisely, their base versions: BERT-base, T5-base, etc.). These models are usually released in several sizes, with smaller and larger variants offering embedding sizes such as 512, 768, 1024, and so on. In these models, the size of the hidden representations typically matches the embedding size.
Generally speaking, the bigger the embedding size, the bigger the model, both in the number of weights and in the storage it takes up. A larger embedding size does, however, allow the model to encode richer representations.
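As a rough illustration of how this scales, the token-embedding table alone holds vocab_size * d_model weights; the 32,000-token vocabulary below is a made-up figure used only for the arithmetic, not tied to any particular model:

```python
# Rough illustration: the embedding table has vocab_size * d_model weights,
# so its parameter count (and storage) grows linearly with the embedding size.
vocab_size = 32_000  # illustrative vocabulary size, not a specific model's value
for d_model in (512, 768, 1024):
    print(f"d_model={d_model}: {vocab_size * d_model:,} embedding parameters")
```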
For Flan-T5-Base, you can check the size in the “d_model” parameter of its config.json on the Hugging Face Hub. The same goes for all other models available through Hugging Face: check the corresponding config.json file.
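You can also confirm it programmatically. A quick sketch, assuming the transformers library is installed and the lab uses the "google/flan-t5-base" checkpoint:

```python
from transformers import AutoConfig

# Load the model's config.json from the Hugging Face Hub and read the hidden size
config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.d_model)  # 768 for the base variant
```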
Best
I’m confused. I thought PPO updated the encoder or decoder weights (either the attention heads or the fully connected layer). Does PPO just update the embedding layer in a transformer?
According to the lab description (towards the end of section 2.1, after the PPO code):
In this lab, PPO is applied in the context of PEFT, so only a small part of the model is updated (see the printout below and the sketch that follows it):
trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%
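A summary like the one above can be produced with a small PyTorch helper along these lines; the exact helper name and formatting used in the lab may differ, so treat this as a minimal sketch:

```python
def print_number_of_trainable_model_parameters(model):
    # Count only parameters with requires_grad=True (here, the PEFT adapter weights)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable model parameters: {trainable}")
    print(f"all model parameters: {total}")
    print(f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")
```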
A more general application of PPO would, unless stated otherwise, update the whole model, i.e. encoder and decoder, attention heads, and fully connected layers.
The reason I mentioned the embedding size earlier is that, in a transformer model, it determines the hidden state size (typically equal to d_model), the per-head attention dimension (d_head = d_model / h, where h is the number of attention heads), the fully connected layer size (often 4 * d_model), and so on. The sketch below shows where these values live in the config.
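If it helps, these related dimensions can be read directly off the Flan-T5-Base config. The attribute names follow the Hugging Face T5Config; this is just a sketch for inspection:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/flan-t5-base")
print(config.d_model)                      # hidden / embedding size
print(config.num_heads)                    # number of attention heads h
print(config.d_model // config.num_heads)  # per-head dimension d_head
print(config.d_ff)                         # feed-forward (fully connected) layer size
```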
Thank you. The embedding size determining the hidden state size was my missing piece of the puzzle.