Where does "trained with X billion parameters" fit in the encoder/decoder architecture?

Having gone through the basics about architecture in week 1, I have 2 basic questions. Kindly advise if anyone knows the answers (or maybe it's mentioned in some lesson and I missed it):

  1. If I understand correctly, the encoder interprets the meaning of the input and the decoder predicts the output tokens. Where do they get the training for this in the first place? Is it a different architecture / process?
  2. What does it mean when we commonly hear that model X was trained with Y billion parameters? Which part of the architecture discussed in week 1 contributes to this, and how?

Hi @nanospeck ,

Not sure I fully understand your question 1 but here’s an attempt to answer it:

Let's establish first that:

  • Some models are only encoders
  • Some models are only decoders
  • Some models combine both (encoder-decoder)

Whatever the configuration, the whole model is trained at the same time. If we are talking about an encoder-decoder model, each training pass goes from the input of the encoder all the way to the output of the decoder. For a decoder-only model, each training cycle goes from the input of the decoder to the output of the decoder. These cycles repeat over a number of full passes through the data, called 'epochs'. At the end of these epochs we should have a trained model.
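To make the "input of the decoder to the output of the decoder" cycle concrete, here is a minimal sketch of a decoder-only training loop in PyTorch. All the dimensions and the random toy data are assumptions for illustration only, not values from the course; the point is the shape of the loop: forward pass, loss on next-token prediction, backward pass, weight update, repeated per epoch.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (assumed, tiny on purpose)
vocab_size, d_model, n_heads, seq_len = 100, 32, 4, 16

class TinyDecoder(nn.Module):
    """A one-block decoder-only language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One transformer block; the causal mask below makes it decoder-style
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Causal mask: each position may only attend to earlier positions
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.block(self.embed(x), src_mask=mask)
        return self.lm_head(h)

model = TinyDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy "corpus": random token ids standing in for real text
data = torch.randint(0, vocab_size, (8, seq_len + 1))

for epoch in range(3):  # each epoch is one full pass over the data
    inputs, targets = data[:, :-1], data[:, 1:]  # predict the next token
    logits = model(inputs)            # forward: decoder input -> decoder output
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()                   # backward: gradients for every parameter
    opt.step()                        # update the weights and biases
```

An encoder-decoder model would follow the same loop, except each forward pass would run the encoder on the source sequence first and feed its output into the decoder.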

As for your 2nd question: The parameters are mainly in the weights and biases of the model, which are in matrices of the different layers of the model. More specifically, in a transformer architecture, the weights and biases are located in:

  1. The self-attention section: each head has 3 weight matrices: query, key, value. The size of these Q, K, V matrices is a hyperparameter. A transformer can have multiple heads; the original paper suggests 8.

  2. Feed Forward network: This happens after the self-attention mechanism. Here you’ll usually find 2 linear transformations with some activation function, typically a ReLU function.

  3. Normalization layers: these appear in different places, sometimes after, sometimes before and after, the feed-forward section. Here you'll find more weights.

Both the encoder and the decoder contain more or less these same components.

So the 'billions' of parameters come from the weights in the matrices of these components, and the exact count comes from the dimensions the architecture gives the model.

Hope this sheds some light on your question!


Thank you very much for your response @Juan_Olano . I guess I was being too impatient: question 1 is answered in the week 1 lesson about pre-training. Just watched it now!

Thanks for the elaborate answer on question 2. It really helps and is just what I was looking for. Hearing the word "parameter" (with no ML background) misled me into wondering whether it referred to the amount of text corpus used to train the model. Thanks for clarifying its correct meaning.