Decoder-only architectures for sequence modelling

gursi26 · August 12, 2023, 2:39am

This is just a general question regarding decoder-only models and sequence modelling, not specific to any programming assignment from this course.

Let’s say our dataset is a single sentence that goes like so:
“I like to cycle on the weekends”

When tokenizing this sentence, do we split it strictly into fixed-size n-grams, or do we use n-grams up to a certain n and add padding tokens to fill the remaining positions?
Example:
3-grams:
Input: “I like to”, Target: “cycle”
Input: “like to cycle”, Target: “on”
Input: “to cycle on”, Target: “the”
…
up to 3-grams:
Input: “I ”, Target: “like”
Input: “I like ”, Target: “to”
Input: “I like to”, Target: “cycle”
Input: “like ”, Target: “to”
Input: “like to ”, Target: “cycle”
Input: “like to cycle”, Target: “on”

Which of the above tokenization schemes is used?

Is masking used to occlude future words in decoder-only sequence modelling? i.e. does each word only attend to the words that occur before it?

gent.spah · August 12, 2023, 7:05am

Padding is used so that every input to the model has a consistent length, so it is always applied when sequences differ in length.
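
To make the two schemes from the question concrete, here is a minimal sketch in plain Python. The function names `strict_ngram_pairs` and `padded_prefix_pairs` and the `<pad>` token string are hypothetical, chosen only for illustration, and whitespace splitting stands in for a real tokenizer.

```python
PAD = "<pad>"  # hypothetical padding token, not taken from the original post


def strict_ngram_pairs(tokens, n):
    """Fixed-size windows: every input has exactly n tokens."""
    return [(tokens[i:i + n], tokens[i + n]) for i in range(len(tokens) - n)]


def padded_prefix_pairs(tokens, n):
    """Windows of 1..n tokens, right-padded with PAD up to length n."""
    pairs = []
    for start in range(len(tokens)):
        for size in range(1, n + 1):
            end = start + size
            if end >= len(tokens):  # no next token left to use as a target
                break
            context = tokens[start:end]
            pairs.append((context + [PAD] * (n - size), tokens[end]))
    return pairs


tokens = "I like to cycle on the weekends".split()

print("3-grams:")
for context, target in strict_ngram_pairs(tokens, 3):
    print(context, "->", target)

print("up to 3-grams:")
for context, target in padded_prefix_pairs(tokens, 3):
    print(context, "->", target)
```

In the second scheme the padded positions carry no information, so a padding mask would normally be used to keep the model from attending to them.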

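On the second point, masking can serve different purposes, from preventing a word from looking ahead at later words to testing prediction performance, depending on the model architecture. In a decoder-only model, a causal (look-ahead) mask is applied so that each word attends only to itself and the words before it.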
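
As a rough illustration of look-ahead masking, the sketch below builds a causal mask in plain Python and applies it to dummy attention scores. The helper names `causal_mask` and `masked_softmax` are hypothetical, and the all-zero scores are placeholders for what a real model would compute from queries and keys.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest visible score for numerical stability.
        row_max = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - row_max) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = "I like to cycle on the weekends".split()
inputs, targets = tokens[:-1], tokens[1:]  # targets are the inputs shifted by one

mask = causal_mask(len(inputs))
scores = [[0.0] * len(inputs) for _ in inputs]  # dummy scores, assumption for the demo
weights = masked_softmax(scores, mask)

for i, target in enumerate(targets):
    visible = [tok for tok, ok in zip(inputs, mask[i]) if ok]
    print(visible, "->", target)
```

With this mask, one pass over the whole sentence yields a next-word prediction for every prefix at once, which is why decoder-only training typically uses shifted targets rather than explicitly enumerating n-gram pairs.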
