Decoder-only architectures for sequence modelling

This is just a general question regarding decoder-only models and sequence modelling, not specific to any programming assignment from this course.

Let’s say our dataset is a single sentence that goes like so:
“I like to cycle on the weekends”

  1. When tokenizing this sentence, do we split it strictly into n-grams, or do we split it into n-grams up to a certain n and add padding tokens to fill the remaining positions?
    strictly 3-grams:
    Input: “I like to”, Target: “cycle”
    Input: “like to cycle”, Target: “on”
    Input: “to cycle on”, Target: “the”

    up to 3-grams:
    Input: “I ”, Target: “like”
    Input: “I like ”, Target: “to”
    Input: “I like to”, Target: “cycle”
    Input: “like ”, Target: “to”
    Input: “like to ”, Target: “cycle”
    Input: “like to cycle”, Target: “on”

Which of the above tokenization schemes is used?

  2. Is masking used to occlude future words in decoder-only sequence modelling, i.e. does each word only attend to the words that occur before it?

Padding is used so that the inputs to the model have a consistent length, and it is applied whenever an input is shorter than that length.
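To make the two schemes from the question concrete, here is a minimal sketch that builds the input/target pairs for each one. The function names and the "<pad>" token are made up for illustration, and the code only reproduces the examples listed in the question; it is not a claim about how any particular model prepares its data.

```python
PAD = "<pad>"  # hypothetical padding token, chosen for illustration


def strict_ngram_pairs(words, n):
    """Every window of exactly n words predicts the word that follows it."""
    return [(words[i:i + n], words[i + n]) for i in range(len(words) - n)]


def up_to_ngram_pairs(words, n, pad=PAD):
    """Contexts of 1..n words from each start position, right-padded to length n."""
    pairs = []
    for start in range(len(words)):
        for size in range(1, n + 1):
            end = start + size
            if end >= len(words):
                break  # no next word left to use as a target
            context = words[start:end]
            pairs.append((context + [pad] * (n - size), words[end]))
    return pairs


words = "I like to cycle on the weekends".split()
strict = strict_ngram_pairs(words, 3)
padded = up_to_ngram_pairs(words, 3)

for context, target in padded[:3]:
    print(context, "->", target)
```

With padding, every context has length 3, so the shorter prefixes can be batched together with the full 3-grams.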

On the second point, masking can serve different purposes depending on the model architecture, from preventing the model from looking ahead at future words to hiding words in order to test prediction performance.
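The look-ahead case can be sketched as a lower-triangular (causal) mask. This is an illustrative sketch in plain Python, with a made-up helper name, assuming the common convention that each position may attend to itself and to earlier positions; a real model would apply such a mask to the attention scores inside its framework.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = "I like to cycle".split()
mask = causal_mask(len(tokens))

# For each position, list the tokens it is allowed to see.
visible = [[tok for tok, keep in zip(tokens, row) if keep] for row in mask]

for tok, seen in zip(tokens, visible):
    print(f"{tok!r} attends to {seen}")
```

Because row i hides every position after i, all next-word predictions for a sentence can be trained in a single pass without any position seeing its own target.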