This is just a general question regarding decoder-only models and sequence modelling, not specific to any programming assignment from this course.
Let’s say our dataset consists of a single sentence:
“I like to cycle on the weekends”
- When tokenizing this sentence into training pairs, do we split it strictly into fixed-length n-grams, or do we take every context of length up to n and add padding tokens to fill the remaining positions?
Example:
3-grams:
Input: “I like to”, Target: “cycle”
Input: “like to cycle”, Target: “on”
Input: “to cycle on”, Target: “the”
…
up to 3-grams:
Input: “I ”, Target: “like”
Input: “I like ”, Target: “to”
Input: “I like to”, Target: “cycle”
Input: “like ”, Target: “to”
Input: “like to ”, Target: “cycle”
Input: “like to cycle”, Target: “on”
Which of the above tokenization schemes is used? (A small sketch of both schemes follows, just to make sure I’m describing them precisely.)
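To be concrete, here is a minimal Python sketch of the two schemes as I understand them, using plain whitespace tokenization on the example sentence. The function names and the `<pad>` token are just my own illustration, not from any particular library or assignment.

```python
def fixed_ngram_pairs(tokens, n):
    """Scheme 1: strict sliding window of exactly n context tokens."""
    return [(tokens[i:i + n], tokens[i + n]) for i in range(len(tokens) - n)]

def up_to_ngram_pairs(tokens, n, pad="<pad>"):
    """Scheme 2: every context of length 1..n, left-padded to width n."""
    pairs = []
    for start in range(len(tokens) - 1):
        for length in range(1, n + 1):
            end = start + length
            if end >= len(tokens):
                break
            context = tokens[start:end]
            padded = [pad] * (n - len(context)) + context
            pairs.append((padded, tokens[end]))
    return pairs

tokens = "I like to cycle on the weekends".split()
print(fixed_ngram_pairs(tokens, 3)[:3])
# [(['I', 'like', 'to'], 'cycle'), (['like', 'to', 'cycle'], 'on'), (['to', 'cycle', 'on'], 'the')]
print(up_to_ngram_pairs(tokens, 3)[:3])
# [(['<pad>', '<pad>', 'I'], 'like'), (['<pad>', 'I', 'like'], 'to'), (['I', 'like', 'to'], 'cycle')]
```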
- Is masking used to occlude future words in decoder-only sequence modelling? That is, does each word only attend to the words that occur before it?
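And here is a minimal sketch of the causal mask I have in mind, assuming PyTorch; the tensor names are placeholders and the random scores just stand in for the usual scaled dot-product attention logits.

```python
import torch

seq_len = 7  # "I like to cycle on the weekends" -> 7 word-level tokens
scores = torch.randn(seq_len, seq_len)  # stand-in for raw attention scores

# True above the diagonal marks future positions; those get -inf before softmax,
# so each position can only attend to itself and earlier positions.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(masked_scores, dim=-1)  # each row sums to 1 over non-future positions

print(attn[0])  # first token can only attend to itself: [1, 0, 0, 0, 0, 0, 0]
```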