I have difficulty understanding how the tokenizer function works.
In Lab 3, we see:
encoding = tokenizer(example['sentences'], example['question'], truncation=True, padding=True, max_length=tokenizer.model_max_length)
But when the method is called, it is applied to the entire dataset (processed). How does it figure out that it has to go one level down and apply the function to processed["train"] and processed["test"] separately? Does it assume that the dataset must have these two splits (train and test), or does it go down one level regardless?
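To make my question concrete, here is a rough sketch of what I think is happening (the tiny DatasetDict, the tokenize_fn name, and the bert-base-uncased checkpoint are my own placeholders, not necessarily what the lab uses):

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

# Placeholder stand-in for the lab's `processed` DatasetDict.
processed = DatasetDict({
    "train": Dataset.from_dict({
        "sentences": ["The cat sat on the mat."],
        "question": ["Where did the cat sit?"],
    }),
    "test": Dataset.from_dict({
        "sentences": ["The dog barked."],
        "question": ["What did the dog do?"],
    }),
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_fn(example):
    # The Lab 3 call, with the two text arguments forming a pair.
    return tokenizer(
        example["sentences"],
        example["question"],
        truncation=True,
        padding=True,
        max_length=tokenizer.model_max_length,
    )

# As far as I can tell, it is .map() that loops over the splits,
# applying tokenize_fn to processed["train"] and processed["test"]
# one after the other, whatever the split names happen to be.
encoded = processed.map(tokenize_fn)
print(encoded)

Is that roughly the right picture, or does the tokenizer itself know about the splits?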
When I compare this line with a similar line in Lab 2:
tokenized_inputs = tokenizer(examples, truncation=True, is_split_into_words=False, padding='max_length', max_length=512)
In Lab 2, the tokenizer takes a single input argument ("examples"), whereas in Lab 3 it takes two. Does the tokenizer simply concatenate them?
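For reference, this is the kind of small check I would run to compare the one-argument and two-argument calls (again, bert-base-uncased is just a placeholder checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# One text argument, as in Lab 2.
single = tokenizer("The cat sat on the mat.",
                   truncation=True, padding="max_length", max_length=32)

# Two text arguments, as in Lab 3: the sentence first, then the question.
pair = tokenizer("The cat sat on the mat.", "Where did the cat sit?",
                 truncation=True, padding="max_length", max_length=32)

# Decoding the ids shows how the two texts end up in one sequence;
# for a BERT-style tokenizer the pair comes out as [CLS] A [SEP] B [SEP].
print(tokenizer.decode(single["input_ids"]))
print(tokenizer.decode(pair["input_ids"]))

# token_type_ids mark which tokens belong to the first vs. the second text.
print(pair["token_type_ids"])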