C5 W4 Lab 2 and 3, tokenizer

I have difficulty understanding how the tokenizer function works.

In lab 3, we see

encoding = tokenizer(example['sentences'], example['question'], truncation=True, padding=True, max_length=tokenizer.model_max_length)

but when the method is called, it is applied to the entire dataset (processed). How does it figure out that it has to go one level down and apply the tokenizer to processed['train'] and processed['test'] separately? Does it assume that the dataset must have these two fields (train and test), or does it go down one level regardless?

When I compare this line with a similar line in Lab 2:

tokenized_inputs = tokenizer(examples, truncation=True, is_split_into_words=False, padding='max_length', max_length=512)

In Lab 2 the tokenizer takes a single input argument ("examples"), whereas in Lab 3 it takes two. Does the tokenizer simply concatenate them?

Aff

For the 1st question, the tokenizer is applied to processed by this code snippet:

qa_dataset = processed.map(tokenize_align)

processed is a DatasetDict, and its map function applies the transformation to every dataset in the dictionary. So the tokenizer is applied to all of the datasets (train and test) in processed.
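As a minimal sketch of what this does (the toy data below is invented for illustration; the split and field names mirror the lab's):

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy DatasetDict; the split/field names mirror the lab, the data is made up
processed = DatasetDict({
    "train": Dataset.from_dict({"sentences": ["Paris is the capital of France."],
                                "question": ["What is the capital of France?"]}),
    "test":  Dataset.from_dict({"sentences": ["Berlin is the capital of Germany."],
                                "question": ["What is the capital of Germany?"]}),
})

def tokenize_align(example):
    # One example at a time; same tokenizer call as in the lab
    return tokenizer(example["sentences"], example["question"],
                     truncation=True, padding=True,
                     max_length=tokenizer.model_max_length)

# DatasetDict.map loops over the splits ("train", "test") and applies the
# function to each underlying Dataset, so no special handling is needed
qa_dataset = processed.map(tokenize_align)
print(qa_dataset)  # both splits now contain input_ids, token_type_ids, attention_mask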

Regarding the 2nd question, BERT accepts two types of input: a single sentence (e.g., NER, sentiment analysis) or a pair of sentences (e.g., question answering). The BERT tokenizer supports both. For single-sentence input, just omit the 2nd argument. In the function definition, these correspond to the tokenizer's text and text_pair parameters.
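Here is a short sketch of both call styles (the bert-base-uncased checkpoint and the example sentences are only for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single-sentence call (e.g., NER, sentiment analysis):
# produces [CLS] sentence tokens [SEP]
single = tokenizer("The movie was surprisingly good.",
                   truncation=True, padding="max_length", max_length=32)

# Sentence-pair call (e.g., question answering): the 2nd positional
# argument (text_pair) is appended after the 1st, producing
# [CLS] first text [SEP] second text [SEP]; token_type_ids marks the segments
pair = tokenizer("The movie was surprisingly good.",
                 "Was the movie good?",
                 truncation=True, padding="max_length", max_length=32)

print(tokenizer.decode(single["input_ids"]))
print(tokenizer.decode(pair["input_ids"]))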