C5 W4 Lab 2 and 3, tokenizer

I have difficulty understanding how the tokenizer function works.

In lab 3, we see

encoding = tokenizer(example['sentences'], example['question'], truncation=True, padding=True, max_length=tokenizer.model_max_length)

but when the method is called, it is applied to the entire dataset (processed). How does it figure out that it has to go one level down and apply the tokenizer to processed['train'] and processed['test'] separately? Does it assume that the dataset must have these two fields (train and test), or does it go down one level regardless?

When I compare this line with a similar line in Lab 2:

tokenized_inputs = tokenizer(examples, truncation=True, is_split_into_words=False, padding='max_length', max_length=512)

In Lab 2 the tokenizer takes a single input argument ("examples"), whereas in Lab 3 it takes two. Does the tokenizer simply concatenate them?

Aff

For the 1st question, the tokenizer is applied to processed by this code snippet:

qa_dataset = processed.map(tokenize_align)

processed is a DatasetDict, and its map function applies the transformation to every dataset in the dictionary. So the tokenizer is applied to all of the datasets (train and test) in processed.
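As a minimal sketch of what this does (the toy data below is invented for illustration; the split and field names mirror the lab's):

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy DatasetDict; the split/field names mirror the lab, the data is made up
processed = DatasetDict({
    "train": Dataset.from_dict({"sentences": ["Paris is the capital of France."],
                                "question": ["What is the capital of France?"]}),
    "test":  Dataset.from_dict({"sentences": ["Berlin is the capital of Germany."],
                                "question": ["What is the capital of Germany?"]}),
})

def tokenize_align(example):
    # One example at a time; same tokenizer call as in the lab
    return tokenizer(example["sentences"], example["question"],
                     truncation=True, padding=True,
                     max_length=tokenizer.model_max_length)

# DatasetDict.map loops over the splits ("train", "test") and applies the
# function to each underlying Dataset, so no special handling is needed
qa_dataset = processed.map(tokenize_align)
print(qa_dataset)  # both splits now contain input_ids, token_type_ids, attention_mask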

Regarding the 2nd question, BERT accepts two types of input: a single sentence (e.g., NER, sentiment analysis) or a pair of sentences (e.g., question answering). The BERT tokenizer supports both. For single-sentence input, just omit the 2nd argument. In the function definition, these correspond to the tokenizer's text and text_pair parameters.
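Here is a short sketch of both call styles (the bert-base-uncased checkpoint and the example sentences are only for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Single-sentence call (e.g., NER, sentiment analysis):
# produces [CLS] sentence tokens [SEP]
single = tokenizer("The movie was surprisingly good.",
                   truncation=True, padding="max_length", max_length=32)

# Sentence-pair call (e.g., question answering): the 2nd positional
# argument (text_pair) is appended after the 1st, producing
# [CLS] first text [SEP] second text [SEP]; token_type_ids marks the segments
pair = tokenizer("The movie was surprisingly good.",
                 "Was the movie good?",
                 truncation=True, padding="max_length", max_length=32)

print(tokenizer.decode(single["input_ids"]))
print(tokenizer.decode(pair["input_ids"]))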