Fine-tuning BERT Colab

Hi @arvyzukai ,

I am unable to understand the meaning of the following preprocessing steps in the second Colab of BERT fine-tuning:

  1. When there is no answer to a question given a context, you will use the CLS token, a unique token used to represent the start of the sequence. Why do we need CLS here?
  2. Tokenizers can split a given string into substrings, resulting in a subtoken for each substring, creating misalignment between the list of dataset tags and the labels generated by the tokenizer. Therefore, you will need to align the start and end indices with the tokens associated with the target answer word.
  3. Finally, a tokenizer can truncate a very long sequence. So, if the start/end position of an answer is None, you will assume that it was truncated and assign the maximum length of the tokenizer to those positions.

Can you shed some light on this with an example?

Hi @Aaditya1

I’m not sure; you would have to look carefully at the dataset to see why this option was chosen. My intuition is that they did not want to penalize the model on questions whose answer does not exist in the context. But again, I could be wrong here, it’s a good question :+1: and it needs time to investigate. I hope you will do it yourself and provide us with the answer :slight_smile:
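
As a rough illustration (just a sketch with made-up question/context strings, assuming a SQuAD-v2-style setup and the Hugging Face `BertTokenizerFast`, not the exact assignment code), an unanswerable example simply gets both its start and end positions pointed at the [CLS] token:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "Who wrote the report?"
context = "The weather was sunny all week."  # contains no answer to the question

encoding = tokenizer(question, context, truncation=True, max_length=64)

# For an unanswerable example, both positions point at the [CLS] token,
# which sits at index 0 in BERT encodings.
cls_index = encoding["input_ids"].index(tokenizer.cls_token_id)
start_position, end_position = cls_index, cls_index

print(start_position, end_position)  # 0 0
```

So “predicting [CLS]” effectively becomes the label for “no answer”, which keeps a well-defined target for these examples.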

As you saw in the assignment, the word ‘Beginners’ can be tokenized into two substrings, for example [12847, 277] (‘beginn’ and ‘ers’). This subword approach sits somewhere between tokenizing into single characters and tokenizing into whole words, and in practice it’s the best choice.
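
To make the alignment part concrete, here is a minimal sketch using a fast tokenizer’s offset mapping (the sentences are made up and the exact sub-token split depends on the vocabulary, so treat it as an illustration rather than the assignment’s code):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "Who is this course for?"
context = "This course is for Beginners learning NLP."
answer = "Beginners"
answer_start_char = context.index(answer)           # character-level start
answer_end_char = answer_start_char + len(answer)   # character-level end (exclusive)

encoding = tokenizer(question, context, return_offsets_mapping=True)

# The answer word may be split into several sub-tokens, so the character
# indices from the dataset no longer match the token indices directly.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# Walk the offset mapping to find which context tokens cover the answer span.
start_token = end_token = None
for i, (start, end) in enumerate(encoding["offset_mapping"]):
    if encoding.sequence_ids()[i] != 1:
        continue  # skip the question and the special tokens
    if start <= answer_start_char < end and start_token is None:
        start_token = i
    if start < answer_end_char <= end:
        end_token = i

print(start_token, end_token)  # token-level start/end of 'Beginners'
```

This offset-mapping trick is what lets you re-align the dataset’s character-level answer positions with the token-level positions BERT actually trains on.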

Again, I’m not sure, but my intuition is that they chose this approach so as not to penalize the model in the rare cases where the context is too long for the answer to survive truncation. I guess they could have cleaned the dataset in the first place (when they quote: “filtered version with only English examples”), but maybe they wanted to control what happens at inference time when you provide too long a sequence. I’m not sure, you can dig deeper to find out :slight_smile:
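
And for the truncation case, a small sketch of the fallback described in the Colab (again an assumption-laden illustration; the `max_length=64` and the strings are made up):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "Where is the answer?"
# The answer sits at the very end of a context longer than max_length,
# so it gets cut off during tokenization.
context = "Some filler sentence. " * 100 + "The answer is at the very end."

encoding = tokenizer(question, context, truncation=True, max_length=64,
                     return_offsets_mapping=True)

# Pretend the span search from the previous sketch came back empty because
# the answer's characters lie past the truncation point:
start_token, end_token = None, None

# Fallback from the preprocessing step: assign the tokenizer's maximum length
# to any position that could not be located.
if start_token is None:
    start_token = tokenizer.model_max_length
if end_token is None:
    end_token = tokenizer.model_max_length

print(start_token, end_token)  # 512 512 for bert-base-uncased
```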

Cheers