Fine-tuning BERT Colab

Hi @arvyzukai ,

I am unable to understand the meaning of the following preprocessing steps in the second Colab of the BERT fine-tuning assignment:

  1. When there is no answer to a question given a context, you will use the CLS token, a special token that marks the start of the sequence. Why do we need CLS here?
  2. Tokenizers can split a given string into substrings, resulting in a subtoken for each substring, creating misalignment between the list of dataset tags and the labels generated by the tokenizer. Therefore, you will need to align the start and end indices with the tokens associated with the target answer word.
  3. Finally, a tokenizer can truncate a very long sequence. So, if the start/end position of an answer is None, you will assume that it was truncated and assign the maximum length of the tokenizer to those positions.
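
To make sure I'm reading these three steps right, here is my rough understanding sketched in code, with a hand-rolled toy tokenizer standing in for the real BERT one (the vocabulary, offsets, and function names below are all made up for illustration):

```python
# Toy subword vocabulary (made up); the real BERT vocab has ~30k entries.
VOCAB = {"[CLS]", "[SEP]", "the", "beginn", "ers", "guide", "to", "bert"}

def subword_tokenize(word, start):
    """Greedy longest-prefix match; yields (token, (char_start, char_end))."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append((word[i:j], (start + i, start + j)))
                i = j
                break
        else:                              # nothing matched -> unknown token
            pieces.append(("[UNK]", (start + i, start + len(word))))
            break
    return pieces

def tokenize_with_offsets(text):
    tokens = [("[CLS]", (0, 0))]           # step 1: [CLS] sits at token index 0
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        tokens += subword_tokenize(word, start)
        pos = start + len(word)
    tokens.append(("[SEP]", (0, 0)))
    return tokens

def align_answer(tokens, ans_start, ans_end, max_length):
    """Map a character span onto token indices (step 2), falling back to the
    [CLS] index 0 for unanswerable questions (step 1) and to max_length when
    the answer was truncated away (step 3)."""
    if ans_start is None:                  # no answer -> point at [CLS]
        return 0, 0
    start_tok = end_tok = None
    for idx, (_, (s, e)) in enumerate(tokens):
        if s <= ans_start < e:
            start_tok = idx
        if s < ans_end <= e:
            end_tok = idx
    if start_tok is None:                  # span fell past the truncation point
        start_tok = max_length
    if end_tok is None:
        end_tok = max_length
    return start_tok, end_tok

context = "the beginners guide to bert"
tokens = tokenize_with_offsets(context)
print([t for t, _ in tokens])
# 'beginners' spans characters 4..13 but becomes two subtokens:
print(align_answer(tokens, 4, 13, max_length=16))       # -> (2, 3)
print(align_answer(tokens, None, None, max_length=16))  # -> (0, 0)
```

The answer word occupies one character span but two token indices, which is (I think) exactly the misalignment step 2 is about.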

Can you shed some light on this with any example?

Hi @Aaditya1

I’m not sure; you would have to look at the dataset carefully to see why this option was chosen. My intuition is that they did not want to penalize the model for answering questions that have no answer. But again, I could be wrong here, it’s a good question :+1: and it needs time to investigate. I hope you will do it yourself and provide us with the answer :slight_smile:

As you saw in the assignment, the word ‘Beginners’ can be tokenized into two substrings, for example [12847, 277] (‘beginn’ and ‘ers’). This approach sits somewhere between tokenizing by individual characters and tokenizing by whole words. In practice, it’s usually the best choice.
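
Here is a tiny sketch of that kind of greedy longest-match (WordPiece-style) split. Only the ids 12847 and 277 for ‘beginn’/‘ers’ come from the assignment; the rest of this toy vocabulary is made up:

```python
# Toy vocabulary mapping subword pieces to ids (only 12847/277 are from the
# assignment; the other entries are invented for illustration).
VOCAB = {"beginn": 12847, "ers": 277, "begin": 1, "er": 2, "s": 3}

def split_into_subwords(word):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["[UNK]"]          # no vocabulary piece matches here
    return pieces

print(split_into_subwords("beginners"))                       # -> ['beginn', 'ers']
print([VOCAB[p] for p in split_into_subwords("beginners")])   # -> [12847, 277]
```

Note the greediness: even though ‘begin’ is in the vocabulary, the longer ‘beginn’ wins, so one word still becomes exactly two subtokens.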

Again, I’m not sure, but my intuition is that they chose this approach so as not to penalize the model in the rare cases when the context is too long for the answer to be generated. I guess they could have cleaned the dataset in the first place (when they quote: “filtered version with only English examples”), but maybe they wanted to control what happens at inference time when you provide too long a sequence. I’m not sure, you can dig deeper to find out :slight_smile: