It’s explained in the lab that we “need to align the start and end indices with the tokens associated with the target answer word”, but the code provided is:
# Start/end character index of the answer in the text.
gold_text = sample["document_plaintext"][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
start_char = sample["annotations.minimal_answers_start_byte"][0]
end_char = sample['annotations.minimal_answers_end_byte'][0]  # start_char + len(gold_text)
# sometimes answers are off by a character or two – fix this
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
    start_char = start_char - 1
    end_char = end_char - 1  # When the gold label is off by one character
elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
    start_char = start_char - 2
    end_char = end_char - 2  # When the gold label is off by two characters
which seems to do nothing, since we are comparing one slice of sample['document_plaintext'] with another slice of the same string, rather than comparing it with the text extracted from the tokenized input to check whether tokenization has caused any misalignment.
Am I missing something here? HERE IS A LINK TO THE LAB IN QUESTION
QUESTION 2:
If we have already mapped the dataset with the mapping function which returns the following:
I have not gone deep into the code, but just glancing at it I see that it does do “something”: it checks the plain text, and note the two extra lines at the end of the snippet below:
# Start/end character index of the answer in the text.
gold_text = sample["document_plaintext"][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
start_char = sample["annotations.minimal_answers_start_byte"][0]
end_char = sample['annotations.minimal_answers_end_byte'][0]  # start_char + len(gold_text)
# sometimes answers are off by a character or two – fix this
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
    start_char = start_char - 1
    end_char = end_char - 1  # When the gold label is off by one character
elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
    start_char = start_char - 2
    end_char = end_char - 2  # When the gold label is off by two characters
start_token = tokenized_data.char_to_token(start_char)
end_token = tokenized_data.char_to_token(end_char - 1)
The last two lines map the start and end characters to their token positions in the tokenized data.
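For example, with a fast tokenizer from the Transformers library (the checkpoint below is the one quoted later in this thread; the sentence is just a made-up illustration, not lab data):

from transformers import AutoTokenizer

# char_to_token is available on the encodings produced by fast tokenizers.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

context = "The quick brown fox jumps over the lazy dog."
encoding = tokenizer(context)

start_char = context.index("fox")      # character index where the answer starts
end_char = start_char + len("fox")     # character index just past the answer

start_token = encoding.char_to_token(start_char)
end_token = encoding.char_to_token(end_char - 1)
print(start_token, end_token)          # token indices covering "fox"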
The processed_train_data contains other columns as well. The processing function returns four of them; in other words, the columns of processed_train_data are:
Thanks for the prompt response. The question that I (and I think Wissam) have is that in the statement:
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
gold_text is itself derived from sample['document_plaintext'], so barring some extremely unusual situations (e.g. the surrounding text repeating the same characters), this test should almost never evaluate to true, and therefore it does not really do anything. The same goes for the elif branch.
If gold_text were based on the tokenized text instead, that would be a different matter.
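Just to illustrate the point with a made-up snippet (not the lab's data):

document = "Paris is the capital of France."
start_char = document.index("capital")         # 13
end_char = start_char + len("capital")         # 20
gold_text = document[start_char:end_char]      # "capital"

# The lab's check compares a shifted slice of the same string to gold_text,
# which can only match if the text happens to repeat itself.
print(document[start_char - 1:end_char - 1] == gold_text)  # False: " capita" != "capital"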
Again, I have not delved deep, but I suspect the issue is that “plain text” is not always plain: the annotation offsets are given in bytes (the fields are called minimal_answers_start_byte / end_byte), and different encodings (ASCII, UTF-8) can introduce nuances, because a multi-byte character makes byte offsets and character offsets diverge. So I suspect this code is meant to deal with those situations (which might never be an issue or might be very important; you would have to dig deeper).
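A tiny made-up example of what I mean (nothing to do with the lab's actual documents):

# Byte offsets and character offsets diverge as soon as a multi-byte
# UTF-8 character appears before the answer span.
document = "Café opened in 1890."
answer = "1890"

char_start = document.index(answer)                                   # 15 (character offset)
byte_start = document.encode("utf-8").index(answer.encode("utf-8"))   # 16 ("é" takes 2 bytes)

print(char_start, byte_start)                          # 15 16
print(document[byte_start:byte_start + len(answer)])   # prints "890." (off by one character)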
As I understand it, the idea is to “slightly adjust” the start and end positions based on the plain text (in case processing shifted them slightly), rather than after the tokenizer has done something to them, because the annotations refer to the plain text rather than to whatever we have done to it.
Does that make sense? Or is the issue more nuanced and I don’t quite see it?
I hope you went through the lab carefully; the code shared by @Wissam tokenizes the text using AutoTokenizer from the Transformers library.
When loading a tokenizer with any method, you must pass the model checkpoint that you want to fine-tune. Here, you are using the 'distilbert-base-cased-distilled-squad' checkpoint.
The statement above is from the lab code you are working on.
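For reference, that is standard Transformers usage (this snippet is just an illustration, not copied from the lab):

from transformers import AutoTokenizer

# Pass the checkpoint you intend to fine-tune so the tokenizer matches the
# model's vocabulary and special tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

# For extractive QA, the question and the document are encoded as a pair.
encoded = tokenizer("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))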
So the document_plaintext is indeed what gets tokenized. Also, if you look at the beginning of the lab, where the dataset is described, you will understand more about this gold_text; I am sharing a link here so you can have a look.
Once you view the link above, you should understand how the test could evaluate to true for gold_text.
Basically, the dataset covers multiple languages (with English translations) in which the answer texts are marked, and in this lab it is used to fine-tune the transformer model.
I looked at the code again, and it does seem that you’re right. The code does not check the tokenized_data for misalignment; it only checks the string against itself, which I’m sure is not the intended behavior.
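For what it's worth, here is a rough sketch of a check that would actually involve the tokenized input: map the character span to tokens, decode that token span, and compare it with gold_text. The names (tokenized_data, tokenizer, gold_text, start_char, end_char) follow the lab's snippet, but this is only my guess at the intended behaviour, not the lab's code, and the comparison can only be approximate because token boundaries rarely coincide exactly with character boundaries:

start_token = tokenized_data.char_to_token(start_char)
end_token = tokenized_data.char_to_token(end_char - 1)

if start_token is None or end_token is None:
    # The answer falls outside the tokenized window (e.g. truncated context);
    # such samples are usually dropped or given a dummy label.
    pass
else:
    # Decode the tokens the span maps to and compare against the annotation.
    span_ids = tokenized_data["input_ids"][start_token:end_token + 1]
    recovered = tokenizer.decode(span_ids)
    if gold_text.strip().lower() not in recovered.strip().lower():
        print("possible misalignment:", repr(recovered), "vs", repr(gold_text))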