It’s explained in the lab that we “need to align the start and end indices with the tokens associated with the target answer word”, but the code provided is:
# Start/end character index of the answer in the text.
gold_text = sample["document_plaintext"][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
start_char = sample["annotations.minimal_answers_start_byte"][0]
end_char = sample['annotations.minimal_answers_end_byte'][0]  # start_char + len(gold_text)
# sometimes answers are off by a character or two – fix this
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
    start_char = start_char - 1
    end_char = end_char - 1  # When the gold label is off by one character
elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
    start_char = start_char - 2
    end_char = end_char - 2  # When the gold label is off by two characters
which seems to do nothing, since we are comparing one slice of sample['document_plaintext'] with another slice of the same string, rather than comparing it with the text extracted from the tokenized input to check whether tokenization has caused any misalignment.
Am I missing something here? HERE IS A LINK TO THE LAB IN QUESTION
QUESTION 2:
If we have already mapped the dataset with the mapping function which returns the following:
I have not gone deep into the code, but just glancing at it I see that it does do “something”: it checks the plain text, and note the two extra lines at the end of the snippet below:
# Start/end character index of the answer in the text.
gold_text = sample["document_plaintext"][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
start_char = sample["annotations.minimal_answers_start_byte"][0]
end_char = sample['annotations.minimal_answers_end_byte'][0]  # start_char + len(gold_text)
# sometimes answers are off by a character or two – fix this
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
    start_char = start_char - 1
    end_char = end_char - 1  # When the gold label is off by one character
elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
    start_char = start_char - 2
    end_char = end_char - 2  # When the gold label is off by two characters
start_token = tokenized_data.char_to_token(start_char)
end_token = tokenized_data.char_to_token(end_char - 1)
The last two lines map the start and end characters to their token positions in the tokenized data.
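For example, with a fast tokenizer from the Transformers library (the checkpoint below is the one quoted later in this thread; the sentence is just a made-up illustration, not lab data):

from transformers import AutoTokenizer

# char_to_token is available on the encodings produced by fast tokenizers.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

context = "The quick brown fox jumps over the lazy dog."
encoding = tokenizer(context)

start_char = context.index("fox")      # character index where the answer starts
end_char = start_char + len("fox")     # character index just past the answer

start_token = encoding.char_to_token(start_char)
end_token = encoding.char_to_token(end_char - 1)
print(start_token, end_token)          # token indices covering "fox"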
The processed_train_data contains other columns as well. The processing function returns four of them; in other words, the columns of processed_train_data are:
Thanks for the prompt response. The question that I (and I think Wissam) have is that in the statement:
if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
gold_text is itself derived from sample['document_plaintext'], so barring some extremely unusual situations (e.g. the surrounding text repeating the same characters), this test should almost never evaluate to true, and therefore it does not really do anything. The same goes for the elif branch.
If gold_text were based on the tokenized text instead, that would be a different matter.
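Just to illustrate the point with a made-up snippet (not the lab's data):

document = "Paris is the capital of France."
start_char = document.index("capital")         # 13
end_char = start_char + len("capital")         # 20
gold_text = document[start_char:end_char]      # "capital"

# The lab's check compares a shifted slice of the same string to gold_text,
# which can only match if the text happens to repeat itself.
print(document[start_char - 1:end_char - 1] == gold_text)  # False: " capita" != "capital"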
Again, I have not delved deep, but I suspect the issue is that “plain text” is not always plain: the annotation offsets are given in bytes (the fields are called minimal_answers_start_byte / end_byte), and different encodings (ASCII, UTF-8) can introduce nuances, because a multi-byte character makes byte offsets and character offsets diverge. So I suspect this code is meant to deal with those situations (which might never be an issue or might be very important; you would have to dig deeper).
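A tiny made-up example of what I mean (nothing to do with the lab's actual documents):

# Byte offsets and character offsets diverge as soon as a multi-byte
# UTF-8 character appears before the answer span.
document = "Café opened in 1890."
answer = "1890"

char_start = document.index(answer)                                   # 15 (character offset)
byte_start = document.encode("utf-8").index(answer.encode("utf-8"))   # 16 ("é" takes 2 bytes)

print(char_start, byte_start)                          # 15 16
print(document[byte_start:byte_start + len(answer)])   # prints "890." (off by one character)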
As I understand it, the idea is to “slightly adjust” the start and end positions based on the plain text (in case processing shifted them slightly), rather than after the tokenizer has done something to them, because the annotations refer to the plain text rather than to whatever we have done to it.
Does that make sense? Or is the issue more nuanced and I don’t quite see it?
I hope you went through the lab carefully; the code shared by @Wissam tokenizes the text using AutoTokenizer from the Transformers library.
When loading a tokenizer with any method, you must pass the model checkpoint that you want to fine-tune. Here, you are using the 'distilbert-base-cased-distilled-squad' checkpoint.
The statement above is from the lab code you are working on.
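For reference, that is standard Transformers usage (this snippet is just an illustration, not copied from the lab):

from transformers import AutoTokenizer

# Pass the checkpoint you intend to fine-tune so the tokenizer matches the
# model's vocabulary and special tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

# For extractive QA, the question and the document are encoded as a pair.
encoded = tokenizer("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))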
So the document_plaintext is indeed what gets tokenized. Also, if you look at the beginning of the lab, where the dataset is described, you will understand more about this gold_text; I am sharing a link here so you can have a look.
Once you view the link above, you should understand how the test could evaluate to true for gold_text.
Basically, the dataset covers multiple languages (with English translations) in which the answer texts are marked, and in this lab it is used to fine-tune the transformer model.
I looked at the code again, and it does seem that you’re right. The code does not check the tokenized_data for misalignment; it only checks the string against itself, which I’m sure is not the intended behavior.
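For what it's worth, here is a rough sketch of a check that would actually involve the tokenized input: map the character span to tokens, decode that token span, and compare it with gold_text. The names (tokenized_data, tokenizer, gold_text, start_char, end_char) follow the lab's snippet, but this is only my guess at the intended behaviour, not the lab's code, and the comparison can only be approximate because token boundaries rarely coincide exactly with character boundaries:

start_token = tokenized_data.char_to_token(start_char)
end_token = tokenized_data.char_to_token(end_char - 1)

if start_token is None or end_token is None:
    # The answer falls outside the tokenized window (e.g. truncated context);
    # such samples are usually dropped or given a dummy label.
    pass
else:
    # Decode the tokens the span maps to and compare against the annotation.
    span_ids = tokenized_data["input_ids"][start_token:end_token + 1]
    recovered = tokenizer.decode(span_ids)
    if gold_text.strip().lower() not in recovered.strip().lower():
        print("possible misalignment:", repr(recovered), "vs", repr(gold_text))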