C4_W3_2 Fine-tuning tydiqa

In week 3 of Natural Language Processing with Attention Models, in the C4_W3_2_Question_Answering_with_BERT_and_HuggingFace_Pytorch_tydiqa exercise, we fine-tune a BERT model on the tydiqa dataset.
The BERT model is distilbert-base-cased-distilled-squad, a pre-trained model that was fine-tuned on the SQuAD dataset (https://huggingface.co/datasets/squad).
The SQuAD dataset follows a particular schema (an example record is sketched after the list):

  • id: a string feature.
  • title: a string feature.
  • context: a string feature.
  • question: a string feature.
  • answers: a dictionary feature containing:
    • text: a string feature.
    • answer_start: an int32 feature.
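
For illustration, a single record in this schema looks roughly like the following (in the Hugging Face datasets library, text and answer_start are actually stored as lists, since a question can have several annotated answers). The values below are made up for this example, not taken from the real dataset:

example = {
    'id': '0001',  # made-up id, for illustration only
    'title': 'Eiffel Tower',
    'context': 'The Eiffel Tower is located in Paris.',
    'question': 'Where is the Eiffel Tower located?',
    'answers': {'text': ['Paris'], 'answer_start': [31]},  # character offset of the answer in the context
}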

However, in this notebook, when we retrain our model, our processed_train_data has a different schema, with keys such as ‘input_ids’, ‘attention_mask’, ‘start_positions’, ‘end_positions’, ‘gold_text’, etc.

I understand that processed_train_data actually contains the same kind of information as the SQuAD dataset, just in a different structure.
I also understand that the SQuAD dataset was pre-processed for fine-tuning distilbert-base-cased-distilled-squad. So my first question is: where can I find exact information about the expected input dataset for this particular task?
I cannot find the relevant information here (distilbert-base-cased-distilled-squad · Hugging Face).

Secondly, in our tydiqa data, only a few samples come with a short answer (gold_text). Shouldn’t we use only those samples that do have this gold_text content for this particular fine-tuning task?

Thank you,

Marios

Dear @Marios_Lioutas,
Welcome to the Discourse community, and thanks a lot for asking this question.

To answer your first question: the expected input for fine-tuning the distilbert-base-cased-distilled-squad model is described in the Hugging Face model card for the model (link). However, the model card does not provide explicit details on the pre-processing steps used to convert the SQuAD dataset into the format used for fine-tuning. For a general idea of how to preprocess and fine-tune a model with a custom dataset, you can refer to the Hugging Face Transformers documentation on fine-tuning with custom datasets (link).

Regarding your second question: when fine-tuning the model on the tydiqa dataset, it is indeed a good idea to use only those samples that have a gold_text (short answer). The model is being fine-tuned for a question-answering task, and samples without a gold_text might not provide useful information for it to learn from. That said, you can also experiment with different strategies, such as using samples without gold_text as negative examples, to see whether that improves the model’s performance. A minimal sketch of the filtering step is shown below.
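
As a minimal sketch, assuming the tydiqa examples are held in a Hugging Face datasets.Dataset under a hypothetical name train_data, and that gold_text is an empty string when no short answer exists, the filtering could look like this:

# Hypothetical variable name; adapt the condition to how your notebook stores the data.
# Keep only the examples that come with a non-empty short answer (gold_text).
filtered_train_data = train_data.filter(
    lambda example: len(example['gold_text'].strip()) > 0
)

print(f"Kept {len(filtered_train_data)} of {len(train_data)} examples with a gold_text answer")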

Since you are using Python and TensorFlow, you can use the Hugging Face Transformers library to fine-tune the model. Here’s an example of how to load the pre-trained distilbert-base-cased-distilled-squad model using the library:

from transformers import TFDistilBertForQuestionAnswering

model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
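
To make the expected input format more concrete, here is a hedged sketch, using the model object loaded above, of how a single question/context pair is typically tokenized and passed to a question-answering model: the model consumes input_ids and attention_mask, and the answer span is expressed as token indices (the question and context strings below are made up for illustration):

from transformers import DistilBertTokenizerFast
import tensorflow as tf

# Load the tokenizer that matches the pre-trained checkpoint.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased-distilled-squad')

question = "Where is the Eiffel Tower located?"
context = "The Eiffel Tower is located in Paris."

# The tokenizer produces the input_ids and attention_mask the model expects.
inputs = tokenizer(question, context, return_tensors="tf", truncation=True, padding=True)

# The model returns one logit per token for the answer start and end positions.
outputs = model(inputs)
start_index = int(tf.argmax(outputs.start_logits, axis=-1)[0])
end_index = int(tf.argmax(outputs.end_logits, axis=-1)[0])

# Decode the predicted answer span back to text.
answer = tokenizer.decode(inputs["input_ids"][0][start_index:end_index + 1])
print(answer)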

For fine-tuning, you can follow the general guidelines provided in various tutorials and articles, such as the one on fine-tuning DistilBERT for binary classification tasks (link), or the one on fine-tuning BERT for text classification (link). Although these examples are for classification tasks, you can adapt the code and techniques to the question-answering task with the tydiqa dataset; a rough sketch of an adapted training step follows below.
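
This is only a sketch, under the assumption that each batch is a dict of TensorFlow tensors with input_ids, attention_mask, start_positions and end_positions (the batch construction itself is not shown here):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# start/end positions are token indices, so sparse categorical cross-entropy on the logits fits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(batch):
    with tf.GradientTape() as tape:
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            training=True,
        )
        # Average the losses of predicting the answer's start and end token.
        start_loss = loss_fn(batch["start_positions"], outputs.start_logits)
        end_loss = loss_fn(batch["end_positions"], outputs.end_logits)
        loss = (start_loss + end_loss) / 2.0
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
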
Please feel free to ask a follow-up question if you feel uncertain.
Best,
Can