C4_W3_2 Fine-tuning tydiqa

In week 3 of Natural Language Processing with Attention Models, in the C4_W3_2_Question_Answering_with_BERT_and_HuggingFace_Pytorch_tydiqa exercise, we fine-tune a BERT model on the tydiqa dataset.
The BERT model is distilbert-base-cased-distilled-squad, a pre-trained model that was fine-tuned on the SQuAD dataset (https://huggingface.co/datasets/squad).
The SQuAD dataset follows a particular schema (an example record is sketched after the list):

  • id: a string feature.
  • title: a string feature.
  • context: a string feature.
  • question: a string feature.
  • answers: a dictionary feature containing:
    • text: a string feature.
    • answer_start: an int32 feature.
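
For illustration, a single record in this schema looks roughly like the following (in the Hugging Face datasets library, text and answer_start are actually stored as lists, since a question can have several annotated answers). The values below are made up for this example, not taken from the real dataset:

example = {
    'id': '0001',  # made-up id, for illustration only
    'title': 'Eiffel Tower',
    'context': 'The Eiffel Tower is located in Paris.',
    'question': 'Where is the Eiffel Tower located?',
    'answers': {'text': ['Paris'], 'answer_start': [31]},  # character offset of the answer in the context
}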

However, in this notebook, when we retrain our model, our processed_train_data has a different schema, with keys such as ‘input_ids’, ‘attention_mask’, ‘start_positions’, ‘end_positions’, ‘gold_text’, etc.

I understand that processed_train_data actually contains the same kind of information as the SQuAD dataset, just in a different structure.
I also understand that the SQuAD dataset was pre-processed for fine-tuning distilbert-base-cased-distilled-squad. So my first question is: where can I find exact information about the expected input dataset for this particular task?
I cannot find the relevant information here (distilbert-base-cased-distilled-squad · Hugging Face).

Secondly, in our tydiqa data, only a few samples come with a short answer (gold_text). Shouldn’t we use only those samples that do have this gold_text content for this particular fine-tuning task?

Thank you,

Marios

Dear @Marios_Lioutas,
Welcome to the Discourse community, and thanks a lot for asking this question.

To answer your first question: the expected input for fine-tuning the distilbert-base-cased-distilled-squad model is described in the Hugging Face model card for the model (link). However, the model card does not provide explicit details on the pre-processing steps used to convert the SQuAD dataset into the format used for fine-tuning. For a general idea of how to preprocess and fine-tune a model with a custom dataset, you can refer to the Hugging Face Transformers documentation on fine-tuning with custom datasets (link).

Regarding your second question: when fine-tuning the model on the tydiqa dataset, it is indeed a good idea to use only those samples that have a gold_text (short answer). The model is being fine-tuned for a question-answering task, and samples without a gold_text might not provide useful information for it to learn from. That said, you can also experiment with different strategies, such as using samples without gold_text as negative examples, to see whether that improves the model’s performance. A minimal sketch of the filtering step is shown below.
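
As a minimal sketch, assuming the tydiqa examples are held in a Hugging Face datasets.Dataset under a hypothetical name train_data, and that gold_text is an empty string when no short answer exists, the filtering could look like this:

# Hypothetical variable name; adapt the condition to how your notebook stores the data.
# Keep only the examples that come with a non-empty short answer (gold_text).
filtered_train_data = train_data.filter(
    lambda example: len(example['gold_text'].strip()) > 0
)

print(f"Kept {len(filtered_train_data)} of {len(train_data)} examples with a gold_text answer")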

Since you are using Python and TensorFlow, you can use the Hugging Face Transformers library to fine-tune the model. Here’s an example of how to load the pre-trained distilbert-base-cased-distilled-squad model using the library:

from transformers import TFDistilBertForQuestionAnswering

model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
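
To make the expected input format more concrete, here is a hedged sketch, using the model object loaded above, of how a single question/context pair is typically tokenized and passed to a question-answering model: the model consumes input_ids and attention_mask, and the answer span is expressed as token indices (the question and context strings below are made up for illustration):

from transformers import DistilBertTokenizerFast
import tensorflow as tf

# Load the tokenizer that matches the pre-trained checkpoint.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased-distilled-squad')

question = "Where is the Eiffel Tower located?"
context = "The Eiffel Tower is located in Paris."

# The tokenizer produces the input_ids and attention_mask the model expects.
inputs = tokenizer(question, context, return_tensors="tf", truncation=True, padding=True)

# The model returns one logit per token for the answer start and end positions.
outputs = model(inputs)
start_index = int(tf.argmax(outputs.start_logits, axis=-1)[0])
end_index = int(tf.argmax(outputs.end_logits, axis=-1)[0])

# Decode the predicted answer span back to text.
answer = tokenizer.decode(inputs["input_ids"][0][start_index:end_index + 1])
print(answer)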

For fine-tuning, you can follow the general guidelines provided in various tutorials and articles, such as the one on fine-tuning DistilBERT for binary classification tasks (link), or the one on fine-tuning BERT for text classification (link). Although these examples are for classification tasks, you can adapt the code and techniques to the question-answering task with the tydiqa dataset; a rough sketch of an adapted training step follows below.
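
This is only a sketch, under the assumption that each batch is a dict of TensorFlow tensors with input_ids, attention_mask, start_positions and end_positions (the batch construction itself is not shown here):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# start/end positions are token indices, so sparse categorical cross-entropy on the logits fits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(batch):
    with tf.GradientTape() as tape:
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            training=True,
        )
        # Average the losses of predicting the answer's start and end token.
        start_loss = loss_fn(batch["start_positions"], outputs.start_logits)
        end_loss = loss_fn(batch["end_positions"], outputs.end_logits)
        loss = (start_loss + end_loss) / 2.0
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
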
Please feel free to ask a follow-up question if you feel uncertain.
Best,
Can