GSM8K dataset provided in Exercise is corrupted / incorrectly saved (loads only metadata)

I believe there is a critical issue with the GSM8K dataset provided in the environment for this assignment.

When loading the dataset using:

from datasets import load_from_disk
ds = load_from_disk("/app/data/gsm8k")

The object loads as a DatasetDict, but both the train and test splits contain only 1 row each, and the features are not the actual GSM8K fields. Instead, they are internal metadata fields:

['_data_files', '_fingerprint', '_format_columns', '_format_kwargs',
 '_format_type', '_output_all_columns', '_split']

This means the dataset does not contain the real GSM8K samples (e.g., question, answer or input, output). It appears that the dataset was saved incorrectly, likely by serializing a wrapper or metadata object instead of the actual dataset contents.
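For anyone who wants to confirm the same symptoms, here is a minimal sketch of a check. It relies only on the `column_names` and `num_rows` attributes that Hugging Face `Dataset` objects expose, so it should also work on the result of `load_from_disk`. The function name `diagnose_splits`, the expected column set, and the `min_rows` threshold are my own assumptions, not part of the assignment.

```python
from types import SimpleNamespace

# Assumption: the standard GSM8K field names; adjust if the lab uses input/output.
EXPECTED_COLUMNS = {"question", "answer"}


def diagnose_splits(splits, expected_columns=EXPECTED_COLUMNS, min_rows=100):
    """Return human-readable problems found in a DatasetDict-like mapping.

    `splits` maps split names to objects exposing `column_names` and `num_rows`.
    `min_rows` is an arbitrary sanity threshold chosen for illustration.
    """
    problems = []
    for name, split in splits.items():
        columns = list(split.column_names)
        if columns and all(c.startswith("_") for c in columns):
            problems.append(f"{name}: columns look like internal metadata: {columns}")
        elif not expected_columns.issubset(columns):
            problems.append(f"{name}: missing expected columns, found {columns}")
        if split.num_rows < min_rows:
            problems.append(f"{name}: only {split.num_rows} row(s)")
    return problems


if __name__ == "__main__":
    # Stand-in objects mimicking what is reported above. With the real library,
    # pass the object returned by load_from_disk("/app/data/gsm8k") instead.
    reported = ["_data_files", "_fingerprint", "_format_columns", "_split"]
    broken = {
        "train": SimpleNamespace(column_names=reported, num_rows=1),
        "test": SimpleNamespace(column_names=reported, num_rows=1),
    }
    for problem in diagnose_splits(broken):
        print(problem)
```

An empty list back from `diagnose_splits` would suggest the splits at least have the expected shape; it does not prove the contents are correct.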

As a result:

  • len(gsm8k_dataset) returns 1 instead of thousands of samples

  • Any evaluation code using .select(range(num_samples)) fails with IndexError

  • Model evaluation accuracy is meaningless because it is computed on an invalid dataset
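Until the artifact is fixed, a small guard around the sampling step can at least make the failure explicit. This is only a sketch: `select_first_n` is a hypothetical helper of mine, and it assumes the dataset object supports `len()` and `.select()` the way a Hugging Face `Dataset` does.

```python
def select_first_n(dataset, num_samples):
    """Return the first `num_samples` rows, failing early with a clear message.

    Hypothetical helper: raises ValueError instead of letting .select()
    hit an out-of-range index on a truncated or corrupted dataset.
    """
    available = len(dataset)
    if num_samples > available:
        raise ValueError(
            f"Requested {num_samples} samples but the dataset has only "
            f"{available} row(s); the dataset on disk may be corrupted."
        )
    return dataset.select(range(num_samples))
```

I chose to raise rather than silently clamp to the available rows, because clamping would let the evaluation run on a single invalid row and report a meaningless accuracy.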

This is not a coding mistake in the notebook, but a problem with the dataset artifact shipped with the assignment.

Expected behavior:

The directory /app/data/gsm8k should contain a properly saved HuggingFace dataset created via Dataset.save_to_disk(), with real GSM8K fields such as:

  • question / answer, or

  • input / output

and with the correct number of rows in each split.
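For staff or learners with shell access, the on-disk layout can be inspected without the `datasets` library. The sketch below assumes the layout I have seen `save_to_disk()` produce: a `dataset_dict.json` at the root listing the splits, and one folder per split holding `state.json`, `dataset_info.json`, and one or more `.arrow` files. The function name `inspect_saved_dataset` is hypothetical. As a side note, the column names reported above resemble the keys normally found inside `state.json`, which may hint that a metadata file was ingested as data, but that is only a guess.

```python
import json
from pathlib import Path


def inspect_saved_dataset(root):
    """Summarise the files found under a save_to_disk() directory."""
    root = Path(root)
    report = {}
    dict_file = root / "dataset_dict.json"
    if dict_file.exists():
        split_names = json.loads(dict_file.read_text()).get("splits", [])
    else:
        # No dataset_dict.json: treat the root itself as a single dataset.
        split_names = ["."]
    for name in split_names:
        split_dir = root / name
        arrow_files = sorted(p.name for p in split_dir.glob("*.arrow"))
        report[name] = {
            "has_state_json": (split_dir / "state.json").exists(),
            "has_dataset_info": (split_dir / "dataset_info.json").exists(),
            "arrow_files": arrow_files,
            # Total size of the Arrow data; a few hundred bytes would be suspicious.
            "arrow_bytes": sum((split_dir / f).stat().st_size for f in arrow_files),
        }
    return report


# Example (path taken from the post above):
# print(json.dumps(inspect_saved_dataset("/app/data/gsm8k"), indent=2))
```

A very small `arrow_bytes` value for a split would be consistent with the one-row symptom described above.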

Impact:

This issue prevents correct implementation and evaluation of Exercises 2 and 3, since the dataset itself is invalid.

I recommend rebuilding and re-uploading the GSM8K dataset used by the environment.

@lesly.zerna

Can you please look into this issue? Multiple learners have reported issues with this lab: some had trouble running the code, and when I tried to open the lab myself, I got a 502 kernel response. Now this learner seems to be pointing out a mismatch between the dataset used and the output mentioned.

It would be helpful to learners if someone from staff could address all the issues related to this lab.

Thank you
DP

FYI, Lesly is not the tech lead for this course.

I have notified the tech leads via a private message, but I forgot to post a reply about that.

I know she is not the tech lead for the course, but she usually relays issues to the technical team, and she has been looking after the short courses. I just wanted someone from staff to be notified, since multiple learners have reported issues with this lab.

This problem also exists in Exercise 1.

@Maclen_Marvit

It would be more helpful if you created a separate topic with a screenshot, so that both your issue and the topic creator's can be addressed, compared, and resolved more efficiently for every learner’s benefit.

I also didn’t understand what you mean by Exercise 1.

Just make sure to briefly explain the issue you encountered, even if you find similar threads, so that staff can help you better when they read your response.

Regards
DP

Module 1: Graded lab

This has already been reported to the course staff.

cc @jan.ravnik @a-zarta

This is solved now. Thanks for reporting.


Thanks for fixing it.