I believe there is a critical issue with the GSM8K dataset provided in the environment for this assignment.
When loading the dataset using:

    from datasets import load_from_disk
    ds = load_from_disk("/app/data/gsm8k")
the object loads as a `DatasetDict`, but the train and test splits each contain only one row, and the features are not the actual GSM8K fields. Instead, they are internal metadata fields:

    ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs',
     '_format_type', '_output_all_columns', '_split']
This means the dataset does not contain the real GSM8K samples (e.g., `question`/`answer` or `input`/`output` columns). It appears the dataset was saved incorrectly, likely by serializing a wrapper or metadata object instead of the actual dataset contents.
As a result:

- `len(gsm8k_dataset)` returns 1 instead of thousands of samples
- any evaluation code using `.select(range(num_samples))` fails with `IndexError`
- model evaluation accuracy is meaningless because it is computed on an invalid dataset
This is not a coding mistake in the notebook, but a problem with the dataset artifact shipped with the assignment.
Expected behavior:
The directory `/app/data/gsm8k` should contain a properly saved HuggingFace dataset created via `Dataset.save_to_disk()`, with real GSM8K fields such as:

- `question`/`answer`, or
- `input`/`output`

and with the correct number of rows in each split.
Impact:
This issue prevents correct implementation and evaluation of Exercises 2 and 3, since the dataset itself is invalid.
I recommend rebuilding and re-uploading the GSM8K dataset used by the environment.
