DatasetNotFoundError:


How to solve this?

It’s because Upstage removed “Pretraining_Dataset” from their Hugging Face page. You can see that no such dataset exists here: https://huggingface.co/upstage/datasets

You can load the original dataset instead. Make sure you use streaming mode, since the original dataset is too big to download entirely. Then you can take the first 60K rows. I took only 600 rows in the screenshot here to save download time. Note that this sample might not match Upstage’s exactly, because the speaker didn’t say how they sampled from the original 1T dataset.

from datasets import Dataset, load_dataset

# Stream the original dataset instead of downloading all of it
ptds_stream = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "default",
    split="train",
    streaming=True,
)
# print(next(iter(ptds_stream)))  # uncomment to inspect one record

# Pull only the first 600 rows from the stream (use 60_000 for the full sample)
rows = list(ptds_stream.take(600))
print(len(rows))

# Convert to a regular (non-streaming) Hugging Face Dataset
pretraining_dataset = Dataset.from_list(rows)
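
If it helps to see why taking the first rows from a stream is cheap, here is a minimal stdlib-only sketch. `fake_stream` and `take_rows` are hypothetical names I made up for illustration, and the generator only stands in for a streamed dataset; it is not how the `datasets` library is implemented. It shows that only the requested number of records is ever pulled from the source.

```python
from itertools import islice


def take_rows(stream, n):
    """Return the first n records of any iterable without consuming the rest."""
    return list(islice(stream, n))


def fake_stream(counter):
    """Hypothetical stand-in for a streamed dataset; counts records produced."""
    i = 0
    while True:  # endless, like a dataset too large to download entirely
        counter["produced"] += 1
        yield {"text": f"document {i}"}
        i += 1


counter = {"produced": 0}
rows = take_rows(fake_stream(counter), 600)

# Compare how many rows we kept with how many the source had to produce
print(len(rows), counter["produced"])
```

The same idea is what makes the 600-row slice above much faster to fetch than the 60K-row one.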

Reference screenshot here: there is no dataset called “Pretraining_Dataset” on Upstage’s Hugging Face page.

Then Lesson 2 in this short course needs to be fixed, right?