DatasetNotFoundError:


How to solve this?

It’s because Upstage removed “Pretraining_Dataset” from their Hugging Face page. No such dataset exists here: https://huggingface.co/upstage/datasets

You can download the original dataset instead. Use streaming mode, since the original dataset is too big to download entirely, then take the first 60K rows. I chose 600 rows here just to save download time. Note that this dataset might not match Upstage’s sample exactly, because the speaker didn’t say how they sampled from the original 1T dataset.

import datasets
from datasets import Dataset

# Stream the original dataset instead of downloading it in full
ptds_stream = datasets.load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "default",
    split="train",
    streaming=True,
)
# print(next(iter(ptds_stream)))  # uncomment to inspect a single row

# Take only the first 600 rows (use 60_000 to match the course's size)
pretraining_dataset = list(ptds_stream.take(600))
print(len(pretraining_dataset))

# Convert to a regular (non-streaming) Hugging Face Dataset
pretraining_dataset = Dataset.from_list(pretraining_dataset)
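
If it helps to see why streaming plus slicing stays cheap, here is a minimal offline sketch of the same idea. It does not touch the network: `fake_stream` and `take_first_rows` are hypothetical names I made up for illustration (they are not part of the `datasets` library), and the `text`/`meta` fields are only an assumed row shape. The point is that only the rows you ask for are ever produced, even though the source is effectively endless.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_first_rows(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n rows of a (possibly huge) row stream."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for a streaming dataset: yields rows forever."""
    i = 0
    while True:
        # Assumed row shape, for illustration only
        yield {"text": f"document {i}", "meta": {"row": i}}
        i += 1


# Only 600 rows are generated; the rest of the stream is never read
rows = take_first_rows(fake_stream(), 600)
print(len(rows))
```

The same helper should also work on any other iterable of rows, since it relies only on plain iteration.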

Reference screenshot: there is no dataset called “Pretraining_Dataset” on Upstage’s Hugging Face page.

Then Lesson 2 in this short course needs to be fixed, right?

Yeah, it’s 2026 and it has not been fixed yet :melting_face:

Hello!

We, along with Upstage, have fixed the issue! The datasets are back and the notebook is working as expected.

Thank you for reporting this! Looking forward to seeing what you keep learning and building!

-- Lesly, DLAI
