I get an error on dataset = load_dataset(huggingface_dataset_name)

I am trying to run the notebook in Lab 1 and am stuck at the following step:

huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
This produces an error:
ValueError                                Traceback (most recent call last)
Cell In[18], line 3
      1 huggingface_dataset_name = "knkarthick/dialogsum"
----> 3 dataset = load_dataset(huggingface_dataset_name)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1767, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1762 verification_mode = VerificationMode(
   1763     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   1764 )
   1766 # Create a dataset builder
-> 1767 builder_instance = load_dataset_builder(

Please help

Try upgrading the datasets library with %pip install -U datasets, then restart the kernel. On the second run, skip the cell that installs the libraries. That seems to have worked for me.


@Sanket_Panchalwar, where do I run %pip install -U datasets?
Can I run it in the same cell as the other pip install commands?

Worked perfectly! Thanks so much.

Awesome. That worked. Thanks very much. For the record, it works with datasets version 2.17.0 (“datasets-2.17.0”).

I am still getting the following error:

Found cached dataset csv (file:///root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)

NotImplementedError Traceback (most recent call last)
Cell In[8], line 3
1 huggingface_dataset_name = "knkarthick/dialogsum"
----> 3 dataset = load_dataset(huggingface_dataset_name)

File /opt/conda/lib/python3.10/site-packages/datasets/load.py:1804, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1800 # Build dataset for splits
1801 keep_in_memory = (
1802 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1803 )
-> 1804 ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
1805 # Rename and cast features to match task schema
1806 if task is not None:

File /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1108, in DatasetBuilder.as_dataset(self, split, run_post_process, verification_mode, ignore_verifications, in_memory)
1106 is_local = not is_remote_filesystem(self._fs)
1107 if not is_local:
-> 1108 raise NotImplementedError(f"Loading a dataset cached in a {type(self._fs).__name__} is not supported.")
1109 if not os.path.exists(self._output_dir):
1110 raise FileNotFoundError(
1111 f"Dataset {self.name}: could not find data in {self._output_dir}. Please make sure to call "
1112 "builder.download_and_prepare(), or use "
1113 "datasets.load_dataset() before trying to access the Dataset object."
1114 )

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

It still doesn't work.

I have the same problem. I applied the above instructions but it failed.

+1. Same problem for me too. I made no changes to the Week 2 notebook, but the load dataset step keeps failing.

Getting the same error on the Week 2 labs.

After several attempts, the commands below ran successfully. Thanks @Sanket_Panchalwar and @vaxy…

%pip install -U datasets

%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.17.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

“In the second run skip installing the libraries cell.” This part is very important: if you run that cell again after restarting the kernel, the error persists.

Hi everyone! Thank you for reporting this. We are looking into this issue. Will update you as soon as possible. In the meantime, please try Sanket and Derya’s workarounds. Thank you and sorry for the inconvenience!

Can someone please explain in a way that anyone can understand?

What is the exact command to run and where?

What does “in the second run skip installing the libraries cell” mean?

Maybe a screenshot, or something?

+1 how exactly do I skip installing the libraries?
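
For anyone unsure whether to run or skip the install cell, one option is to ask the kernel which datasets version it actually sees before calling load_dataset. The following is only a rough sketch, assuming that 2.17.0 (the version reported as working earlier in this thread) is a reasonable threshold; the helper names version_tuple and needs_upgrade are made up for illustration and are not part of the lab notebook.

```python
from importlib import metadata

# Assumption: 2.17.0 is the version reported as working in this thread.
MIN_DATASETS = (2, 17, 0)


def version_tuple(text):
    """Turn '2.17.0' into (2, 17, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def needs_upgrade(installed, minimum=MIN_DATASETS):
    """Return True when the package is missing or older than `minimum`."""
    return installed is None or version_tuple(installed) < minimum


# Look up the version the current kernel would import.
try:
    current = metadata.version("datasets")
except metadata.PackageNotFoundError:
    current = None

if needs_upgrade(current):
    print(f"datasets is {current}: run %pip install -U datasets, then restart the kernel.")
else:
    print(f"datasets is {current}: skip the install cell and go straight to load_dataset.")
```

Run it in a new cell right after restarting the kernel. If it reports a new enough version, "skipping the install cell" simply means not executing that cell again and continuing with the cells below it.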

Hi everyone! The issue should now be fixed. If you launch the lab again from the classroom, you should see pip install -U datasets in the 2nd code cell.

After you restart the kernel, don't run the “pip install” cell of your Jupyter notebook again.

@nik95, that should not be necessary, as the issue has been fixed.

It is not fixed. You posted 5 hours ago, and I just restarted the lab and it still gives the error.

Hi Kevin. Can you post here a screenshot of the pip install cell (usually the 2nd code cell of the lab), and also a screenshot of the error after running the pip installs? I can forward it to the team for checking. Thanks.
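
Alongside a screenshot, it may help to paste the versions the kernel actually has installed. Below is a minimal sketch that uses only the standard library; the package list is copied from the install commands posted earlier in this thread, and version_report is a hypothetical helper name, not something from the lab.

```python
from importlib import metadata

# Package names taken from the install commands posted earlier in this thread.
PACKAGES = ["torch", "torchdata", "transformers", "datasets",
            "evaluate", "rouge_score", "loralib", "peft"]


def version_report(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Print one line per package so the output can be pasted into a reply.
for name, version in version_report().items():
    print(f"{name}: {version}")
```

Running this right after the failing load_dataset cell shows whether the upgraded datasets is the one the kernel is using.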