When running the second cell:
from sklearn.datasets import fetch_20newsgroups
# Load the 20 Newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
I get the error: HTTPError: HTTP Error 403: Forbidden
Hi @xzhang14,
I have already reported this issue to staff. Please follow the thread below for updates on this issue.
In the meantime, replace the failing cell with the following code:
from datasets import load_dataset
import pandas as pd
# 1. Load the dataset
dataset = load_dataset("SetFit/20_newsgroups")
df = pd.DataFrame(dataset['train'])
# 2. Rename 'label' to 'category'
df = df.rename(columns={'label': 'category'})
# 3. Create the target_names mapping manually from the data
# We sort by category ID (0-19) to make sure the index matches the ID
mapping = df[['category', 'label_text']].drop_duplicates().sort_values('category')
category_names_list = mapping['label_text'].tolist()
# 4. Re-create the Mock object
class MockNewsgroups:
    def __init__(self, target_names):
        self.target_names = target_names

newsgroups_train = MockNewsgroups(category_names_list)
# 5. Final check
print(f"Success! Dataset Size: {df.shape}")
print(f"Number of categories: {len(newsgroups_train.target_names)}")
print(f"Category 0 is: {newsgroups_train.target_names[0]}")
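The mock above only provides target_names. If later cells also read newsgroups_train.data or newsgroups_train.target, a slightly fuller stand-in may help. The following is only a sketch, not the course's official fix: build_newsgroups_like is a hypothetical helper name, and it assumes each row is a dict with 'text', 'label' and 'label_text' keys, as in the columns used above. It returns plain Python lists, not the NumPy array that fetch_20newsgroups gives for target.

```python
from types import SimpleNamespace


def build_newsgroups_like(rows):
    """Build an object with .data, .target and .target_names from row dicts.

    Hypothetical helper. Assumes each row has 'text', 'label' (integer ID)
    and 'label_text' keys, matching the columns used above.
    """
    data, target, names_by_id = [], [], {}
    for row in rows:
        data.append(row["text"])
        target.append(row["label"])
        # Remember the first name seen for each label ID.
        names_by_id.setdefault(row["label"], row["label_text"])
    # Order names by ID so that target_names[i] is the name of label i
    # (this holds only when the IDs are contiguous and start at 0).
    target_names = [names_by_id[i] for i in sorted(names_by_id)]
    return SimpleNamespace(data=data, target=target, target_names=target_names)


# Tiny in-memory example; with the real data, pass dataset["train"] instead.
sample_rows = [
    {"text": "GPU drivers", "label": 1, "label_text": "comp.graphics"},
    {"text": "No gods", "label": 0, "label_text": "alt.atheism"},
    {"text": "Ray tracing", "label": 1, "label_text": "comp.graphics"},
]
sample = build_newsgroups_like(sample_rows)
print(sample.target_names)
```

With the dataset loaded as above, newsgroups_train = build_newsgroups_like(dataset["train"]) should give an object that can be used in place of MockNewsgroups, since iterating over the train split yields one dict per row.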