Newbie Seeking Advice on AI Training Dataset Collection

Hey everyone,

I’m building a storytelling AI using a decoder-only Transformer model, but I have very little experience with datasets and could really use some guidance!
1. What are the fundamental datasets for NLP? (I know about WordNet, ConceptNet, and vocabulary datasets, but is there a baseline dataset for NLP chatbots, and where can I get it?)
2. Using Kaggle datasets: I already have the API set up, but I’m struggling to find the direct URL for downloading datasets. Any tips?
3. Dataset size: If I’m training a small Transformer (~1GB in size), what dataset scale should I aim for?
4. Collecting my own data: If I scrape text from sources like Wikipedia, clean it up, and use it—will that work? What legal or technical concerns should I be aware of?
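
To make question 3 concrete, here is a rough back-of-envelope sketch of what "~1GB" could mean in parameters and training tokens. Everything in it is an assumption rather than a settled answer: it assumes the 1GB refers to weights stored as 32-bit floats (4 bytes per parameter), and it uses the commonly quoted rule of thumb of about 20 training tokens per parameter. The function names are made up for illustration.

```python
# Back-of-envelope estimate only; all constants below are assumptions.

def params_from_model_bytes(size_bytes: int, bytes_per_param: int = 4) -> int:
    """Estimate parameter count from weight size (assumes fp32 = 4 bytes each)."""
    return size_bytes // bytes_per_param


def target_training_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Estimate training tokens using an assumed tokens-per-parameter heuristic."""
    return n_params * tokens_per_param


model_bytes = 10**9  # "~1GB", taken here as 1e9 bytes
n_params = params_from_model_bytes(model_bytes)
n_tokens = target_training_tokens(n_params)

print(f"approx. parameters:      {n_params:,}")
print(f"approx. training tokens: {n_tokens:,}")
```

Under these assumptions the model has about 250 million parameters and the heuristic suggests about 5 billion training tokens. If the weights were stored in 16-bit precision instead, the parameter count (and so the token target) would double, so the answer depends heavily on what "1GB" actually measures.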

I know some of these questions might be dumb :laughing: (I haven’t taken a data science course yet, but I’m planning to!). Any advice would be super helpful. Thanks a lot! :slight_smile: