Hi everyone, I want to fine-tune a large language model for question answering about the otaku universe. This culture isn't really present in Mali, so I'm pretty sure it will be impossible for me to get my QA pairs here. I'm also thinking about how to collect the data with minimal effort and cleaning, so I thought of creating a Google spreadsheet to collect a lot of QA pairs in the simplest way, but I need otakus to fill it out. Is anyone here interested in helping me collect this data?
And if someone has a better idea for data collection than using a shared spreadsheet, please let me know!
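For anyone wondering how little cleaning a shared spreadsheet might need, here is a minimal sketch that assumes the sheet is exported as CSV with two columns named `question` and `answer`. Those column names, the `clean_qa_rows` helper, and the sample rows are all made up for illustration; nothing here comes from the actual sheet.

```python
import csv
import io


def clean_qa_rows(csv_text):
    """Return deduplicated (question, answer) pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    pairs = []
    for row in reader:
        # Assumed column names: "question" and "answer".
        # Collapse runs of whitespace and trim both ends.
        question = " ".join((row.get("question") or "").split())
        answer = " ".join((row.get("answer") or "").split())
        if not question or not answer:
            continue  # skip rows with a missing question or answer
        key = question.lower()
        if key in seen:
            continue  # skip questions already seen (case-insensitive)
        seen.add(key)
        pairs.append((question, answer))
    return pairs


# Made-up sample rows standing in for a spreadsheet export.
sample = (
    "question,answer\n"
    "Who wrote One Piece?,Eiichiro Oda\n"
    "  who wrote   one piece? ,Oda\n"
    "Empty answer?,\n"
)
print(clean_qa_rows(sample))
```

This keeps the first answer for each question and drops incomplete rows, so contributors can type freely without following strict formatting rules.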
Have you seen this?
That's creative
Sure, my first idea was to search for a dataset on Kaggle, but it turns out those are mainly datasets for recommender systems. Nothing about text or QA, nothing to train an LLM on.
Yes, but I’m struggling to collect data!
Why can’t you turn details about the dataset into a QA dataset?
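To make the suggestion concrete, here is a rough sketch of template-based conversion, assuming each dataset row has fields like `title`, `genres`, and `episodes`. Those field names, the `record_to_qa` helper, and the example record are hypothetical; real datasets will have different columns.

```python
def record_to_qa(record):
    """Turn one metadata record into template-based QA pairs."""
    title = record["title"]
    pairs = []
    # Only emit a question when the corresponding field is present.
    if record.get("genres"):
        pairs.append(
            (f"What genres does {title} belong to?", ", ".join(record["genres"]))
        )
    if record.get("episodes") is not None:
        pairs.append(
            (f"How many episodes does {title} have?", str(record["episodes"]))
        )
    return pairs


# Placeholder record, not taken from any real dataset.
example = {"title": "Example Show", "genres": ["Action", "Comedy"], "episodes": 12}
for question, answer in record_to_qa(example):
    print(question, "->", answer)
```

Pairs produced this way are only as varied as the templates, so they will not sound like questions written by real fans.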
It would require too much time, and I wanted to get QA pairs from real otakus, because they are the people the model would be trained for.
How about the following approaches for generating Q&A pairs:
- Provide few-shot examples to an LLM and have it generate responses for new content.
- Use a crowdsourcing / freelance platform.
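For the first approach, a minimal sketch of assembling a few-shot prompt could look like the following. The prompt wording, the `build_few_shot_prompt` helper, and the example passages are assumptions for illustration; the call to an actual LLM is left out because it depends on the provider you choose.

```python
def build_few_shot_prompt(examples, passage):
    """Build a few-shot prompt ending where the model should continue."""
    parts = ["Write one question and answer about each passage."]
    for ex in examples:
        # Each worked example shows the model the expected format.
        parts.append(
            f"Passage: {ex['passage']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}"
        )
    # The new passage is left open so the model writes the question and answer.
    parts.append(f"Passage: {passage}\nQuestion:")
    return "\n\n".join(parts)


# Placeholder examples, not real data.
examples = [
    {
        "passage": "Example Show follows a young chef in a floating city.",
        "question": "Who is the main character of Example Show?",
        "answer": "A young chef.",
    },
]
prompt = build_few_shot_prompt(examples, "Another Show has 24 episodes.")
print(prompt)
```

The resulting string would then be sent to whichever model you use, and the generated question and answer parsed from its reply.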
I already thought about the first one, but I didn't want my data to contain that kind of pattern. I want data from real manga/anime fans, which is more likely to be unbiased, pertinent, and high quality.
And for your second idea, do you mean delegating the task to freelancers that I would pay to collect the data?
Yes. Have you heard of Amazon Mechanical Turk (MTurk)?
No, I had never heard of MTurk before you mentioned it. Looks interesting! Thank you!
Seems interesting. I might add something if I come across anything helpful!
Hi, I am interested in helping you out.
Thank you for your interest, and sorry for the late answer. Here is the Google Form I'm using to collect data. I appreciate your help! Thanks