Hi @Deepti_Prasad
Wow, these are good questions
The link I provided was meant to be simple because the OP mentioned being a newbie. The original paper, The Curse Of Recursion: Training On Generated Data Makes Models Forget, is much richer in detail.
I bet people will come up with ways to tackle this problem (the “pollution” of datasets with AI-generated content).
I suppose you’re joking. But to be on the safe side - yes, never “rely” on the internet, let alone on AI-generated content (which is trained on the internet in the first place). Don’t get me wrong, I’m not saying these tools aren’t useful (they are), but “relying” is a bit of a strong word.
I’m not sure what you mean, but exploratory analysis would also suffer from AI-generated content - the problem of distinguishing what is AI-generated and what is not is hard. AI-generated content can and does shift the underlying distributions.
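To see why recursive training on generated data shifts distributions, here's a toy sketch (my own, not the paper's actual setup - the function name and parameters are made up for illustration): each “generation” fits a Gaussian only to samples drawn from the previous generation's fitted model, and the estimated spread drifts toward zero - the tails get forgotten first, much like the collapse described in the paper.

```python
import random
import statistics

def collapse_demo(n_samples=20, n_generations=1000, seed=0):
    """Toy model collapse: repeatedly refit a Gaussian to samples
    drawn from the previous generation's fitted Gaussian.

    Returns the history of estimated standard deviations; it drifts
    toward zero because each refit loses a little tail mass.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real world" distribution
    history = []
    for _ in range(n_generations):
        # Train only on data generated by the previous model
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(data)   # refit mean on generated data
        sigma = statistics.stdev(data)  # refit spread on generated data
        history.append(sigma)
    return history

hist = collapse_demo()
print(f"estimated std after generation 1:    {hist[0]:.4f}")
print(f"estimated std after generation 1000: {hist[-1]:.4f}")
```

With a small sample per generation the shrinkage compounds quickly; the point is just that nothing in the loop ever pulls the model back toward the original distribution once the real data is gone.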
They mention in the paper that it is unclear how content generated by LLMs can be tracked at scale.
OpenAI gave up on trying to distinguish between AI-written and human-written text on July 20, 2023:
> As of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy. We are working to incorporate feedback and are currently researching more effective provenance techniques for text, and have made a commitment to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated.
So I don’t think there is a solution in the near future.
I agree with your point - I think we (humans) have the advantage of operating in the real world with lots of different stimuli.
I can imagine that being in a coma, deprived of all senses and left only with my own thoughts, would be something like “model collapse”.
Kids need to interact with their environment (toys, varied experiences, and other people, like their mothers, to help with learning); this way they can build a model of the world (and I think one far more sophisticated than current LLMs have). But I guess that in the absence of further experiences, their model would also collapse.
In that regard, Reinforcement Learning is more like that, but it also has its share of problems.
So yeah, I think there are a lot of smart people coming up with different ideas, and hopefully they will come up with something.
Cheers