Companies collecting data for LLMs

Hi, I wanted to start a discussion about a disturbing topic. Some people in the AI field are aware of it, but I don’t see much concern about it.

It’s about how big companies like OpenAI, Meta, etc. collect data for LLMs.
We know that building such strong models requires a huge amount of data, and one way to get it is large-scale web scraping. We are not talking about thousands of websites, but millions, possibly even billions.
When you are scraping millions or billions of websites, it doesn’t seem feasible to check the Terms of Use (TOU) of each one and stay within the boundaries it sets. So basically, a lot of TOUs are going to be ignored!
Also, I saw a New York Times article on the subject (https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html). Among other things, it describes how Google changed its terms to be able to collect data for its AI models. I double-checked this (and I encourage you to do the same). So now even policies are being bent to satisfy the need for data.
I am not asking from a legal perspective but from an ethical one: is that okay? And even if that’s true, can we just use their models and fine-tune them as if nothing wrong had happened?
Also, if you think I am wrong, please correct me. And if you know of ways to create an LLM as strong as Llama 3, ChatGPT, etc. without crossing the line when collecting data, I would love to hear how!