Companies collecting data for LLMs

Kashef · August 26, 2024, 5:00am

Hi, I wanted to start a discussion about a disturbing topic: even though some people in the AI field are aware of it, I don't see much concern about it.

It's about how big companies such as OpenAI and Meta collect data for LLMs.
We know that building such strong models requires huge amounts of data, and one way to get it is large-scale web scraping. We are not talking about thousands of websites, but millions, possibly even billions.
When you are scraping millions or billions of websites, it doesn't seem feasible to check each site's TOU (Terms of Use) and stay within the boundaries it sets, so in practice a lot of TOUs will not be respected!
I also saw a New York Times article on the subject (https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html). Among other things, it describes how Google changed its terms so it could collect data for its AI models. I double-checked this (and I encourage you to do the same), so now even policies are being bent to satisfy the need for data.
I am not asking from a legal perspective but from an ethical one: is that okay? And if all of that is true, can we just use their models and fine-tune them as if nothing wrong had happened?
Also, if you think I am wrong, please correct me. And if you know of ways to create an LLM as strong as Llama 3, ChatGPT, etc. without crossing the line when collecting data, I would love to hear how!
