AI training data comes at a price only big tech can afford

Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive, putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on. In it, Betker argued that training data – not the design, architecture or any other characteristics of a model – was the key to increasingly sophisticated and capable AI systems.

“Trained on the same data set for long enough, almost all models converge to the same point,” Betker wrote.

Is Betker right? Is training data the primary determinant of what a model can do, whether it’s answering a question, drawing human hands, or generating a realistic cityscape?

It’s certainly plausible.

Statistical machines

Generative AI systems are essentially probabilistic models – a huge pile of statistics. They guess, based on a large number of examples, which data makes the most “sense” to place where (for example, the word “go” before “to the market” in the sentence “I go to the market”). So it seems intuitive that the more examples a model has to go on, the better the performance of models trained on those examples.
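To make the “pile of statistics” idea concrete, here is a minimal Python sketch of a toy next-word predictor. It is an illustration only: real generative models are vastly larger and work very differently under the hood, and the function names (`train_bigram_counts`, `most_likely_next`) and the three-sentence corpus are invented for this example. It simply counts which word follows which in the examples it has seen, then guesses the most frequent follower.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; real training sets contain billions of examples.
corpus = [
    "I go to the market",
    "I go to the park",
    "We go to the market",
]
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "I"))    # go
print(most_likely_next(counts, "the"))  # market
```

With only three sentences, the toy model has nothing to say about any word it has not seen, which is the intuition behind feeding models more examples.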

“It seems like the performance gains are coming from the data,” Kyle Lo, a senior applied researcher at the Allen Institute for AI (AI2), an AI research nonprofit, told TechCrunch, “at least once you have a stable training setup.”

Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year, which outperforms AI2’s OLMo model despite being very similar architecturally. Llama 3 was trained on significantly more data than OLMo, which Lo says explains its superiority on many popular AI benchmarks.

(I will point out here that the benchmarks widely used in the AI industry today are not necessarily the best indicator of a model’s performance, but outside of qualitative tests like ours, they are one of the few measurements available to us.)

This is not to say that training on exponentially larger data sets is a surefire path to exponentially better models. Models operate on a “garbage in, garbage out” paradigm, Lo notes, so data curation and quality matter a great deal, perhaps more than sheer quantity.

“It is possible that a small model with carefully designed data will outperform a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd on the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed immensely to the improved image quality of DALL-E 3, OpenAI’s text-to-image model, compared to its predecessor DALL-E 2. “I think this is the main source of improvements,” he said. “The text annotations are much better than they were [with DALL-E 2] – it’s not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to associate those labels with other observed characteristics of that data. For example, a model that has been fed many photos of cats with annotations for each breed will eventually “learn” to associate terms like “bobtail” and “shorthair” with their distinctive visual features.

Bad behavior

Experts like Lo worry that the growing emphasis on large, high-quality training data sets will centralize AI development among the few players with billion-dollar budgets who can afford to acquire these sets. Major innovations in synthetic data or fundamental architecture could disrupt the status quo, but none appear to be on the horizon.

“Overall, entities governing content potentially useful for AI development have an incentive to lock down their materials,” Lo said. “And as access to data tightens, we’re essentially blessing a few early movers on data acquisition and pulling up the ladder so that no one else can access the data to catch up.”

Indeed, where the race to collect more training data has not led to unethical (and perhaps even illegal) behavior, such as the covert aggregation of copyright-protected content, it has rewarded tech giants with deep pockets to spend on data licenses.

Generative AI models like those from OpenAI are trained primarily on images, text, audio, videos, and other data – some copyrighted – from public web pages (including, problematically, those generated by AI). The OpenAIs of the world say fair use protects them from legal retaliation. Many rights holders disagree – but, at least for now, there is little they can do to prevent the practice.

There are many, many examples of generative AI vendors acquiring massive data sets through dubious means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube’s blessing – or the creators’ blessing – to power its flagship GPT-4 model. Google recently expanded its terms of service, in part to be able to leverage public Google Docs, restaurant reviews on Google Maps, and other online materials for its AI products. And Meta reportedly considered risking legal action to train its models on IP-protected content.

Meanwhile, companies large and small rely on workers in third-world countries paid just a few dollars an hour to create annotations for training sets. Some of these annotators – employed by massive startups like Scale AI – work for days on end to complete tasks that expose them to graphic depictions of violence and bloodshed, without any benefits or guarantees of future gigs.

Increasing cost

In other words, even the most honest data deals don’t exactly foster an open and fair generative AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries and more to train its AI models – a budget far greater than that of most academic research groups, non-profit organizations and startups. Meta went so far as to consider acquiring publisher Simon & Schuster for the rights to e-book excerpts (ultimately, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).

With the AI training data market expected to grow from about $2.5 billion today to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar – in some cases over the objections of their user bases.

Stock media library Shutterstock has signed deals with AI providers ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations such as Google and OpenAI. Few platforms with abundant data accumulated organically over the years appear not to have signed deals with generative AI developers, from Photobucket to Tumblr to the Q&A site Stack Overflow.

It is the platforms’ data to sell – at least depending on which legal arguments you believe. But in most cases, users do not see a cent of the profits. And it is harming the broader AI research community.

“Small players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I fear this will lead to a lack of independent review of AI development practices.”

Independent efforts

If there’s a ray of sunshine in the darkness, it’s the few independent, nonprofit efforts aimed at creating massive data sets that anyone can use to train a generative AI model.

EleutherAI, a non-profit research group that started as a Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a collection of billions of text passages sourced primarily from the public domain.

In April, AI startup Hugging Face released FineWeb, a filtered version of Common Crawl – the eponymous dataset maintained by the nonprofit Common Crawl, consisting of billions and billions of web pages – which, according to Hugging Face, improves model performance on many benchmarks.

Some efforts to release open training datasets, like the group LAION’s image sets, have run into copyright, data privacy, and other equally serious ethical and legal challenges. But some of the most dedicated data curators are committed to doing better. The Pile v2, for example, removes problematic copyrighted material found in its precursor dataset, The Pile.

The question is whether any of these open efforts can hope to keep pace with Big Tech. As long as data collection and curation remain a matter of resources, the answer is probably no – at least not until a research breakthrough levels the playing field.

