
AI industry in the midst of a debate over fake data

The world of AI is on the verge of exhausting its most precious resource — and that’s leading industry leaders into a fierce debate over a fast-growing alternative touted as a replacement: synthetic data, or essentially “fake” data.

For years, companies like OpenAI and Google have been mining data from the internet to train the large language models (LLMs) that power their AI tools and features. These LLMs ingest volumes of text, video, and other online media produced by humans over centuries — everything from research papers to novels to YouTube clips.

Today, the supply of “real,” human-generated data is running out. Research firm Epoch AI predicts that text data could run out by 2028. At the same time, companies that have mined every corner of the internet for actionable training data—sometimes breaking their policies to do so—are facing increased restrictions on what’s left.

For some, this isn’t necessarily a problem. Sam Altman, CEO of OpenAI, has argued that AI models should eventually produce synthetic data that’s good enough to train themselves effectively. The appeal is obvious: Training data has become one of the most valuable resources in the rise of AI, and the ability to generate it cheaply and in seemingly infinite ways is tantalizing.

Researchers still debate whether synthetic data is a silver bullet, with some arguing that this path could lead AI models to poison themselves with poor-quality information and "collapse" as a result.

A recent study by a group of researchers from Oxford and Cambridge showed that training a model on AI-generated data causes it to produce gibberish. AI-generated data is not unusable for training, the authors say, but it needs to be balanced with real data.

As human-generated data dries up, more and more companies are turning to synthetic data. In 2021, research firm Gartner predicted that by 2024, 60% of the data used to develop AI would be synthetically generated.

“It’s a crisis,” said Gary Marcus, an artificial intelligence analyst and professor emeritus of psychology and neuroscience at New York University. “People thought you could improve large language models infinitely by just using more and more data, but now they’ve used all the data they can.”

“Yes, it will help you solve some problems, but the deeper problem is that these systems don’t really reason, they don’t really plan,” Marcus added. “All the synthetic data you can imagine is not going to solve this fundamental problem.”

More and more companies are creating synthetic data

The need for “fake” data is based on the idea that real-world data is rapidly running out.

This is partly because tech companies have been rushing to use publicly available data to train AI to outperform their competitors. It is also because online data owners are increasingly wary of companies scraping their data for free.

OpenAI researchers revealed in 2020 how they used free data from Common Crawl, a publicly available archive of web-crawl data containing "nearly a trillion words" from online sources, to train the AI model that would eventually power ChatGPT.

A study published in July by MIT’s Data Provenance Initiative found that websites are now putting restrictions in place to prevent AI companies from using data they don’t own. News publications and other prominent sites are increasingly preventing AI companies from freely copying their data.

To get around this problem, companies like OpenAI and Google are writing checks worth tens of millions of dollars to access data from Reddit and news media, which serve as conduits of fresh data for training models. But even this method has its limits.

“There are no longer major areas of the textual web just waiting to be reclaimed,” Nathan Lambert, a researcher at the Allen Institute for AI, wrote in May.

This is where synthetic data comes in. Rather than being extracted from the real world, synthetic data is generated by AI systems that have been trained on real-world data.

In June, for example, Nvidia released an AI model that can create artificial datasets for training and alignment. In July, researchers at Chinese tech giant Tencent created a synthetic data generator called Persona Hub that does a similar job.

Some startups, like Gretel and SynthLabs, are even being launched with the sole purpose of generating and selling troves of specific types of data to companies that need them.


[Image: A phone featuring Meta's Llama 3 AI model. Anadolu/Getty Images]



Proponents of synthetic data make valid arguments to justify its use. Because it comes from the real world, human-generated data is often messy, leaving researchers with the complex and laborious task of cleaning and labeling it before it can be used.

Synthetic data has the potential to fill gaps that human data can't. In late July, Meta unveiled Llama 3.1, a new series of AI models that generate synthetic data and use it to "fine-tune" training. Meta has used the data to improve performance on specific skills, such as coding in languages like Python, Java, and Rust, as well as solving math problems.

Synthetic training could be particularly effective for small AI models. Last year, Microsoft said it gave OpenAI's models a diverse list of words that a 3- to 4-year-old would know, then asked them to generate short stories using those words. The resulting dataset was used to create a group of small but powerful language models.
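A minimal sketch of that word-list recipe might look like the following. This is not Microsoft's actual pipeline; it assumes the OpenAI Python client, and the vocabulary, prompt wording, and model name are placeholders chosen purely for illustration.

```python
import random
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY in the environment

client = OpenAI()

# Hypothetical slice of a "words a 3- to 4-year-old would know" vocabulary.
VOCABULARY = ["dog", "ball", "rain", "happy", "jump", "red", "moon", "cookie"]

def generate_story(rng: random.Random) -> str:
    """Ask the model for one short story built around a few sampled words."""
    words = rng.sample(VOCABULARY, 3)
    prompt = (
        "Write a three-sentence story that a small child could understand. "
        f"It must use the words: {', '.join(words)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Repeating this at scale yields a synthetic corpus of simple stories, the kind
# of dataset the article says was used to train small language models.
if __name__ == "__main__":
    stories = [generate_story(random.Random(i)) for i in range(3)]
    print("\n\n".join(stories))
```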

Synthetic data can also help effectively counteract biases produced by real-world data. In their 2021 paper, “On the Dangers of Stochastic Parrots,” former Google researchers Timnit Gebru, Margaret Mitchell, and others argued that LLMs trained on massive datasets of text from the internet would likely reflect biases in the data.

In April, a group of researchers from Google DeepMind published a paper advocating the use of synthetic data to address data sparsity and privacy issues in training, adding that ensuring the accuracy and absence of bias in such AI-generated data “remains a critical challenge.”

“The Habsburg AI”

While the AI industry has found some benefits in synthetic data, it faces serious issues that it cannot afford to ignore, such as the fear that synthetic data could completely destroy AI models.

In Meta’s research paper on Llama 3.1, the company said that training the 405 billion parameter version of the latest model “on its own generated data is not useful” and may even “degrade performance.”

A new study published in the journal Nature last month found that “indiscriminate use” of synthetic data in model training can lead to “irreversible flaws.” The researchers called this phenomenon “model collapse” and warned that the problem must be taken seriously “if we are to preserve the benefits of training on large-scale data scraped from the web.”
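The dynamic those researchers describe can be illustrated with a toy experiment (a simplified sketch, not the study's actual setup): fit a simple statistical model to data, sample from it, fit the next model only to those samples, and repeat. With a finite sample budget, the fitted distribution's spread tends to drift toward zero across generations, losing the tails first.

```python
# Toy illustration of recursive training on generated data, not the Nature
# study's setup: each "generation" is a Gaussian fit only to samples drawn
# from the previous generation. The estimated spread drifts toward zero,
# a simple analogue of "model collapse."
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
n_samples = 50         # finite number of examples each generation is trained on

for generation in range(1, 101):
    samples = rng.normal(mu, sigma, n_samples)  # data produced by the current model
    mu, sigma = samples.mean(), samples.std()   # the next model sees only that data
    if generation % 10 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
```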

Jathan Sadowski, a senior researcher at Monash University, coined a term for this idea: Habsburg AI, after the Austrian dynasty that some historians believe self-destructed through inbreeding. Since coining the term, Sadowski told BI he feels validated by the research that supports his claim that models heavily trained on AI output can mutate.

“The question for researchers and companies developing AI systems is how much synthetic data is too much,” Sadowski said. “They need to find every solution they can to overcome the challenges of data scarcity for AI systems, even if those solutions are short-term fixes that could do more harm than good by creating poor-quality systems.”

However, results from a paper published in April showed that models trained on their own generated data are not necessarily doomed to “collapse” if they are trained with both “real” and synthetic data. Now, some companies are betting on a future of “hybrid data,” where synthetic data is generated using real data in an effort to keep the model from going off the rails.

Scale AI, which helps companies label and test data, said it is exploring "the direction of hybrid data," using both synthetic and non-synthetic data (Scale AI CEO Alexandr Wang recently said, "Hybrid data is the real future").
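As a rough sketch of what such mixing can look like in practice, a training set can be assembled so that synthetic examples never crowd out real ones. The function name, the 70/30 split, and the placeholder corpora below are assumptions for illustration, not Scale AI's or any company's actual recipe.

```python
import random

def build_hybrid_dataset(real_examples, synthetic_examples,
                         real_fraction=0.7, total_size=10_000, seed=0):
    """Assemble a training set that mixes real and synthetic examples at a
    fixed ratio, so the model is never trained on synthetic data alone."""
    rng = random.Random(seed)
    n_real = round(total_size * real_fraction)
    n_synthetic = total_size - n_real
    mixed = (rng.choices(real_examples, k=n_real)
             + rng.choices(synthetic_examples, k=n_synthetic))
    rng.shuffle(mixed)
    return mixed

# Example: roughly 7,000 human-written examples for every 3,000 generated ones.
training_set = build_hybrid_dataset(
    real_examples=["human-written example"],         # placeholder corpus
    synthetic_examples=["model-generated example"],  # placeholder corpus
    real_fraction=0.7,
    total_size=10_000,
)
```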

Looking for other solutions

AI may require entirely new approaches, as simply feeding more data into models may not be enough.

A group of researchers at Google DeepMind may have proven the merits of another approach in January when the company announced AlphaGeometry, an AI system capable of solving Olympiad-level geometry problems.

In a companion paper, the researchers explain how AlphaGeometry uses a "neuro-symbolic" approach, which combines the strengths of different AI methods, falling somewhere between data-intensive deep learning models and rule-based logical reasoning. IBM's research group has described neuro-symbolic AI as "a path toward artificial general intelligence."

Notably, AlphaGeometry was pre-trained on entirely synthetic data.

The neuro-symbolic field of AI is still relatively young, and it remains to be seen whether it will drive AI forward.

Given the pressures companies like OpenAI, Google, and Microsoft face to turn AI hype into profits, we can expect them to try every possible solution to the data crisis.

"We're going to be stuck here again if we don't adopt new approaches," Marcus said.
