TTT models could be the next frontier of generative AI

After years of dominance by the form of AI known as the transformer, the hunt is on for new architectures.

Transformers are the foundation of OpenAI’s Sora video generation model and are at the heart of text generation models like Anthropic’s Claude, Google’s Gemini, and GPT-4o. But they’re starting to run into technical hurdles, particularly those related to computation.

Transformers are not particularly efficient at processing and analyzing large amounts of data, at least when running on commodity hardware. This leads to steep and perhaps unsustainable increases in energy demand as companies build and expand infrastructure to meet transformers’ needs.

One promising architecture proposed this month is test-time training (TTT), developed over the course of a year and a half by researchers at Stanford, UC San Diego, UC Berkeley, and Meta. The research team says that TTT models can not only handle far more data than transformers, but can also do so without consuming nearly as much computing power.

The Hidden State of Transformers

A fundamental element of transformers is the “hidden state,” which is essentially a long list of data. When a transformer processes something, it adds inputs to the hidden state to “remember” what it just processed. For example, if the model is working on a book, the values in the hidden state will be things like representations of words (or parts of words).

“If you think of a transformer as an intelligent entity, then the lookup table—its hidden state—is the brain of the transformer,” Yu Sun, a Stanford postdoc and co-contributor to the TTT research, told TechCrunch. “This specialized brain enables the well-known capabilities of transformers, such as context-based learning.”

The hidden state is part of what makes transformers so powerful. But it also hampers them. To “say” even a single word about a book a transformer has just read, the model would have to scan its entire lookup table—a task as computationally intensive as rereading the entire book.
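To make that concrete, here is a minimal, illustrative Python sketch (the dimensions and the toy attention step are placeholders, not how any production transformer is implemented): the hidden state is a cache that grows with every token read, and producing each new output means scanning that entire cache.

```python
import numpy as np

d = 64                     # vector size per token (illustrative)
cache = []                 # the "lookup table": one entry per token processed

def process_token(token_vec):
    """Append the token's representation; the hidden state grows by one entry."""
    cache.append(token_vec)

def attend(query_vec):
    """Producing the next output means scanning the entire cache: O(len(cache)) work."""
    keys = np.stack(cache)                 # (n_tokens, d)
    scores = keys @ query_vec              # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys                  # weighted summary of everything read so far

rng = np.random.default_rng(0)
for _ in range(1000):                      # "reading" 1,000 tokens
    process_token(rng.standard_normal(d))

print(len(cache))                          # 1000 -- the memory keeps growing
summary = attend(rng.standard_normal(d))   # cost scales with everything read so far
```

The longer the book, the bigger the cache and the more expensive every single lookup becomes.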

Sun and his team then came up with the idea of replacing the hidden state with a machine learning model — like nested AI dolls, if you will, a model within a model.

This is a bit technical, but the gist is that the internal machine learning model of the TTT model, unlike a transformer’s lookup table, does not grow as it processes additional data. Instead, it encodes the data it processes into representative variables called weights, which makes TTT models very performant. No matter how much data a TTT model processes, the size of its internal model will not change.
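For illustration only, here is a rough Python sketch of that idea (a toy linear inner model with a made-up learning rate, not the researchers’ actual implementation): the entire “memory” is a fixed-size weight matrix, and each new token is folded into it with one small training step on a simple reconstruction loss.

```python
import numpy as np

d = 64                      # token vector size (illustrative)
lr = 0.01                   # made-up learning rate for the inner model
W = np.zeros((d, d))        # the entire "memory": its size never changes

def update(x):
    """Fold one token into the weights via a gradient step on 0.5 * ||W @ x - x||^2."""
    global W
    err = W @ x - x                 # reconstruction error on this token
    W -= lr * np.outer(err, x)      # gradient of the loss with respect to W

def read(x):
    """Query the compressed memory with a new token."""
    return W @ x

rng = np.random.default_rng(0)
for _ in range(100_000):            # stream 100,000 tokens through the layer...
    update(rng.standard_normal(d))

print(W.shape)                      # (64, 64) -- ...and the memory is still the same size
```

The contrast with the previous sketch is the point: the transformer’s cache grows with every token, while this inner model stays the same size no matter how long the stream gets.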

Sun estimates that future TTT models could efficiently process billions of data items, from words to images to audio and video recordings. This far exceeds the capabilities of current models.

“Our system can say X words about a book without the computational complexity of rereading the book X times,” Sun said. “Large transformer-based video models, like Sora, can only process 10 seconds of video because they only have a lookup table ‘brain.’ Our ultimate goal is to develop a system that can process a long video that resembles the visual experience of a human lifetime.”

Skepticism around TTT models

Will TTT models ever replace transformers? It’s possible. But it’s too early to say for sure.

For now, TTT models are not a drop-in replacement for transformers. And the researchers developed only two small models for the study, making the TTT method difficult to compare with the larger transformer implementations that exist today.

“I think it’s a very interesting innovation, and if the data supports the claims that it delivers efficiencies, that’s great news, but I can’t tell you whether it’s better than existing architectures or not,” said Mike Cook, a senior lecturer in the department of computer science at King’s College London, who was not involved in the TTT research. “A former professor of mine used to tell a joke when I was an undergraduate: How do you solve a problem in computer science? Add another layer of abstraction. Adding a neural network inside a neural network certainly reminds me of that.”

Regardless, the accelerating pace of research into alternatives to transformers indicates a growing recognition of the need for a breakthrough.

This week, AI startup Mistral released Codestral Mamba, a model based on another transformer alternative called the state-space model (SSM). SSMs, like TTT models, appear to be more computationally efficient than transformers and can scale to larger volumes of data.
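As a rough illustration of why such models scale well (a toy, single-layer recurrence with random placeholder matrices, not Codestral Mamba’s actual architecture), an SSM keeps a fixed-size state and does a constant amount of work per token:

```python
import numpy as np

d_state, d_in = 16, 64
rng = np.random.default_rng(0)
A = 0.01 * rng.standard_normal((d_state, d_state))   # state transition (placeholder)
B = 0.01 * rng.standard_normal((d_state, d_in))      # input projection (placeholder)
C = 0.01 * rng.standard_normal((d_in, d_state))      # output projection (placeholder)

h = np.zeros(d_state)       # fixed-size state, playing a role similar to TTT's inner weights

def step(x):
    """h_t = A @ h_{t-1} + B @ x_t;  y_t = C @ h_t  -- constant work per token."""
    global h
    h = A @ h + B @ x
    return C @ h

for _ in range(10_000):                  # the sequence keeps getting longer...
    y = step(rng.standard_normal(d_in))  # ...but per-token cost and state size do not
```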

AI21 Labs is also exploring SSMs, as is Cartesia, which pioneered some of the earliest SSMs as well as Codestral Mamba’s namesakes, Mamba and Mamba-2.

If these efforts succeed, they could make generative AI even more accessible and widespread than it is now — for better or worse.

News Source: techcrunch.com