
Researchers disrupt AI status quo by eliminating matrix multiplication in LLMs

Illustration of a brain inside a light bulb.

Researchers say they have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. The approach fundamentally rethinks the neural network operations that GPU chips currently accelerate. The findings, detailed in a recent preprint paper by researchers at the University of California, Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have profound implications for the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural network computing tasks today, and GPUs are particularly good at it because they can perform large numbers of multiplication operations in parallel. That capability momentarily made Nvidia the most valuable company in the world last week; the company currently holds an estimated 98 percent market share for data center GPUs, which are commonly used to power AI systems such as ChatGPT and Google Gemini.
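To make MatMul's role concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) showing that a single dense layer's forward pass is one matrix multiplication; the layer sizes are arbitrary stand-ins:

```python
import numpy as np

# Generic illustration (not from the paper): a dense layer's forward
# pass, y = W @ x, is a single matrix multiplication. The dimensions
# below are arbitrary stand-ins for a real model's layer sizes.
hidden_dim, input_dim = 4096, 4096
W = np.random.randn(hidden_dim, input_dim).astype(np.float32)  # weight matrix
x = np.random.randn(input_dim).astype(np.float32)              # input activations

y = W @ x  # ~16.8 million multiply-accumulate operations for this one layer

# A GPU is fast here because the dot products behind each of the 4,096
# outputs are independent and can all be computed in parallel.
print(y.shape)  # (4096,)
```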

In the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that delivers performance similar to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip using approximately 13 watts of power (not counting the GPU's power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

The paper does not provide power estimates for conventional LLMs, but this UC Santa Cruz article estimates about 700 watts for a conventional model. In our experience, however, you can run a 2.7B-parameter version of Llama 2 capably on a home PC with an RTX 3060 (which draws around 200 watts at peak) powered by a 500-watt power supply. So if you could theoretically run an LLM entirely on an FPGA at just 13 watts (with no GPU), that would be roughly a 38-fold decrease in power consumption (500 W ÷ 13 W ≈ 38).

The technique has not yet been peer-reviewed, but the researchers (Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian) say their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building well-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.

Removing matrix math

In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
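To illustrate why ternary weights matter, here is a rough sketch of the idea (our simplification, not the exact quantization recipe from BitNet or the new paper): with each weight constrained to -1, 0, or +1, every “multiplication” collapses into an addition, a subtraction, or a skip.

```python
import numpy as np

# Rough sketch of the ternary-weight idea behind BitNet-style layers
# (a simplification, not either paper's exact quantization recipe).
def ternarize(W, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1}."""
    Wt = np.zeros_like(W, dtype=np.int8)
    Wt[W > threshold] = 1
    Wt[W < -threshold] = -1
    return Wt

def ternary_matvec(Wt, x):
    """Matrix-vector product using only additions and subtractions."""
    y = np.zeros(Wt.shape[0], dtype=x.dtype)
    for i in range(Wt.shape[0]):
        # Add activations where the weight is +1, subtract where it is -1,
        # and skip where it is 0 -- no multiplications anywhere.
        y[i] = x[Wt[i] == 1].sum() - x[Wt[i] == -1].sum()
    return y

W = np.random.randn(8, 16).astype(np.float32)
x = np.random.randn(16).astype(np.float32)
print(ternary_matvec(ternarize(W), x))
```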

However, they note that BitNet still relies on matrix multiplications in its self-attention mechanism. That limitation motivated the current study, prompting the researchers to develop a completely “MatMul-free” architecture that maintains performance while eliminating matrix multiplications even from the attention mechanism.
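The details of the authors' replacement for attention are in the paper; as a hedged illustration of the general idea, the sketch below shows how a gated recurrence built only from element-wise operations can mix information across a token sequence without ever forming an attention matrix. This is a generic recurrence of our own, not the authors' exact architecture, and in the real model the per-token gate signals would come from ternary-weight layers rather than being supplied directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elementwise_token_mixer(x, f_gate, i_gate):
    """x, f_gate, i_gate: (seq_len, dim) per-token signals (here supplied
    directly; in a real model they would come from ternary-weight layers).
    Mixes information across time using only element-wise operations."""
    seq_len, dim = x.shape
    h = np.zeros(dim, dtype=x.dtype)
    out = np.empty_like(x)
    for t in range(seq_len):
        f = sigmoid(f_gate[t])           # forget gate, element-wise
        i = sigmoid(i_gate[t])           # input gate, element-wise
        h = f * h + i * np.tanh(x[t])    # element-wise state update
        out[t] = h                       # no token-by-token attention matrix
    return out

seq_len, dim = 6, 8
x = np.random.randn(seq_len, dim).astype(np.float32)
f = np.random.randn(seq_len, dim).astype(np.float32)
i = np.random.randn(seq_len, dim).astype(np.float32)
print(elementwise_token_mixer(x, f, i).shape)  # (6, 8)
```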
