Software engineers develop a way to run AI language models without matrix multiplication

Figure: Overview of the MatMul-free LM. The sequence of operations is shown for vanilla self-attention (top left), the MatMul-free token mixer (top right), and ternary accumulation. The MatMul-free LM uses a MatMul-free token mixer (MLGRU) and a MatMul-free channel mixer (MatMul-free GLU) to retain a transformer-like architecture while reducing computational cost. Credit: arXiv (2024). DOI: 10.48550/arxiv.2406.02528

A team of software engineers from the University of California, working with a colleague from Soochow University and another from LuxiTech, has developed a way to run AI language models without resorting to matrix multiplication. The team published a paper on the arXiv preprint server describing the new approach and its effectiveness in testing.

LLMs like ChatGPT have grown in power, and so have the computing resources they require. A core part of running an LLM is matrix multiplication (MatMul), in which input data is combined with learned weights in the neural network to produce the best possible answers to queries.
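Concretely, the MatMul in question is the weight-times-activation product inside every dense layer of the network. A toy NumPy sketch (sizes and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

# Toy dense layer: a MatMul combines every input with learned weights.
# Sizes are illustrative; real LLM layers are thousands of units wide.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # weight matrix: 5 inputs -> 3 outputs
x = rng.normal(size=5)        # one activation vector
y = W @ x                     # the MatMul that dominates LLM inference
```

Each of the three outputs is a dot product over all five inputs, which is why these products dominate an LLM's compute budget.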

Early on, AI researchers discovered that graphics processing units (GPUs) were ideally suited to neural network workloads because they can run many operations simultaneously – in this case, many MatMuls. But today, even with huge GPU clusters, MatMuls have become a bottleneck as LLMs grow in size and in the number of people using them.

In this new study, the research team claims to have developed a way to run AI language models without the need to perform MatMuls – and to do it just as efficiently.

To achieve this feat, the research team took a new approach to how data is weighted: they replaced the 16-bit floating-point weights used today with ternary weights drawn from {-1, 0, 1}, along with new operations that fill the same role as the old ones.
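As a rough sketch of the idea (the quantization scheme and function names below are illustrative assumptions, not the authors' exact recipe), a matrix-vector product over ternary weights reduces to additions and subtractions, with no multiplications:

```python
import numpy as np

def ternary_quantize(w):
    """Map float weights to {-1, 0, 1} plus one per-tensor scale
    (an illustrative scheme, not the paper's exact method)."""
    scale = np.abs(w).mean() + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def ternary_matvec(w_t, x, scale):
    """With weights restricted to {-1, 0, 1}, each output element is
    just a signed sum of inputs: no multiplications are needed."""
    out = np.empty(w_t.shape[0])
    for i in range(w_t.shape[0]):
        out[i] = x[w_t[i] == 1].sum() - x[w_t[i] == -1].sum()
    return scale * out

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
x = rng.normal(size=8)
w_t, s = ternary_quantize(w)
y_ternary = ternary_matvec(w_t, x, s)   # additions/subtractions only
y_exact = w @ x                         # dense baseline it approximates
```

The single scale factor per tensor is the only multiplication left, which is why hardware can execute such layers far more cheaply than full floating-point MatMuls.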

They also developed new quantization techniques that helped maintain performance. With fewer bits per weight, less processing is required, which lowers the computing power needed. They also radically changed how the model itself is structured, using what they describe as a MatMul-free linear gated recurrent unit (MLGRU) in place of traditional transformer blocks.
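To give a flavor of that design (a minimal sketch; the gating formula and variable names are assumptions, not the authors' implementation), a gated linear recurrent step keeps the recurrence purely element-wise, so combined with ternary input projections no MatMul is required anywhere:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlgru_step(h, x, Wf_t, Wc_t, sf, sc):
    """One illustrative MLGRU-style step. The input projections use
    ternary weight matrices (so `@` here reduces to signed additions),
    and the recurrence over the hidden state h is element-wise."""
    f = sigmoid(sf * (Wf_t @ x))     # forget gate
    c = sc * (Wc_t @ x)              # candidate state
    return f * h + (1.0 - f) * c     # element-wise state update

# Tiny example with hand-picked ternary matrices.
Wf_t = np.array([[1.0, -1.0], [0.0, 1.0]])
Wc_t = np.array([[1.0, 0.0], [1.0, 1.0]])
h1 = mlgru_step(np.zeros(2), np.array([1.0, 2.0]), Wf_t, Wc_t, 1.0, 1.0)
```

Because the hidden state is updated with element-wise products rather than a recurrent weight matrix, the step avoids the attention MatMuls of a standard transformer block.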

In testing, the researchers found that a system built on their approach achieved performance comparable to current state-of-the-art systems, while consuming significantly less computing power and electricity than traditional systems typically do.

More information:
Rui-Jie Zhu et al, Scalable MatMul-free Language Modeling, arXiv (2024). DOI: 10.48550/arxiv.2406.02528

© 2024 Science X Network

Quote: Software engineers develop way to run AI language models without matrix multiplication (June 26, 2024), retrieved June 26, 2024

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
