Explainer
7 minute read

Why a decades-old architecture decision is impeding the power of AI computing

Most computers are based on the von Neumann architecture, which separates compute and memory. This arrangement has been perfect for conventional computing, but it creates a data traffic jam in AI computing.

A visual representation of the von Neumann bottleneck shows various points of data flowing through a constricted channel

AI computing has a reputation for consuming epic quantities of energy. This is partly because of the sheer volume of data being handled. Training often requires billions or trillions of pieces of information to create a model with billions of parameters. But that’s not the whole reason — it also comes down to how most computer chips are built.

Modern computer processors are quite efficient at performing the discrete computations they’re usually tasked with. Their efficiency nosedives when they must wait for data to move back and forth between memory and compute, but they’re designed to hide that wait by quickly switching over to some unrelated task. For AI computing, though, almost all the tasks are interrelated, so there often isn’t much other work that can be done while the processor sits waiting, said IBM Research scientist Geoffrey Burr.

In that scenario, processors hit what is called the von Neumann bottleneck, the lag that happens when data moves slower than computation. It’s the result of von Neumann architecture, found in almost every processor over the last six decades, wherein a processor’s memory and computing units are separate, connected by a bus. This setup has advantages, including flexibility, adaptability to varying workloads, and the ability to easily scale systems and upgrade components. That makes this architecture great for conventional computing, and it won’t be going away any time soon.

But for AI computing, whose operations are simple, numerous, and highly predictable, a conventional processor ends up working below its full capacity while it waits for model weights to be shuttled back and forth from memory. Scientists and engineers at IBM Research are working on new processors, like the AIU family, which use various strategies to break down the von Neumann bottleneck and supercharge AI computing.

Why does the von Neumann bottleneck exist?

The von Neumann bottleneck is named for mathematician and physicist John von Neumann, who first circulated a draft of his idea for a stored-program computer in 1945. In that paper, he described a computer with a processing unit, a control unit, memory that stored data and instructions, external storage, and input/output mechanisms. His description didn’t name any specific hardware, likely to avoid security clearance issues with the US Army, for whom he was consulting. Almost no scientific discovery is made by one individual, though, and von Neumann architecture is no exception. Von Neumann’s work drew on that of J. Presper Eckert and John Mauchly, who invented the Electronic Numerical Integrator and Computer (ENIAC), the first programmable, general-purpose electronic digital computer. In the time since that paper was written, von Neumann architecture has become the norm.

“The von Neumann architecture is quite flexible, that’s the main benefit,” said IBM Research scientist Manuel Le Gallo-Bourdeau. “That’s why it was first adopted, and that’s why it’s still the prominent architecture today.”

Discrete memory and computing units mean you can design them separately and configure them more or less any way you want. Historically, this has made it easier to design computing systems because the best components can be selected and paired, based on the application.

Even the cache memory, which is integrated into a single chip with the processor, can still be individually upgraded. “I’m sure there are implications for the processor when you make a new cache memory design, but it’s not as difficult as if they were coupled together,” Le Gallo-Bourdeau said. “They’re still separate. It allows some freedom in designing the cache separately from the processor.”

How the von Neumann bottleneck reduces efficiency

For AI computing, the von Neumann bottleneck creates a twofold efficiency problem: the number of model parameters (or weights) to move, and how far they need to move. More model weights mean larger storage, which usually means more distant storage, said IBM Research scientist Hsinyu (Sidney) Tsai. “Because the quantity of model weights is very large, you can’t afford to hold them for very long, so you need to keep discarding and reloading,” she said.

Most of the energy spent during AI runtime goes to data transfers: bringing model weights back and forth from memory to compute. By comparison, the energy spent doing computations is low. In deep learning models, for example, the operations are almost all relatively simple matrix-vector multiplications. Compute still accounts for around 10% of the energy in modern AI workloads, so it isn’t negligible, said Tsai. “It is just found to be no longer dominating energy consumption and latency, unlike in conventional workloads,” she added.
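
To make that concrete, here is a minimal sketch in Python with NumPy, using hypothetical layer sizes, of why a single inference step is cheap arithmetic but heavy traffic: every weight in a dense layer has to be fetched, yet each one contributes only a couple of floating-point operations.

```python
# Illustrative sketch only: one dense layer of inference is a matrix-vector
# multiply, and the bytes of weights fetched rival the arithmetic performed.
import numpy as np

d_in, d_out = 4096, 4096                              # hypothetical layer sizes
W = np.random.randn(d_out, d_in).astype(np.float16)   # model weights
x = np.random.randn(d_in).astype(np.float16)          # input activations

y = W @ x                                             # the compute: ~2 * d_in * d_out FLOPs

flops = 2 * d_in * d_out
weight_bytes = W.nbytes                               # bytes that must cross the memory bus
print(f"{flops:,} FLOPs vs. {weight_bytes:,} weight bytes moved")
# Each 2-byte weight is read once and used for only ~2 floating-point
# operations, so moving W costs about as much as the arithmetic it enables.
```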

About a decade ago, the von Neumann bottleneck wasn’t a significant issue because processing and memory weren’t so efficient relative to the energy spent transferring data, said Le Gallo-Bourdeau. But data transfer efficiency hasn’t improved as much as processing and memory have over the years, so now processors can complete their computations much more quickly, leaving them sitting idle while data moves across the von Neumann bottleneck.

The farther away the memory is from the processor, the more energy it costs to move data. On a basic physical level, a copper wire is charged to propagate a 1 and discharged to propagate a 0. The energy spent charging and discharging a wire is proportional to its length, so the longer the wire, the more energy each bit costs. Longer wires also mean greater latency, because the charge takes more time to propagate or dissipate.
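
A back-of-envelope sketch of that scaling, using assumed round numbers (roughly 0.2 pF of wire capacitance per millimeter and a 1-volt swing) rather than measured figures:

```python
# Back-of-envelope sketch with assumed constants (not measured values):
# the energy to charge a wire scales with its capacitance, which scales with length.
CAP_PER_MM = 0.2e-12   # assumed on-chip wire capacitance, ~0.2 pF per millimeter
V_SWING = 1.0          # assumed voltage swing in volts

def energy_per_bit(wire_length_mm):
    """Energy in joules drawn to charge a wire of the given length once."""
    capacitance = CAP_PER_MM * wire_length_mm
    return capacitance * V_SWING ** 2   # E ~ C * V^2 per transition

for length_mm in (0.1, 1.0, 10.0):      # short on-chip hop vs. cross-chip vs. off-chip scale
    print(f"{length_mm:5.1f} mm -> {energy_per_bit(length_mm) * 1e12:.2f} pJ per bit")
# Ten times the distance costs roughly ten times the energy per bit moved.
```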

Admittedly, the time and energy cost of each individual transfer is low, but every time you want to propagate data through a large language model, you need to load up to billions of weights from memory. That can mean pulling them from the DRAM of one or more other GPUs, because a single GPU doesn’t have enough memory to store them all. Once the weights reach the processor, it performs its computations and sends the result to another memory location for further processing.
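
As a rough, assumed-numbers illustration of what that shuttling adds up to, consider a hypothetical 3-billion-parameter model stored in 16-bit weights and a memory system with about 1 TB/s of bandwidth:

```python
# Rough estimate with assumed round numbers: the traffic from reloading weights
# for a single pass through a large model.
params = 3e9                 # hypothetical 3-billion-parameter model
bytes_per_weight = 2         # 16-bit weights
bandwidth = 1e12             # assumed ~1 TB/s of memory bandwidth

weight_traffic = params * bytes_per_weight    # bytes moved per forward pass
transfer_time = weight_traffic / bandwidth    # seconds spent just moving weights

print(f"{weight_traffic / 1e9:.0f} GB of weights per pass, "
      f"~{transfer_time * 1e3:.0f} ms spent on the transfer alone")
# Generating text repeats this for every token, so the shuttle time and energy
# add up long before the arithmetic itself becomes the limit.
```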

Short of eliminating the von Neumann bottleneck entirely, one solution is to close that distance. “The entire industry is working to try to improve data localization,” Tsai said. IBM Research scientists recently announced such an approach: a polymer optical waveguide for co-packaged optics. This module brings the speed and bandwidth density of fiber optics to the edge of chips, supercharging their connectivity and dramatically reducing model training time and energy costs.

With currently available hardware, though, the result of all these data transfers is that training an LLM can easily take months, consuming more energy than a typical US home does in that time. And AI doesn’t stop needing energy after model training. Inferencing has similar computational requirements, meaning that the von Neumann bottleneck slows it down in a similar fashion.

An infographic comparing von Neumann architecture to in-memory computing
a. In a conventional computing system, when an operation f is performed on data D, D has to be moved into a processing unit, leading to significant costs in latency and energy. b. In the case of in-memory computing, f(D) is performed within a computational memory unit by exploiting the physical attributes of the memory devices, thus obviating the need to move D to the processing unit. The computational tasks are performed within the confines of the memory array and its peripheral circuitry, albeit without deciphering the content of the individual memory elements. Both charge-based memory technologies, such as SRAM, DRAM, and flash memory, and resistance-based memory technologies, such as RRAM, PCM, and STT-MRAM, can serve as elements of such a computational memory unit. Source: Nature Nanotechnology

Getting around the bottleneck

For the most part, model weights are stationary, and AI computing is memory-centric rather than compute-heavy, said Le Gallo-Bourdeau. “You have a fixed set of synaptic weights, and you just need to propagate data through them.”

This quality has enabled him and his colleagues to pursue analog in-memory computing, which integrates memory with processing, using the laws of physics to store weights. One of these approaches is phase-change memory (PCM), which stores model weights in the resistivity of a chalcogenide glass; applying an electrical current changes that resistivity.
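
A minimal numerical sketch of the idea, assuming weights are mapped to device conductances: inputs are applied as voltages, and Ohm’s law plus Kirchhoff’s current law sum the products as currents, so the multiply-accumulate happens where the weights are stored. The sizes and noise level below are illustrative, not PCM device data.

```python
# Idealized sketch of analog in-memory matrix-vector multiplication:
# weights sit as device conductances G, inputs arrive as voltages x, and
# Ohm's law plus Kirchhoff's current law sum the products as currents G @ x.
# Sizes and noise level are illustrative, not PCM device data.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))      # trained weights (stationary)
x = rng.standard_normal(8)           # input activations

# "Program" the weights into conductances once, with some write imprecision.
G = W + rng.normal(scale=0.02, size=W.shape)

y_analog = G @ x                     # computed in place; no weights move
y_digital = W @ x                    # exact digital reference

print("max deviation from the exact result:", np.max(np.abs(y_analog - y_digital)))
```

The result is approximate, which is part of why high-precision work stays on conventional hardware, as discussed below.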

“This way we can reduce the energy that is spent in data transfers and mitigate the von Neumann bottleneck,” said Le Gallo-Bourdeau. In-memory computing isn’t the only way to work around the von Neumann bottleneck, though.

The AIU NorthPole is a processor that stores its weights in digital SRAM, and while its memory isn’t intertwined with compute in the same way as in analog chips, each of its numerous cores has access to its own local memory, making it an extreme example of near-memory computing. Experiments have already demonstrated the power and promise of this architecture. In recent inference tests run on a 3-billion-parameter LLM developed from IBM’s Granite-8B-Code-Base model, NorthPole was 47 times faster than the next most energy-efficient GPU and 73 times more energy efficient than the next lowest-latency GPU.
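
A simplified sketch of the near-memory idea, with arbitrary tile counts and sizes rather than NorthPole’s actual layout: the weight matrix is split so that each core multiplies against the slice it already holds locally, and only small partial results travel between cores.

```python
# Simplified sketch of near-memory computing with arbitrary sizes: the weight
# matrix is tiled across cores so each core computes on the rows it holds
# locally, and only small partial results travel between cores.
import numpy as np

rng = np.random.default_rng(1)
n_cores = 4
W = rng.standard_normal((256, 128))        # weights, split row-wise across cores
x = rng.standard_normal(128)               # input broadcast to every core

tiles = np.split(W, n_cores, axis=0)       # each core keeps its own rows in local memory
partials = [tile @ x for tile in tiles]    # compute happens next to the stored weights
y = np.concatenate(partials)               # only the outputs move between cores

assert np.allclose(y, W @ x)               # same result as the monolithic multiply
```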

It’s also important to note that models trained on von Neumann hardware can be run on non-von Neumann devices. In fact, for analog in-memory computing, it’s essential: PCM devices aren’t durable enough to have their weights rewritten over and over, so they’re used to deploy models that have been trained on conventional GPUs. Durability is a comparative advantage of SRAM in near-memory or in-memory computing, as it can be rewritten practically without limit.

Why von Neumann computing isn’t going away

While von Neumann architecture creates a bottleneck for AI computing, it’s perfectly suited to plenty of other applications. Sure, it causes issues in model training and inference, but von Neumann architecture excels at computer graphics and other compute-heavy workloads. And when 32- or 64-bit floating-point precision is called for, the low precision of in-memory computing isn’t up to the task.

“For general purpose computing, there’s really nothing more powerful than the von Neumann architecture,” said Burr. Under these circumstances, bytes are either instructions or operands moving on a bus from memory to a processor. “Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you’re able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row.” Special-purpose computing, on the other hand, may involve 5,000 tuna sandwiches for one order, like AI computing as it shuttles static model weights.

Even when building their in-memory AIU chips, IBM researchers include some conventional hardware for the necessary high-precision operations.

Even as scientists and engineers work on new ways to eliminate the von Neumann bottleneck, experts agree that the future will likely include both hardware architectures, said Le Gallo-Bourdeau. “What makes sense is some mix of von Neumann and non-von Neumann processors to each handle the operations they are best at.”