Lossless compression tailored for AI
ZipNN, IBM’s new open-source compression library, can cut AI storage costs by a third, and speed up transfers by up to 150% — with zero performance loss.
If you’ve shared photos or music online, you’ve undoubtedly handled a JPG or MP3 file. Both compression formats shrink files by discarding details perceived as less important to the eye and ear. ZIP compression, by contrast, consolidates redundant information so that files can be restored to their original size and quality when decompressed.
Due to its “lossless” nature, ZIP is often used to archive documents. It wasn’t considered suitable for compressing AI models with billions of seemingly random numerical weights, until now. When researchers looked closer at model weights, they discovered patterns that could be exploited to shrink both model size and the bandwidth needed to move models over a network.
In collaboration with Boston University, Dartmouth, MIT, and Tel Aviv University, IBM researchers created ZipNN, an open-source library that automatically finds the best compression algorithm for a given model based on its properties. ZipNN can shrink models stored in popular LLM formats by a third and compress and decompress them 1.5 times faster than the next best technique. The team reported their findings in a new study to be presented at the upcoming 2025 IEEE Cloud conference.
“Our method can bring down AI storage and transfer costs with virtually no downside,” said Moshik Hershcovitch, an IBM researcher focused on AI and cloud infrastructure. “When you unzip the file, it returns to its original state. You don’t lose anything.”
There are several ways of slimming down an AI model to reduce its size and the cost of sending it over a network. The model can be pruned to trim extraneous weights. The numerical precision of its weights can be reduced through quantization. Or it can be distilled into a more compact version of itself through a student-teacher learning process. All three methods can boost AI inferencing speeds, but because they remove information from the model, quality sometimes suffers.
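To make that trade-off concrete, here is a minimal sketch of the quantization idea in Python, using NumPy and a toy 8-bit scheme rather than any production method; the only point is that the round trip is approximate.

```python
import numpy as np

# Toy symmetric 8-bit quantization: scale FP32 weights into the int8 range
# and back. The round trip is close, but not exact: information is lost.
weights = (np.random.randn(8) * 0.05).astype(np.float32)
scale = np.abs(weights).max() / 127                      # one scale factor for the whole tensor
quantized = np.round(weights / scale).astype(np.int8)    # 4 bytes per weight -> 1 byte
restored = quantized.astype(np.float32) * scale

print("max round-trip error:", np.max(np.abs(weights - restored)))  # small, but not zero
```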
Lossless compression, by contrast, squeezes out repetitive information and restores it exactly on decompression: LZ compression replaces repeated sequences with short references to earlier copies, while entropy encoding gives the most frequently used elements the shortest codes.
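A quick illustration of that round trip, using Python’s built-in zlib (whose DEFLATE format combines LZ77 back-references with Huffman entropy coding) rather than ZipNN’s own pipeline:

```python
import zlib

# Repetitive input compresses well because LZ77 replaces repeated sequences
# with short back-references and Huffman coding shortens frequent symbols.
data = b"the cat sat on the mat; " * 1000
compressed = zlib.compress(data, level=9)
print(len(data), "->", len(compressed), "bytes")

# Lossless: decompression returns the input byte for byte.
assert zlib.decompress(compressed) == data
```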
Information in AI models is stored as numerical weights, which are typically small floating-point numbers clustered near zero. Each number has three parts: a sign, an exponent, and a fraction. While the sign and fraction are essentially random, IBM researchers recently discovered that the exponents are highly skewed. Out of the 256 possible values an 8-bit exponent can take, the same 12 appear 99.9% of the time.
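The sketch below illustrates that anatomy on synthetic weights, a Gaussian stand-in rather than a real checkpoint: it splits each BF16 value (the top 16 bits of an IEEE float32) into its sign, exponent, and fraction fields, then measures how lopsided the exponent distribution is.

```python
import numpy as np
from collections import Counter

# Synthetic stand-in for model weights: small values clustered near zero.
weights = (np.random.randn(1_000_000) * 0.02).astype(np.float32)

# BF16 is the top 16 bits of an IEEE float32: 1 sign, 8 exponent, 7 fraction bits.
bf16 = (weights.view(np.uint32) >> 16).astype(np.uint16)
sign     = bf16 >> 15           # 1 bit
exponent = (bf16 >> 7) & 0xFF   # 8 bits -> 256 possible values
fraction = bf16 & 0x7F          # 7 bits

# The exponent byte is heavily skewed toward a handful of values.
counts = Counter(exponent.tolist())
top12 = sum(c for _, c in counts.most_common(12))
print(f"top 12 exponent values cover {top12 / len(exponent):.2%} of all weights")
```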
The team behind ZipNN realized they could use a form of entropy encoding to exploit this imbalance. They separated the exponents of each floating-point number from its random signs and fractions and then used Huffman encoding to compress the exponents.
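As a rough, self-contained illustration of that step, not ZipNN’s actual encoder, the sketch below computes Huffman code lengths for a hypothetical exponent histogram (in practice, the counts gathered in the previous sketch) and compares the coded size with the 8 bits per exponent stored on disk.

```python
import heapq
from collections import Counter

def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code built from counts."""
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol inside them gets one bit deeper.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical exponent histogram, used here only for illustration.
counts = Counter({122: 300_000, 121: 272_000, 120: 197_000, 119: 100_000,
                  118: 50_000, 123: 46_000, 117: 25_000, 116: 10_000})
lengths = huffman_code_lengths(counts)
total = sum(counts.values())
coded_bits = sum(counts[sym] * bits for sym, bits in lengths.items())
print(f"Huffman: {coded_bits / total:.2f} bits per exponent, versus 8 bits stored")
```

Because only a dozen or so exponent values account for nearly all weights, the Huffman codes assigned to those values are just a few bits long, which is where the savings come from.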
How much ZipNN can reduce an AI model’s size depends on the model’s bit format, and specifically on how much space the exponents occupy. Exponents take up half the bits in newer models stored in the BF16 format (8 of 16 bits), but only a quarter of the bits in older FP32-format models (8 of 32 bits).
Not surprisingly, ZipNN worked best with BF16 models in the Meta Llama, IBM Granite, and Mistral model series, reducing their size by 33%, an 11% improvement over the next best method, Meta’s Zstandard (zstd), the team reported in their study. Compression and decompression speeds also improved, with Llama 3.1 models showing an average 62% boost over zstd.
The savings were smaller, but still significant, for FP32-format models: ZipNN reduced their size by 17%, an 8% improvement over zstd, while also improving transfer speeds.
The researchers discovered they could get an even higher compression ratio in some popular models that hadn’t been fine-tuned by exploiting redundancies in the fractions of each weight. Dividing each weight’s fraction bytes into three streams, in what they call “byte grouping,” the researchers cut the size of Meta’s xlm-RoBERTa model by more than half.
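The sketch below shows the byte-grouping idea on synthetic FP32 data rather than real model weights; the numbers it prints only illustrate the mechanism, not the results reported in the paper.

```python
import zlib
import numpy as np

# Synthetic FP32 values: their high-order (sign + exponent) bytes repeat heavily.
# In the non-fine-tuned models described above, the redundancy sits in the
# fraction bytes instead, but the grouping trick is the same.
weights = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
plain = len(zlib.compress(weights.tobytes(), 6))

# Byte grouping: collect byte 0 of every weight into one stream, byte 1 into
# the next, and so on, then compress each stream separately.
streams = weights.view(np.uint8).reshape(-1, 4).T
grouped = sum(len(zlib.compress(s.tobytes(), 6)) for s in streams)

print(f"plain: {plain:,} bytes   byte-grouped: {grouped:,} bytes")
```

Grouping same-position bytes lets a general-purpose compressor see the low-entropy bytes as one long, highly repetitive run instead of finding them interleaved with nearly random bytes.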
Hugging Face recently implemented byte grouping in its new storage backend. In a blog post announcing the move in February, one author noted in the comments section that their use of “byte grouping inspired by ZipNN” had saved about 20% in storage costs.
The ZipNN method could save time and money for enterprises and other organizations that train, store, or serve AI models at scale. A full implementation of ZipNN at Hugging Face, which serves more than a million models per day, could eliminate petabytes of storage and exabytes of network traffic, the researchers estimate. It could also reduce the time users spend uploading and downloading models from the site.
Storing “checkpoint” models is another potential use case. Developers save hundreds to thousands of early versions of each finished AI model they produce, and shrinking even a fraction of these rough drafts could lead to substantial cost savings. The team behind IBM Granite plans to implement ZipNN soon to compress their checkpoint models.
The researchers behind ZipNN are also exploring how to integrate the library with vLLM, the open-source platform for efficient AI inferencing.
Figure 2: Information in AI models is stored as numerical weights, which are typically small floating-point numbers clustered near zero. Each number has three parts: a sign, an exponent, and a fraction. While the sign and fraction are essentially random, researchers recently discovered that the exponents are highly skewed: out of 256 possible values, just 12 appear 99.9% of the time.