Speeding up Model Loading with fastsafetensors

Abstract

The rapid increase in model parameter sizes presents a new challenge in loading pre-trained models. Currently, machine learning code often instantiates each parameter as a tensor object in host memory and then copies it to device memory sequentially. We found that this typical approach results in a lack of concurrency and inefficient file access patterns, leading to slow model loading. In this work, we design and implement fastsafetensors to optimize the deserialization of parameters in pre-trained models. Our key idea is to first copy a group of on-disk parameters to device memory and then directly instantiate them as tensor objects. This simple approach enables further optimizations in low-level I/O and high-level tensor preprocessing through parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements ranging from 4.8x to 7.5x when loading the LLaMA models with 7, 13, and 70 billion parameters, the Falcon model with 40 billion parameters, and the Bloom model with 176 billion parameters.
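For illustration, the following conceptual sketch contrasts the typical host-memory-first flow with the direct-to-device idea summarized above. It is not the fastsafetensors API; the file name, the helper load_direct_to_device, and the small dtype table are assumptions made for this example. The idea shown is that the raw safetensors payload is copied to the GPU in one bulk transfer, and tensors are then instantiated from that device buffer using the offsets recorded in the file header.

import json
import struct

import torch
from safetensors.torch import load_file

MODEL_FILE = "model.safetensors"  # placeholder path for this example

# Typical approach: every parameter is materialized as a tensor in host
# memory, then moved to the GPU one by one.
def load_via_host(path, device="cuda:0"):
    state_dict = load_file(path)  # host-memory tensors
    return {k: v.to(device) for k, v in state_dict.items()}  # sequential copies

# Direct-to-device idea (sketch): read the safetensors header, copy the whole
# parameter payload to device memory in one transfer, then instantiate tensors
# as views over slices of that device buffer.
def load_direct_to_device(path, device="cuda:0"):
    dtypes = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}  # common float dtypes only
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian header size
        header = json.loads(f.read(header_len))          # tensor names, dtypes, shapes, offsets
        payload = f.read()                               # one bulk read of all parameters
    buf = torch.frombuffer(bytearray(payload), dtype=torch.uint8).to(device)  # one bulk host-to-device copy
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        begin, end = meta["data_offsets"]
        raw = buf[begin:end].clone()                     # device-side slice (clone avoids alignment issues)
        tensors[name] = raw.view(dtypes[meta["dtype"]]).view(meta["shape"])
    return tensors

The sketch only captures the ordering of the key idea; the paper's actual design additionally parallelizes the copies and uses peer-to-peer DMA and GPU offloading, which are not reproduced here.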

Related