Vela is a cloud-native system for LLM training workloads, built from off-the-shelf hardware, Linux KVM-based virtualization, and a virtualized RDMA over Converged Ethernet (RoCE) network. Vela virtual machines (VMs) support peer-to-peer DMA between the GPUs and SR-IOV-based network interfaces.
In this paper, we share Vela's key architectural aspects, with details from an Nvidia A100 GPU-based deployment in one of our data centers. Throughout the paper, we share insights and experiences from designing, building, and operating the system over a ~2.5-year timeframe, highlighting the capabilities of readily available software and hardware technologies and the improvement opportunities for future AI systems, thereby making AI infrastructure more accessible to a broader community. When we evaluated the system's performance at ~1500-GPU scale, we achieved ~80% of the ideal throughput while training a 50-billion-parameter decoder model with model parallelism, and ~70% of the per-GPU FLOPS of a single VM on the High-Performance Linpack benchmark.
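The abstract does not include code; as a rough illustration of what "peer-to-peer DMA between the GPUs and SR-IOV-based network interfaces" implies inside a VM, the following minimal Python sketch walks the Linux PCI sysfs tree and pairs each NVIDIA GPU with any Mellanox virtual function that shares its upstream PCIe port, a common prerequisite for efficient P2P DMA. The vendor IDs and sysfs paths are standard Linux/PCI facts; the pairing heuristic and output are illustrative assumptions, not Vela's actual tooling.

    # Illustrative sketch (not from the paper): check whether GPUs and NIC VFs
    # in a VM share an upstream PCIe port, a rough proxy for a placement where
    # peer-to-peer DMA need not traverse the root complex.
    import os

    PCI_ROOT = "/sys/bus/pci/devices"
    NVIDIA, MELLANOX = "0x10de", "0x15b3"  # standard PCI vendor IDs

    def read(dev, attr):
        with open(os.path.join(PCI_ROOT, dev, attr)) as f:
            return f.read().strip()

    def upstream_port(dev):
        # Resolve the device's parent bridge from its canonical sysfs path,
        # e.g. .../0000:00:01.0/0000:01:00.0 -> 0000:00:01.0
        parts = os.path.realpath(os.path.join(PCI_ROOT, dev)).split("/")
        return parts[-2]

    gpus, nics = [], []
    for dev in sorted(os.listdir(PCI_ROOT)):
        vendor = read(dev, "vendor")
        if vendor == NVIDIA:
            gpus.append(dev)
        elif vendor == MELLANOX:
            nics.append(dev)

    # Report, for each GPU, the NIC VFs hanging off the same upstream port.
    for gpu in gpus:
        peers = [n for n in nics if upstream_port(n) == upstream_port(gpu)]
        print(f"GPU {gpu}: co-located NICs -> {peers or 'none (P2P may be slower)'}")

Run inside a Linux VM, this prints one line per GPU; an empty pairing suggests GPU-to-NIC traffic would cross the virtual root complex, which typically reduces P2P DMA throughput.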