Conference paper

Routing Strategies for RoCE Networks in AI Clouds

Abstract

The explosive growth of Artificial Intelligence (AI) workloads utilizing an ever-increasing number of accelerators has placed unprecedented demand on the network. These workloads typically leverage Remote Direct Memory Access (RDMA) and require a high-performance network fabric. While many purpose-built cloud networking solutions can provide high performance, efficiently utilizing these costly infrastructures requires a fabric that is multi-tenant for ease of consumption. Furthermore, the fabric must be resilient to faults and operationally manageable. Resilient cloud networks typically employ mature Ethernet segmentation techniques over Clos topologies with Equal-Cost Multi-Path (ECMP) routing. ECMP hashes flows to paths, and hash collisions can significantly degrade performance for large RDMA over Converged Ethernet (RoCE) flows. To mitigate ECMP penalties, we evaluate routing strategies with varying levels of operational complexity. We explore load-balancing and path-pinning solutions that leverage non-proprietary, mature technologies over commodity Ethernet. Our evaluation follows a three-fold strategy, focusing on the key dimensions of performance, resiliency, and operational complexity. By applying this methodology to representative implementations, we highlight the trade-offs. While all techniques are resilient, path-pinning-based solutions excel at performance but introduce greater complexity. Specifically, path pinning achieves up to 1.6× improvement over ECMP for RoCE test traffic and up to 2.5× for NCCL AllReduce. These results validate the promising performance benefits of path pinning and highlight the need to explore less complex implementations for broader adoption. Our methodology can be used to rigorously evaluate future implementations in support of AI network design.
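The ECMP collision problem described above can be illustrated with a minimal sketch (not from the paper): a deterministic hash of each flow's 5-tuple selects one of the equal-cost paths, so distinct flows can land on the same link regardless of load. The flow tuples, path count, and hash choice here are illustrative assumptions only.

```python
import hashlib

def ecmp_path(flow_tuple, num_paths):
    """ECMP-style path selection: a deterministic hash of the
    flow 5-tuple (src, dst, sport, dport, proto) picks a path index.
    Real switches use vendor-specific hardware hashes; SHA-256 is
    used here purely for a reproducible illustration."""
    digest = hashlib.sha256(repr(flow_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Illustrative scenario: 8 large RoCE flows (UDP port 4791) spread
# over 8 equal-cost paths. Hashing is oblivious to flow size and load,
# so some paths typically carry multiple flows while others sit idle;
# colliding elephant flows then share one link's bandwidth.
flows = [("10.0.0.%d" % i, "10.0.1.1", 49152 + i, 4791, "UDP")
         for i in range(8)]
paths = [ecmp_path(f, 8) for f in flows]
collisions = len(paths) - len(set(paths))
```

Path pinning avoids this by explicitly assigning each flow to a distinct path instead of relying on the hash.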
