Effective cluster management for large-scale AI and GPUs: Challenges and opportunities
Abstract
Managing large GPU clusters dedicated to cloud-native AI workloads presents new challenges. The workload mix is diverse, and GPUs must be utilized effectively and shared dynamically across multiple teams. Furthermore, GPUs are subject to a variety of performance degradations and faults that can severely impact multi-GPU jobs, requiring continuous monitoring and enhanced diagnostics. Cloud-native tools such as Kubeflow and Kueue are the building blocks for the large-scale GPU clusters used by teams across IBM Research for training, tuning, and inference jobs. In this talk, IBM Research will share and demonstrate lessons learnt in configuring large-scale GPU clusters and in developing Kubernetes-native automation to run health checks on GPUs and report their status. Finally, the talk will show how diagnostics enable both the dynamic adjustment of quotas to account for faulty GPUs and the automatic steering of new and existing workloads away from nodes with faulty GPUs.
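As a rough illustration of the steering idea in the abstract, one standard Kubernetes mechanism is to taint nodes that health checks flag as unhealthy, so the scheduler keeps new pods away unless they carry a matching toleration. This is a minimal sketch under that assumption; the taint key, node name, and effect below are hypothetical and may differ from the automation discussed in the talk:

```yaml
# Hypothetical taint that GPU health-check automation could apply to a node
# with a faulty GPU. Pods without a matching toleration will not be
# scheduled onto this node (NoSchedule effect).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17                    # example node name (hypothetical)
spec:
  taints:
  - key: example.com/gpu-unhealthy     # hypothetical key, not from the talk
    value: "true"
    effect: NoSchedule                 # steer new workloads away
```

Using the `NoExecute` effect instead would additionally evict already-running pods that lack the toleration, which is one way existing workloads could be moved off a degraded node; the talk may use a different mechanism.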