Talk

Build, Operate, and Use a multi-tenant AI cluster based entirely on open source

Abstract

With GPUs scarce and costly, multi-tenant Kubernetes clusters that can queue and prioritize complex, heterogeneous AI/ML workloads while achieving both high utilization and fair sharing are a necessity for many organizations. This tutorial will teach the audience how to build, operate, and use such an AI cluster. Starting from either a managed or an on-premise Kubernetes cluster, we will demonstrate how to install and configure a number of open source projects (and only open source projects), such as Kueue, Kubeflow, PyTorch, Ray, vLLM, and Autopilot, to support the full AI model lifecycle (from data preprocessing to LLM training and inference), configure teams and quotas, monitor GPUs, and largely automate fault detection and recovery. By the end of the tutorial, participants will have a thorough understanding of the AI software stack that IBM Research has refined over several years to effectively manage and utilize thousands of GPUs. Come learn the recipe and try it at home!
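To give a flavor of the team-and-quota configuration step, here is a minimal sketch (not taken from the tutorial materials) of the two Kueue objects involved: a ClusterQueue that holds a team's resource quota, and a LocalQueue in the team's namespace that workloads submit to. The names ("team-a", "default-flavor", "team-a-ns") and quota values are illustrative assumptions; the schema follows Kueue's v1beta1 API.

```python
# Sketch: build Kueue quota manifests for one team. In practice these
# would be written as YAML and applied with kubectl; building them as
# dicts here keeps the example self-contained.
import json


def cluster_queue(name: str, gpu_quota: int) -> dict:
    """ClusterQueue manifest granting a team a nominal GPU quota.

    CPU/memory quotas and the ResourceFlavor name are assumed values.
    """
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # admit workloads from any namespace
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
                "flavors": [{
                    "name": "default-flavor",  # assumed ResourceFlavor
                    "resources": [
                        {"name": "cpu", "nominalQuota": 64},
                        {"name": "memory", "nominalQuota": "512Gi"},
                        {"name": "nvidia.com/gpu", "nominalQuota": gpu_quota},
                    ],
                }],
            }],
        },
    }


def local_queue(namespace: str, cluster_queue_name: str) -> dict:
    """LocalQueue in the team's namespace, pointing at the ClusterQueue.

    Jobs opt in by carrying the kueue.x-k8s.io/queue-name label.
    """
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "LocalQueue",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue_name},
    }


if __name__ == "__main__":
    print(json.dumps(cluster_queue("team-a", 8), indent=2))
    print(json.dumps(local_queue("team-a-ns", "team-a"), indent=2))
```

With one ClusterQueue per team (or a cohort sharing borrowable quota), Kueue queues jobs that exceed the quota and admits them as resources free up, which is what enables both fair sharing and high utilization on scarce GPUs.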

Related