Ilias Iliadis
International Journal On Advances In Networks And Services
HPC applications increasingly use cloud resources because of their cost-effectiveness. Among these resources, spot compute instances offer an opportunity to run applications at deep discounts compared with on-demand instances. However, spot instances pose unique challenges for tightly coupled HPC applications because they can be interrupted. Traditional parallel programming models like MPI are not inherently fault-tolerant, and existing methods for handling these interruptions are inefficient and require significant programmer effort. In this paper, we present Charm++ as an alternative that natively supports fault tolerance, dynamic load balancing, and resource rescaling. We present a tool for running Charm++ applications on a mix of on-demand and spot instances that can detect spot interruptions and handle them efficiently. We show that using spot instances can yield cost savings of up to 60% for our benchmark application.
Alessandro Pomponio
Kubecon + CloudNativeCon NA 2025
Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024