F3: Serving Files Efficiently in Serverless Computing
Alex Merenstein, Vasily Tarasov, et al.
SYSTOR 2023
Cluster System Management (CSM) was co-designed with the Department of Energy Labs to provide the support necessary to effectively manage the Summit and Sierra supercomputers. The CSM system administration tools provide a unified view of a large-scale cluster and the ability to examine and understand data from multiple sources. CSM consists of five components: 1) application programming interfaces (APIs) and infrastructure; 2) Big Data Store; 3) support for reliability, availability, and serviceability (RAS); 4) Diagnostic and Health Check; and 5) support for job management. The APIs and infrastructure provide lightweight daemons for compute nodes, hardware and software inventory collection, job accounting, and RAS. Logs, environmental data, and performance data are collected in the Big Data Store for analysis. RAS events can trigger corrective actions by CSM. Diagnostic and Health Check is provided through a diagnostic framework and the collection of test results. To support job management, CSM coordinates with the Job Step Manager (JSM) to provide an overlay network of JSM daemons. CSM is open source and available at https://github.com/IBM/CAST. Documentation can be found at https://cast.readthedocs.io.
Jose Santos, Chen Wang, et al.
IEEE TNSM
Pravein Govindan Kannan, Brent Salisbury, et al.
arXiv
Runyu Jin, Paul Muench, et al.
ICPE 2024