Jannis Born, Matteo Manica
ICLR 2022
Backward error recovery, based on checkpointing and rollback, is often used to implement fault tolerance in multicomputer systems. During failure-free operation, process states are saved at regular intervals; once a fault is detected, the system is rolled back to a previously saved state. Four classes of techniques can be distinguished: semiautomatic techniques, message logging, coordinated checkpointing, and hybrid techniques. The authors survey these alternatives and discuss the overhead each may incur, allowing the user to choose an optimal checkpointing and rollback technique for the given facilities and applications.
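
The basic save-and-restore cycle described above can be illustrated with a minimal sketch. It assumes a single process whose state is an in-memory dictionary, and all names (`CheckpointedProcess`, `interval`, `increment`) are hypothetical choices made for this illustration, not taken from the surveyed techniques. It does not model message logging or coordination between processes, which is what distinguishes the four classes.

```python
import copy


class CheckpointedProcess:
    """Toy single-process model of checkpointing and rollback (hypothetical names)."""

    def __init__(self, state, interval):
        self.state = state
        self.interval = interval  # steps between checkpoints (assumed policy)
        self.steps_done = 0
        # Take an initial checkpoint so that a rollback is always possible.
        self._saved = (0, copy.deepcopy(state))

    def step(self, update):
        """Apply one update, then save a checkpoint every `interval` steps."""
        update(self.state)
        self.steps_done += 1
        if self.steps_done % self.interval == 0:
            self._saved = (self.steps_done, copy.deepcopy(self.state))

    def rollback(self):
        """Restore the most recent checkpoint and return the number of steps lost."""
        saved_steps, saved_state = self._saved
        lost = self.steps_done - saved_steps
        self.steps_done = saved_steps
        self.state = copy.deepcopy(saved_state)
        return lost


def increment(state):
    state["counter"] += 1


proc = CheckpointedProcess({"counter": 0}, interval=3)
for _ in range(5):
    proc.step(increment)

lost = proc.rollback()  # simulate a fault detected after step 5
print(proc.state, lost)  # {'counter': 3} 2
```

The `interval` parameter exposes the overhead trade-off in its simplest form: a shorter interval means more frequent state copies during failure-free operation, while a longer one means more work is lost and must be repeated after a rollback.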