Volume 32, Issue 3 e4890
SPECIAL ISSUE PAPER

The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints

Scott Levy

Corresponding Author

Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico

Scott Levy, Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico.

Email: [email protected]

Kurt B. Ferreira

Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico

Patrick Widener

Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico

First published: 09 September 2018
Citations: 2

Summary

Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large-scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next-generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
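As a rough illustration of the idea of deferring each checkpoint until a collective operation completes, the following toy model computes when a single process would checkpoint under a nominal interval. It is a sketch under stated assumptions, not the authors' MPI extension or their simulation framework: the function names, the list of collective completion times, and the choice to ignore checkpoint cost are all hypothetical.

```python
from bisect import bisect_left


def approx_coordinated_checkpoints(collective_end_times, nominal_interval):
    """Return the times at which a process would take checkpoints.

    Assumed toy policy: a checkpoint becomes due `nominal_interval` time
    units after the previous checkpoint (or after time 0), and is actually
    taken at the first collective completion at or after the due time.
    The cost of writing a checkpoint is ignored.
    """
    if nominal_interval <= 0:
        raise ValueError("nominal_interval must be positive")

    times = sorted(collective_end_times)
    checkpoints = []
    due = nominal_interval
    while True:
        # Index of the first collective that completes at or after `due`.
        i = bisect_left(times, due)
        if i == len(times):
            break  # no later collective, so no further checkpoint
        checkpoints.append(times[i])
        due = times[i] + nominal_interval
    return checkpoints


def effective_intervals(checkpoints):
    """Return the gaps between consecutive checkpoints, starting from time 0."""
    gaps = []
    previous = 0
    for t in checkpoints:
        gaps.append(t - previous)
        previous = t
    return gaps


if __name__ == "__main__":
    # Hypothetical workload: a collective completes every 3 time units.
    ends = list(range(3, 31, 3))
    taken = approx_coordinated_checkpoints(ends, nominal_interval=10)
    print(taken, effective_intervals(taken))
```

In this model the effective interval exceeds the nominal one by at most the gap between consecutive collectives, which is one way to read the claim that workloads with frequent collectives see only a small change in checkpoint interval.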
