Volume 32, Issue 3 e4863
SPECIAL ISSUE PAPER

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Sourav Chakraborty

Corresponding Author

The Ohio State University, Columbus, Ohio

Sourav Chakraborty, The Ohio State University, Columbus, Ohio.

Ignacio Laguna

Lawrence Livermore National Laboratory, Livermore, California

Murali Emani

Lawrence Livermore National Laboratory, Livermore, California

Kathryn Mohror

Lawrence Livermore National Laboratory, Livermore, California

Dhabaleswar K. Panda

The Ohio State University, Columbus, Ohio

Martin Schulz

Technische Universität München, Munich, Germany

Hari Subramoni

The Ohio State University, Columbus, Ohio

First published: 14 August 2018

Summary

Scientists in many fields have been developing bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
