EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
Sourav Chakraborty, The Ohio State University, Columbus, Ohio (corresponding author; email: [email protected])
Ignacio Laguna, Lawrence Livermore National Laboratory, Livermore, California
Murali Emani, Lawrence Livermore National Laboratory, Livermore, California
Kathryn Mohror, Lawrence Livermore National Laboratory, Livermore, California

Summary
Scientists from many different fields develop bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency, their scalability has not been demonstrated, and they rely on the MPI profiling interface (PMPI), which prevents other PMPI-based tools from being used alongside them. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI alone. We demonstrate EReinit on three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
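To make the global-restart programming pattern described above concrete, the following minimal C sketch shows how a bulk-synchronous application might structure its main loop around a registered resume point. The entry point MPI_Reinit, its signature, and the checkpoint helpers are assumptions made purely for illustration and are not the EReinit interface described in the paper; the hypothetical call is left commented out so the sketch builds against standard MPI.

```c
/*
 * Illustrative sketch only. The global-restart idea: on a process failure,
 * all surviving processes (and any replacements) return to a registered
 * resume point and MPI is reinitialized, instead of relaunching the job.
 * MPI_Reinit and the checkpoint helpers below are hypothetical.
 */
#include <mpi.h>
#include <stdio.h>

/* Resume routine: entered after the initial launch and, conceptually,
 * re-entered after every global restart with is_restart = 1. */
static int resume(int argc, char **argv, int is_restart)
{
    int rank, size, step = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (is_restart) {
        /* Reload application state from the latest checkpoint
         * (application-specific; not shown here). */
        /* step = load_checkpoint(rank); */
    }

    for (; step < 100; step++) {
        /* ... bulk-synchronous compute phase ... */
        MPI_Barrier(MPI_COMM_WORLD);          /* synchronization phase */
        /* if (step % 10 == 0) save_checkpoint(rank, step); */
    }

    if (rank == 0)
        printf("completed %d steps on %d ranks\n", step, size);
    return 0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Hypothetical registration of the resume point: on a failure, the
     * runtime would reinitialize MPI and re-invoke resume() with
     * is_restart = 1. */
    /* MPI_Reinit(&argc, &argv, resume); */
    resume(argc, argv, 0);                    /* fallback path for this sketch */
    MPI_Finalize();
    return 0;
}
```

The design point this sketch is meant to convey is that recovery state lives in application-level checkpoints and a single re-entry point, which is what allows the runtime to restart all processes globally without tearing down the job.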