Scientists from many fields have developed bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches, in which these mechanisms are implemented in MPI only. We demonstrate EReinit on three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.


Volume 32, Issue 3

Special Issue on Graph Computing (GRAPH 2017) and Special Issue of the Workshop on Exascale MPI (EXAMPI2017)

10 February 2020

e4863

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
