Hardware MPI message matching: Insights into MPI matching behavior to inform design
Kurt Ferreira
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Ryan E. Grant
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Michael J. Levenhagen
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Scott Levy (corresponding author)
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Correspondence: Scott Levy, Center for Computing Research, Sandia National Laboratories, PO Box 5800, Albuquerque, NM 87185. Email: [email protected]
Taylor Groves
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, California

Summary
This paper explores key differences in the MPI match lists of several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. We present the results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, using a modified version of the LogGOPSim MPI simulator that collects match list statistics. These results are discussed in the context of several potential design approaches to MPI matching-capable hardware. The data illustrate the requirements that different hardware designs must meet in terms of performance and memory capacity. This paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and to highlight the difficulty of determining these requirements by examining only a single MPI implementation.
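To make the matching behavior under study concrete, the sketch below illustrates the posted-receive list walk that a software MPI library performs and that matching-capable hardware must reproduce: each incoming message header is compared, in posting order, against the (source, tag, communicator) of every posted receive, honoring the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards. This is a minimal illustrative model only; the match_entry layout, the find_match helper, and the ANY_SOURCE/ANY_TAG stand-ins are assumptions for exposition and are not taken from MPICH, Open MPI, or LogGOPSim.

#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

#define ANY_SOURCE (-1)  /* stand-in for MPI_ANY_SOURCE */
#define ANY_TAG    (-1)  /* stand-in for MPI_ANY_TAG    */

/* One posted receive, kept in posting (FIFO) order. */
typedef struct match_entry {
    int source;               /* requested sender rank, or ANY_SOURCE */
    int tag;                  /* requested tag, or ANY_TAG            */
    int comm_id;              /* communicator context identifier      */
    struct match_entry *next; /* next posted receive (older to newer) */
} match_entry;

/* Walk the posted-receive list in posting order and return the first entry
 * whose communicator, source, and tag accept the incoming header. The length
 * of this walk is what match list statistics measure, and it is the operation
 * a matching-capable NIC must perform at line rate. */
static match_entry *find_match(match_entry *head, int src, int tag, int comm_id)
{
    for (match_entry *e = head; e != NULL; e = e->next) {
        bool comm_ok = (e->comm_id == comm_id);
        bool src_ok  = (e->source == ANY_SOURCE || e->source == src);
        bool tag_ok  = (e->tag    == ANY_TAG    || e->tag    == tag);
        if (comm_ok && src_ok && tag_ok)
            return e;  /* earliest posted matching receive wins */
    }
    return NULL;       /* no match: the message becomes "unexpected" */
}

int main(void)
{
    /* Two posted receives on communicator 0: a specific (source 3, tag 7)
     * receive posted first, then a wildcard receive. */
    match_entry wild  = { ANY_SOURCE, ANY_TAG, 0, NULL  };
    match_entry exact = { 3, 7, 0, &wild };

    match_entry *hit = find_match(&exact, /*src=*/3, /*tag=*/7, /*comm_id=*/0);
    printf("matched the %s receive\n", hit == &exact ? "exact" : "wildcard");
    return 0;
}

Because the first-posted matching entry must win, long match lists translate directly into per-message search cost, which is why the list-length data reported here matter to hardware designers.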