Hardware MPI message matching: Insights into MPI matching behavior to inform design
Kurt Ferreira
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Ryan E. Grant
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Michael J. Levenhagen
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Scott Levy (corresponding author)
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico
Correspondence: Scott Levy, Center for Computing Research, Sandia National Laboratories, PO Box 5800, Albuquerque, NM 87185. Email: [email protected]
Taylor Groves
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, California

Summary
This paper explores key differences in the MPI match lists of several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. We present the results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, using a modified version of the LogGOPSim MPI simulator that collects match list statistics. These results are discussed in the context of several potential design approaches to MPI matching-capable hardware. The data illustrate the requirements that different hardware designs must meet in terms of performance and memory capacity. This paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and to highlight the difficulty of determining these requirements by examining only a single MPI implementation.
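To make the matching behavior under study concrete, the sketch below illustrates the posted-receive list walk that a software MPI library performs and that matching-capable hardware must reproduce: each incoming message header is compared, in posting order, against the (source, tag, communicator) of every posted receive, honoring the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards. This is a minimal illustrative model only; the match_entry layout, the find_match helper, and the ANY_SOURCE/ANY_TAG stand-ins are assumptions for exposition and are not taken from MPICH, Open MPI, or LogGOPSim.

#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

#define ANY_SOURCE (-1)  /* stand-in for MPI_ANY_SOURCE */
#define ANY_TAG    (-1)  /* stand-in for MPI_ANY_TAG    */

/* One posted receive, kept in posting (FIFO) order. */
typedef struct match_entry {
    int source;               /* requested sender rank, or ANY_SOURCE */
    int tag;                  /* requested tag, or ANY_TAG            */
    int comm_id;              /* communicator context identifier      */
    struct match_entry *next; /* next posted receive (older to newer) */
} match_entry;

/* Walk the posted-receive list in posting order and return the first entry
 * whose communicator, source, and tag accept the incoming header. The length
 * of this walk is what match list statistics measure, and it is the operation
 * a matching-capable NIC must perform at line rate. */
static match_entry *find_match(match_entry *head, int src, int tag, int comm_id)
{
    for (match_entry *e = head; e != NULL; e = e->next) {
        bool comm_ok = (e->comm_id == comm_id);
        bool src_ok  = (e->source == ANY_SOURCE || e->source == src);
        bool tag_ok  = (e->tag    == ANY_TAG    || e->tag    == tag);
        if (comm_ok && src_ok && tag_ok)
            return e;  /* earliest posted matching receive wins */
    }
    return NULL;       /* no match: the message becomes "unexpected" */
}

int main(void)
{
    /* Two posted receives on communicator 0: a specific (source 3, tag 7)
     * receive posted first, then a wildcard receive. */
    match_entry wild  = { ANY_SOURCE, ANY_TAG, 0, NULL  };
    match_entry exact = { 3, 7, 0, &wild };

    match_entry *hit = find_match(&exact, /*src=*/3, /*tag=*/7, /*comm_id=*/0);
    printf("matched the %s receive\n", hit == &exact ? "exact" : "wildcard");
    return 0;
}

Because the first-posted matching entry must win, long match lists translate directly into per-message search cost, which is why the list-length data reported here matter to hardware designers.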