Concurrency and Computation: Practice and Experience

Volume 32, Issue 3 e5158

SPECIAL ISSUE PAPER

Tail queues: A multi-threaded matching architecture

Corresponding Author

Matthew G.F. Dosanjh

orcid.org/0000-0001-5141-9176

Center for Computing Research, Sandia National Laboratories^∗, Albuquerque, New Mexico

Department of Computer Science, University of New Mexico, Albuquerque, New Mexico

Matthew G.F. Dosanjh, Center for Computing Research, Sandia National Laboratories, Albuquerque, NM; or Department of Computer Science, University of New Mexico, Albuquerque, NM.

Email: [email protected]

Ryan E. Grant

Center for Computing Research, Sandia National Laboratories^∗, Albuquerque, New Mexico

Department of Computer Science, University of New Mexico, Albuquerque, New Mexico

Whit Schonbein

Center for Computing Research, Sandia National Laboratories^∗, Albuquerque, New Mexico

Department of Computer Science, University of New Mexico, Albuquerque, New Mexico

Patrick G. Bridges

Department of Computer Science, University of New Mexico, Albuquerque, New Mexico

First published: 06 February 2019

https://doi.org/10.1002/cpe.5158

Citations: 5

^∗Sandia National Laboratories is a multimission laboratory managed and operated by the National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Summary

As we approach exascale, computational parallelism will have to increase drastically to meet throughput targets. Many-core architectures have exacerbated this problem by trading clock speed, core complexity, and computation throughput for increased parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread-level parallelism, and it must avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's thread-safe mode. Although there has been work to optimize it, it remains non-performant in most implementations. Current applications avoid MPI multithreading because of performance concerns, but future applications are expected to use it. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.

Special Issue on Graph Computing (GRAPH 2017) and Special Issue of the Workshop on Exascale MPI (EXAMPI2017)

10 February 2020
