Optimizing point-to-point communication between adaptive MPI endpoints in shared memory
Sam White and Laxmikant V. Kale
Department of Computer Science, University of Illinois at Urbana-Champaign, IL 61801-2302, USA

Correspondence: Sam White, Department of Computer Science, University of Illinois at Urbana-Champaign, IL 61801-2302, USA. Email: [email protected]

Summary
Adaptive MPI (AMPI) is an implementation of the MPI standard that virtualizes ranks as user-level threads rather than OS processes. In this work, we optimize AMPI's communication performance based on the locality of the communicating endpoints within a cluster of SMP nodes. We differentiate between point-to-point messages whose endpoints are co-located on the same execution unit and those whose endpoints reside in the same process but on different execution units. We demonstrate how the messaging semantics of Charm++ both enable and hinder AMPI's implementation in different ways, and we motivate extensions to Charm++ that address these limitations. Using the OSU micro-benchmark suite, we show that our locality-aware design offers lower latency, higher bandwidth, and a reduced memory footprint for applications.
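To illustrate the kind of locality check this design implies, the following C++ sketch branches a send onto one of three delivery paths depending on whether the two endpoints share an execution unit, share only a process, or are remote. The Endpoint, Message, and deliver_* names are hypothetical stand-ins for AMPI's internal machinery; the sketch shows only the decision logic, under the assumption that each virtualized rank can be located by node, process, and execution unit, and is not AMPI's actual implementation.

// Hypothetical sketch of locality-aware delivery-path selection between
// virtualized ranks (user-level threads); not AMPI's actual internals.
#include <cstdio>
#include <cstddef>

struct Endpoint {
    int node;     // physical SMP node hosting the rank
    int process;  // OS process within that node
    int pe;       // execution unit (worker thread/core) within that process
};

struct Message {
    const void* payload;
    std::size_t size;
};

// Both endpoints run as user-level threads on the same execution unit:
// the message can be handed off by pointer, with no copy and no locking.
static void deliver_on_same_pe(const Message& m) {
    std::printf("same-PE handoff, %zu bytes\n", m.size);
}

// The endpoints share an address space but run on different execution units:
// enqueue a pointer on the peer worker's queue rather than copying the
// payload through a network or kernel-assisted path.
static void deliver_within_process(const Message& m) {
    std::printf("intra-process enqueue, %zu bytes\n", m.size);
}

// The endpoints live in different processes or on different nodes:
// fall back to the ordinary inter-process / network path.
static void deliver_remote(const Message& m) {
    std::printf("remote send, %zu bytes\n", m.size);
}

void send(const Endpoint& src, const Endpoint& dst, const Message& msg) {
    if (src.node == dst.node && src.process == dst.process) {
        if (src.pe == dst.pe) {
            deliver_on_same_pe(msg);        // co-located endpoints
        } else {
            deliver_within_process(msg);    // same process, different PE
        }
    } else {
        deliver_remote(msg);                // off-process or off-node
    }
}

int main() {
    char buf[64] = {};
    Message msg{buf, sizeof buf};
    send(Endpoint{0, 0, 0}, Endpoint{0, 0, 0}, msg);  // same execution unit
    send(Endpoint{0, 0, 0}, Endpoint{0, 0, 3}, msg);  // same process, other PE
    send(Endpoint{1, 2, 0}, Endpoint{0, 0, 1}, msg);  // off-node
    return 0;
}

The point of the two intra-process branches is that neither requires copying the payload: when both user-level threads share an execution unit a pointer handoff suffices, and when they merely share a process the payload is already addressable by the receiver, so only a queue operation is needed.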