Optimizing point-to-point communication between adaptive MPI endpoints in shared memory
Sam White and Laxmikant V. Kale
Department of Computer Science, University of Illinois at Urbana-Champaign, IL 61801-2302, USA

Correspondence: Sam White, Department of Computer Science, University of Illinois at Urbana-Champaign, IL 61801-2302, USA. Email: [email protected]

Summary
Adaptive MPI (AMPI) is an implementation of the MPI standard that virtualizes ranks as user-level threads rather than OS processes. In this work, we optimize AMPI's communication performance based on the locality of the communicating endpoints within a cluster of SMP nodes. We differentiate between point-to-point messages whose endpoints are co-located on the same execution unit and those whose endpoints reside in the same process but on different execution units. We demonstrate how the messaging semantics of Charm++ both enable and hinder AMPI's implementation in different ways, and we motivate extensions to Charm++ that address these limitations. Using the OSU micro-benchmark suite, we show that our locality-aware design offers lower latency, higher bandwidth, and a reduced memory footprint for applications.
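To illustrate the kind of locality check this design implies, the following C++ sketch branches a send onto one of three delivery paths depending on whether the two endpoints share an execution unit, share only a process, or are remote. The Endpoint, Message, and deliver_* names are hypothetical stand-ins for AMPI's internal machinery; the sketch shows only the decision logic, under the assumption that each virtualized rank can be located by node, process, and execution unit, and is not AMPI's actual implementation.

// Hypothetical sketch of locality-aware delivery-path selection between
// virtualized ranks (user-level threads); not AMPI's actual internals.
#include <cstdio>
#include <cstddef>

struct Endpoint {
    int node;     // physical SMP node hosting the rank
    int process;  // OS process within that node
    int pe;       // execution unit (worker thread/core) within that process
};

struct Message {
    const void* payload;
    std::size_t size;
};

// Both endpoints run as user-level threads on the same execution unit:
// the message can be handed off by pointer, with no copy and no locking.
static void deliver_on_same_pe(const Message& m) {
    std::printf("same-PE handoff, %zu bytes\n", m.size);
}

// The endpoints share an address space but run on different execution units:
// enqueue a pointer on the peer worker's queue rather than copying the
// payload through a network or kernel-assisted path.
static void deliver_within_process(const Message& m) {
    std::printf("intra-process enqueue, %zu bytes\n", m.size);
}

// The endpoints live in different processes or on different nodes:
// fall back to the ordinary inter-process / network path.
static void deliver_remote(const Message& m) {
    std::printf("remote send, %zu bytes\n", m.size);
}

void send(const Endpoint& src, const Endpoint& dst, const Message& msg) {
    if (src.node == dst.node && src.process == dst.process) {
        if (src.pe == dst.pe) {
            deliver_on_same_pe(msg);        // co-located endpoints
        } else {
            deliver_within_process(msg);    // same process, different PE
        }
    } else {
        deliver_remote(msg);                // off-process or off-node
    }
}

int main() {
    char buf[64] = {};
    Message msg{buf, sizeof buf};
    send(Endpoint{0, 0, 0}, Endpoint{0, 0, 0}, msg);  // same execution unit
    send(Endpoint{0, 0, 0}, Endpoint{0, 0, 3}, msg);  // same process, other PE
    send(Endpoint{1, 2, 0}, Endpoint{0, 0, 1}, msg);  // off-node
    return 0;
}

The point of the two intra-process branches is that neither requires copying the payload: when both user-level threads share an execution unit a pointer handoff suffices, and when they merely share a process the payload is already addressable by the receiver, so only a queue operation is needed.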