Assessing the performance portability of modern parallel programming models using TeaLeaf
Matthew Martineau
HPC Group, University of Bristol, Bristol, UK

Simon McIntosh-Smith
HPC Group, University of Bristol, Bristol, UK

Wayne Gaudin
UK Atomic Weapons Establishment (AWE), Aldermaston, UK

Correspondence
Matthew Martineau, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK.
Email: [email protected]

Summary
In this work, we evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation and belongs to the Mantevo Project. We find that the best performance is achieved with architecture-specific implementations, but that, in many cases, the performance-portable models solve the same problems within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this small performance penalty is acceptable for a problem domain, we believe that productivity and development complexity become the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.
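To make concrete what porting a kernel to these models involves, the sketch below contrasts a simplified five-point heat-diffusion update written with Kokkos against the same loop expressed with OpenMP 4.0 target directives. This is a minimal illustration under stated assumptions: the explicit update, the names (diffuse_kokkos, diffuse_omp4, u, u_new, alpha), and the flattened indexing are invented for exposition and are not taken from the TeaLeaf source, which solves the heat conduction equation implicitly with iterative sparse solvers.

#include <Kokkos_Core.hpp>

// Simplified five-point heat-diffusion update over the interior of an
// nx-by-ny grid, expressed as a Kokkos parallel_for. Illustrative only:
// the kernel and all names here are assumptions, not TeaLeaf code.
void diffuse_kokkos(int nx, int ny, double alpha,
                    Kokkos::View<const double*> u,
                    Kokkos::View<double*> u_new) {
  Kokkos::parallel_for("diffuse", (nx - 2) * (ny - 2),
                       KOKKOS_LAMBDA(const int k) {
    const int x = 1 + k % (nx - 2);   // interior column index
    const int y = 1 + k / (nx - 2);   // interior row index
    const int i = y * nx + x;         // flattened cell index
    u_new(i) = u(i) + alpha * (u(i - 1) + u(i + 1)
                             + u(i - nx) + u(i + nx) - 4.0 * u(i));
  });
}

// The same update expressed with OpenMP 4.0 device constructs.
void diffuse_omp4(int nx, int ny, double alpha,
                  const double *u, double *u_new) {
  #pragma omp target teams distribute parallel for \
      map(to: u[0:nx*ny]) map(tofrom: u_new[0:nx*ny])
  for (int k = 0; k < (nx - 2) * (ny - 2); ++k) {
    const int x = 1 + k % (nx - 2);
    const int y = 1 + k / (nx - 2);
    const int i = y * nx + x;
    u_new[i] = u[i] + alpha * (u[i - 1] + u[i + 1]
                             + u[i - nx] + u[i + nx] - 4.0 * u[i]);
  }
}

The loop body is identical in both versions; the models differ mainly in how the execution policy and host-device data movement are expressed, which is where most of the developer-facing complexity discussed above resides.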