Assessing the performance portability of modern parallel programming models using TeaLeaf
Matthew Martineau
HPC Group, University of Bristol, Bristol, UK

Simon McIntosh-Smith
HPC Group, University of Bristol, Bristol, UK

Wayne Gaudin
UK Atomic Weapons Establishment (AWE), Aldermaston, UK

Correspondence
Matthew Martineau, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK.
Email: [email protected]

Summary
In this work, we evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation and belongs to the Mantevo Project. We find that the best performance is achieved with architecture-specific implementations, but that, in many cases, the performance-portable models solve the same problems within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this small performance penalty is acceptable for a problem domain, we believe that productivity and development complexity become the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.
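To make concrete what porting a kernel to these models involves, the sketch below contrasts a simplified five-point heat-diffusion update written with Kokkos against the same loop expressed with OpenMP 4.0 target directives. This is a minimal illustration under stated assumptions: the explicit update, the names (diffuse_kokkos, diffuse_omp4, u, u_new, alpha), and the flattened indexing are invented for exposition and are not taken from the TeaLeaf source, which solves the heat conduction equation implicitly with iterative sparse solvers.

#include <Kokkos_Core.hpp>

// Simplified five-point heat-diffusion update over the interior of an
// nx-by-ny grid, expressed as a Kokkos parallel_for. Illustrative only:
// the kernel and all names here are assumptions, not TeaLeaf code.
void diffuse_kokkos(int nx, int ny, double alpha,
                    Kokkos::View<const double*> u,
                    Kokkos::View<double*> u_new) {
  Kokkos::parallel_for("diffuse", (nx - 2) * (ny - 2),
                       KOKKOS_LAMBDA(const int k) {
    const int x = 1 + k % (nx - 2);   // interior column index
    const int y = 1 + k / (nx - 2);   // interior row index
    const int i = y * nx + x;         // flattened cell index
    u_new(i) = u(i) + alpha * (u(i - 1) + u(i + 1)
                             + u(i - nx) + u(i + nx) - 4.0 * u(i));
  });
}

// The same update expressed with OpenMP 4.0 device constructs.
void diffuse_omp4(int nx, int ny, double alpha,
                  const double *u, double *u_new) {
  #pragma omp target teams distribute parallel for \
      map(to: u[0:nx*ny]) map(tofrom: u_new[0:nx*ny])
  for (int k = 0; k < (nx - 2) * (ny - 2); ++k) {
    const int x = 1 + k % (nx - 2);
    const int y = 1 + k / (nx - 2);
    const int i = y * nx + x;
    u_new[i] = u[i] + alpha * (u[i - 1] + u[i + 1]
                             + u[i - nx] + u[i + nx] - 4.0 * u[i]);
  }
}

The loop body is identical in both versions; the models differ mainly in how the execution policy and host-device data movement are expressed, which is where most of the developer-facing complexity discussed above resides.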