Piecewise holistic autotuning of parallel programs with CERE
Mihail Popov, LI-PaRAD, University of Versailles, France
Pablo de Oliveira Castro, LI-PaRAD, University of Versailles, France

Correspondence: Mihail Popov, LI-PaRAD, Université de Versailles, 45, avenue des États Unis, Versailles, France. Email: [email protected]
Summary
Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve the best performance. Autotuning substantially improves default parameters in many scenarios, but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions that must be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate CERE tuning of compiler optimizations, number of threads, thread affinity, and scheduling policy on both nonuniform memory access and heterogeneous architectures. Over the NAS benchmarks, we achieve an average speedup of 1.08× after tuning. Tuning a codelet is 13× cheaper than whole-program evaluation and predicts the tuning impact with 94.7% accuracy. Similarly, exploring thread configurations and scheduling policies for a Black-Scholes solver on a heterogeneous big.LITTLE architecture is over 40× faster using CERE.
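To make the piecewise tuning idea concrete, the sketch below (which is not CERE's extraction and replay mechanism, only a hand-written analogue of the per-region search it automates) treats a single OpenMP parallel loop as a codelet and sweeps thread counts and scheduling policies for that region in isolation. The kernel, parameter ranges, and timing loop are illustrative assumptions.

```c
/*
 * Hedged sketch: one OpenMP region treated as a standalone "codelet",
 * with a driver that explores runtime parameters (thread count and
 * scheduling policy) for that region alone. CERE automates this kind
 * of piecewise search by extracting and replaying the region; here the
 * region is simply called in place.
 */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* The "codelet": one OpenMP parallel loop with a runtime-chosen schedule. */
static void codelet(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i] + c[i];
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    int thread_counts[] = { 1, 2, 4, 8 };
    omp_sched_t schedules[] = { omp_sched_static, omp_sched_dynamic,
                                omp_sched_guided };
    const char *sched_names[] = { "static", "dynamic", "guided" };

    /* Sweep the (threads x schedule) space for this single region;
       each configuration is timed independently. */
    for (size_t t = 0; t < sizeof thread_counts / sizeof *thread_counts; t++) {
        for (size_t s = 0; s < sizeof schedules / sizeof *schedules; s++) {
            omp_set_num_threads(thread_counts[t]);
            omp_set_schedule(schedules[s], 0); /* 0 = default chunk size */

            double start = omp_get_wtime();
            codelet(a, b, c, N);
            double elapsed = omp_get_wtime() - start;

            printf("threads=%d schedule=%-7s time=%.6fs\n",
                   thread_counts[t], sched_names[s], elapsed);
        }
    }

    free(a); free(b); free(c);
    return 0;
}
```

Compiled with `gcc -fopenmp`, each configuration reports its own timing, so the best (threads, schedule) pair can be chosen per region rather than once for the whole program.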