Current architecture complexity requires fine tuning of compiler and runtime parameters to achieve best performance. Autotuning substantially improves default parameters in many scenarios, but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: Each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate the CERE tuning of compiler optimizations, number of threads, thread affinity, and scheduling policy on both nonuniform memory access and heterogeneous architectures. Over the NAS benchmarks, we achieve an average speedup of 1.08× after tuning. Tuning a codelet is 13× cheaper than whole-program evaluation and predicts the tuning impact with a 94.7% accuracy. Similarly, exploring thread configurations and scheduling policies for a Black-Scholes solver on an heterogeneous big.LITTLE architecture is over 40× faster using CERE.

REFERENCES

1Popov M, Akel C, Jalby W, Castro PO. Piecewise holistic autotuning of compiler and runtime parameters. In: C Kaklamanis, TS Papatheodorou, PG Spirakis, eds. Euro-Par 2016 Parallel Processing - 22nd International Conference, Lecture Notes in Computer Science, vol. 9833, Grenoble, France: Springer; 2016: 238-250.
10.1007/978-3-319-43659-3_18
Google Scholar
2Lattner C, Adve V. Llvm: A compilation framework for lifelong program analysis & transformation. Code Generation and Optimization, 2004. CGO 2004. International Symposium on: IEEE, Palo Alto, USA; 2004: 75-86.
10.1109/CGO.2004.1281665
Google Scholar
3Kisuki T, Knijnenburg P, OBoyle M, Wijshoff H. Iterative compilation in program optimization. In: Proc. CPC10 (Compilers for Parallel Computers), Vienna, Austria; 2000: 35-44.
Google Scholar
4Mazouz A, Barthou D, et al. Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of spec omp applications on intel architectures. High Performance Computing and Simulation (HPCS), 2011 International Conference on: IEEE, Istanbul, Turkey; 2011: 273-279.
10.1109/HPCSim.2011.5999834
Google Scholar
5Rountree B, Lownenthal DK, De Supinski BR, Schulz M, Freeh VW, Bletsch T. Adagio: making dvs practical for complex hpc applications. Proceedings of the 23rd International Conference on Supercomputing: ACM, Yorktown Heights, NY, USA; 2009: 460-469.
10.1145/1542275.1542340
Web of Science® Google Scholar
6Triantafyllis S, Vachharajani M, Vachharajani N, August DI. Compiler optimization-space exploration. Code Generation and Optimization, 2003. CGO 2003. International Symposium on: IEEE, San Francisco, USA; 2003: 204-215.
10.1109/CGO.2003.1191546
Google Scholar
7Ladd SR. Acovea: Analysis of compiler options via evolutionary algorithm; 2007.
Google Scholar
8Cooper KD, Schielke PJ, Subramanian D. Optimizing for reduced code space using genetic algorithms. ACM SIGPLAN Notices, Vol. 34: ACM, New York, NY, USA; 1999: 1-9.
10.1145/314403.314414
Google Scholar
9Hoste K, Eeckhout L. Cole: compiler optimization level exploration. Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization: ACM, Boston, MA, USA; 2008: 165-174.
10.1145/1356058.1356080
Web of Science® Google Scholar
10de Oliveira Castro P, Petit E, Farjallah A, Jalby W. Adaptive sampling for performance characterization of application kernels. Concurrency Comput Pract Experience. 2013; 25(17): 2345-2362.
10.1002/cpe.3097
Web of Science® Google Scholar
11Fursin G, et al. Milepost gcc: Machine learning enabled self-tuning compiler. Int J Parallel Program. 2011; 39(3): 296-327.
10.1007/s10766-010-0161-2
Web of Science® Google Scholar
12de Oliveira Castro P, Akel C, Petit E, Popov M, Jalby W. Cere: Llvm-based codelet extractor and replayer for piecewise benchmarking and optimization. ACM Trans Archit Code Optim (TACO). 2015; 12(1): 6.
Web of Science® Google Scholar
13Popov M, Akel C, Conti F, Jalby W, de Oliveira Castro P. Pcere: Fine-grained parallel benchmark decomposition for scalability prediction. Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International: IEEE, Hyderabad International Convention Centre Hyderabad, India; 2015: 1151-1160.
10.1109/IPDPS.2015.19
Google Scholar
14Kessler RE, Hill MD, Wood DA. A comparison of trace-sampling techniques for multi-megabyte caches. IEEE Trans Comput. 1994; 43(6): 664-675.
10.1109/12.286300
Web of Science® Google Scholar
15Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344: John Wiley & Sons; 2009.
Google Scholar
16Agakov F, Bonilla E, Cavazos J, et al. Using machine learning to focus iterative optimization. Proceedings of the International Symposium on Code Generation and Optimization: IEEE Computer Society, New York, NY, USA; 2006: 295-305.
Google Scholar
17Charif-Rubial AS, Oseret E, Noudohouenou J, Jalby W, Lartigue G. Cqa: A code quality analyzer tool at binary level. High Performance Computing (HIPC), 2014 21st International Conference on: IEEE, Goa, India; 2014: 1-10.
10.1109/HiPC.2014.7116904
Google Scholar
18Treibig J, Hager G, Wellein G. Likwid: A lightweight performance-oriented tool suite for x86 multicore environments. Parallel Processing Workshops (ICPPW), 2010 39th International Conference on: IEEE, San Diego, CA; 2010: 207-216.
10.1109/ICPPW.2010.38
Google Scholar
19Ward JH. Hierarchical grouping to optimize an objective function. J Am Statistical Assoc. 1963; 58(301): 236-244.
10.1080/01621459.1963.10500845
Web of Science® Google Scholar
20Thorndike R. Who belongs in the family?. Psychometrika. 1953; 18(4): 267-276.
10.1007/BF02289263
Google Scholar
21de Oliveira Castro P, Kashnikov Y, Akel C, Popov M, Jalby W. Fine-grained benchmark subsetting for system selection. Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization: ACM, Orlando, FL, USA; 2014: 132.
10.1145/2581122.2544144
Google Scholar
22Chen Y, Huang Y, Eeckhout L, Fursin G, Peng L, Temam O, Wu C. Evaluating iterative optimization across 1000 data sets. Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation (PLDI'10). ACM: Toronto, Canada; 2010: 448-459.
10.1145/1806596.1806647
Google Scholar
23Curtis-Maury M, Blagojevic F, Antonopoulos CD, Nikolopoulos DS. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Trans Parallel Distrib Syst. 2008; 19(10): 1396-1410.
10.1109/TPDS.2007.70804
Web of Science® Google Scholar
24 Intel. Reference Guide for the Intel(R) C++ Compiler 15.0. https://software.intel.com/en-us/node/522691
Google Scholar
25Bailey D. H, et al. The NAS parallel benchmarks summary and preliminary results. Proceedings of the Conference on Supercomputing: ACM/IEEE, Albuquerque, NM, USA; 1991: 158-165.
Google Scholar
26Popov M. NAS 3.0 C OpenMP. http://benchmark-subsetting.github.io/cNPB
Google Scholar
27Baysal E. Reverse time migration. Geophys. 1983; 48(11): 1514.
10.1190/1.1441434
Web of Science® Google Scholar
28 ARM. Juno ARM Development Platform Datasheet. https://www.arm.com/fi les/pdf/Juno_ARM_Development_Platform_datasheet.pdf; 2014.
Google Scholar
29 ARM. Arm cortex-a57 mpcore processor technical reference manual revision r1p1; 2016.
Google Scholar
30Martins LG, Nobre R, Cardoso JM, Delbem AC, Marques E. Clustering-based selection for the exploration of compiler optimization sequences. ACM Trans Archit Code Optim (TACO). 2016; 13(1): 8.
Web of Science® Google Scholar
31Ashouri AH, Mariani G, Palermo G, Park E, Cavazos John, Silvano Cristina. Cobayn: Compiler autotuning framework using bayesian networks. ACM Trans Archit Code Optim (TACO). 2016; 13(2): 21.
Web of Science® Google Scholar
32Kulkarni PA, Whalley DB, Tyson GS, Davidson JW. In search of near-optimal optimization phase orderings. ACM SIGPLAN Notices, Vol. 41: ACM, New York, NY, USA; 2006: 83-92.
10.1145/1134650.1134663
Google Scholar
33Cooper KD, Grosul A, Harvey TJ, Reeves S, Subramanian D, Torczon L, Waterman T. Acme: adaptive compilation made efficient. ACM SIGPLAN Notices, Vol. 40: ACM, New York, NY, USA; 2005: 69-77.
10.1145/1065910.1065921
Google Scholar
34Purini S, Jain L. Finding good optimization sequences covering program space. ACM Trans Archit Code Optim (TACO). 2013; 9(4): 56.
Web of Science® Google Scholar
35Eeckhout L, Sampson J, Calder B. Exploiting program microarchitecture independent characteristics and phase behavior for reduced benchmark suite simulation. Workload Characterization Symposium, 2005. Proceedings of the IEEE International: IEEE, Austin, Texas; 2005: 2-12.
10.1109/IISWC.2005.1525996
Google Scholar
36Sherwood T, Perelman E, Calder B. Basic block distribution analysis to find periodic behavior and simulation points in applications. Parallel Architectures and Compilation Techniques, 2001. Proceedings. 2001 International Conference on: IEEE, Barcelona, Spain; 2001: 3-14.
10.1109/PACT.2001.953283
Google Scholar
37Carlson TE, Heirman W, Van Craeynest K, Eeckhout L. Barrierpoint: Sampled simulation of multi-threaded applications. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA USA; 2014: 2-12.
Google Scholar
38Fursin G, Cohen A, OBoyle M, Temam O. Quick and practical run-time evaluation of multiple program optimizations. Transactions on High-Performance Embedded Architectures and Compilers I: Springer, Ghent, Belgium; 2007: 34-53.
10.1007/978-3-540-71528-3_4
Google Scholar
39Kulkarni PA, Jantz MR, Whalley DB. Improving both the performance benefits and speed of optimization phase sequence searches. ACM SIGPLAN Notices, Vol. 45: ACM, New York, NY, USA; 2010: 95-104.
10.1145/1755888.1755903
Google Scholar
40Liao C, Quinlan DJ, Vuduc R, Panas T. Effective source-to-source outlining to support whole program empirical optimization. Languages and Compilers for Parallel Computing: Springer, Houston, TX, USA; 2010: 308-322.
10.1007/978-3-642-13374-9_21
Google Scholar
41Akel C, Kashnikov Y, de Oliveira Castro P, Jalby W. Is Source-code Isolation Viable for Performance Characterization?. International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI): IEEE Computer Society, Lyon, France; 2013: 977-984.
10.1109/ICPP.2013.116
Google Scholar

Citing Literature

Volume29, Issue15

Combined Special Issues on Euro‐Par 2016 and the Workshop on the seventh international workshop on programming models and applications for multicores and manycores (PMAM 2016)

10 August 2017

e4190

Piecewise holistic autotuning of parallel programs with CERE

Summary

REFERENCES

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

Piecewise holistic autotuning of parallel programs with CERE

Summary

REFERENCES

Citing Literature

References

Related

Information