Volume 29, Issue 15 e4190
SPECIAL ISSUE PAPER

Piecewise holistic autotuning of parallel programs with CERE

Mihail Popov

Corresponding Author

LI-PaRAD, University of Versailles, France

Correspondence

Mihail Popov, LI-PaRAD, Université de Versailles, 45, avenue des États Unis, Versailles France.

Email: [email protected]

Chadi Akel

Exascale Computing Research, France

Yohan Chatelain

LI-PaRAD, University of Versailles, France

William Jalby

LI-PaRAD, University of Versailles, France

Pablo de Oliveira Castro

LI-PaRAD, University of Versailles, France

First published: 20 June 2017
Citations: 5

Summary

Current architecture complexity requires fine-tuning of compiler and runtime parameters to achieve the best performance. Autotuning substantially improves on default parameters in many scenarios, but it is a costly process requiring long iterative evaluations. We propose an automatic piecewise autotuner based on CERE (Codelet Extractor and REplayer). CERE decomposes applications into small pieces called codelets: each codelet maps to a loop or to an OpenMP parallel region and can be replayed as a standalone program. Codelet autotuning achieves better speedups at a lower tuning cost. By grouping codelet invocations with the same performance behavior, CERE reduces the number of loops or OpenMP regions to be evaluated. Moreover, unlike whole-program tuning, CERE customizes the set of best parameters for each specific OpenMP region or loop. We demonstrate the CERE tuning of compiler optimizations, number of threads, thread affinity, and scheduling policy on both nonuniform memory access and heterogeneous architectures. Over the NAS benchmarks, we achieve an average speedup of 1.08× after tuning. Tuning a codelet is 13× cheaper than whole-program evaluation and predicts the tuning impact with 94.7% accuracy. Similarly, exploring thread configurations and scheduling policies for a Black-Scholes solver on a heterogeneous big.LITTLE architecture is over 40× faster using CERE.
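
The contrast between whole-program and piecewise tuning described above can be illustrated with a minimal sketch. It is not CERE's implementation or interface: the codelet names, configuration labels, and timings below are hypothetical values invented for illustration, and the helper functions are assumptions made for this example. The sketch only shows why selecting the best parameters per region can never be slower than the best single configuration applied to the whole program, given the same measurements.

```python
# Hypothetical replay timings in seconds, indexed as timings[codelet][config].
# Every name and number here is made up for illustration; none is CERE output.
timings = {
    "loop_a":   {"threads=4,static": 2.0, "threads=8,static": 1.5, "threads=8,dynamic": 1.8},
    "region_b": {"threads=4,static": 1.0, "threads=8,static": 1.4, "threads=8,dynamic": 1.2},
    "loop_c":   {"threads=4,static": 3.0, "threads=8,static": 2.5, "threads=8,dynamic": 2.0},
}


def best_per_codelet(timings):
    """Piecewise tuning: pick the fastest configuration for each codelet."""
    return {codelet: min(times, key=times.get) for codelet, times in timings.items()}


def best_whole_program(timings):
    """Whole-program tuning: pick the single configuration with the lowest total time."""
    configs = next(iter(timings.values())).keys()
    totals = {cfg: sum(times[cfg] for times in timings.values()) for cfg in configs}
    return min(totals, key=totals.get), totals


piecewise = best_per_codelet(timings)
piecewise_time = sum(timings[codelet][cfg] for codelet, cfg in piecewise.items())
global_cfg, totals = best_whole_program(timings)

print("per-codelet choice:", piecewise)
print("piecewise total:", piecewise_time)
print("best single configuration:", global_cfg, totals[global_cfg])
```

In this toy data set each region prefers a different configuration, so the per-codelet selection beats every single global choice. The sketch leaves out the cost side of the argument: it assumes each timing comes from replaying one representative invocation of a codelet rather than from running the full application.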
