Profiling divergences in GPU applications
Bruno Coutinho
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorDiogo Sampaio
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorFernando M. Q. Pereira
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorCorresponding Author
Wagner Meira Jr.
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Correspondence to: Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Brazil.
E-mail: [email protected]
Search for more papers by this authorBruno Coutinho
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorDiogo Sampaio
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorFernando M. Q. Pereira
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Search for more papers by this authorCorresponding Author
Wagner Meira Jr.
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Correspondence to: Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Brazil.
E-mail: [email protected]
Search for more papers by this authorSUMMARY
The increasing programmability and the high computational power of graphics processing units make them attractive to general purpose programming. However, taking full benefit of this execution environment is a challenging task. One of these challenges stems from divergences, a phenomenon that occurs when threads that execute in lock-step are forced to take different program paths because of branches in the code. In face of divergences, some threads will have to wait, idly, while their diverging siblings execute. Optimizing the code to avoid divergences is difficult because this task demands a deep understanding of programs that might be large and convoluted. To facilitate the detection of divergences, this paper introduces the divergence map, a data structure that indicates the location and the volume of divergences in a program. We build this map via dynamic profiling techniques, which we have implemented on top of an open source Parallel Thread Execution compiler. To illustrate the importance of the divergence map, we have used it to pinpoint the core regions that must be optimized in well-known public applications. By hand optimizing some applications, we have added 9–11% speedups onto kernels that have already gone through the sieve of many programmers. Copyright © 2012 John Wiley & Sons, Ltd.
REFERENCES
- 1 Garland M. Parallel computing experiences with CUDA. IEEE Micro 2008; 28: 13–27.
- 2 Nickolls J, Dally WJ. The GPU computing era. IEEE Micro 2010; 30: 56–69.
- 3 Nickolls J, Kirk D. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy), 4th ed., chapter A. Elsevier: Amsterdam, Netherlands, 2009. pages A.1–A.77.
- 4 Ryoo S, Rodrigues CI, Baghsorkhi SS, Stone SS, Kirk DB, Hwu WW. Optimization principles and application performance evaluation of a multithreaded GPU using cuda. In PPoPP, ACM: New York, NY, US, 2008; 73–82.
- 5 Seiler L, Carmean D, Sprangle E, Forsyth T, Abrash M, Dubey P, Junkins S, Lake A, Sugerman J, Cavin R, Espasa R, Grochowski E, Juan T, Hanrahan P. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics 2008; 27(3): 1–15.
- 6 Saha B, Zhou X, Chen H, Gao Y, Yan S, Rajagopalan M, Fang J, Zhang P, Ronen R, Mendelson A. Programming model for a heterogeneous x86 platform. In PLDI. ACM: New York, NY, US, 2009. 431–440.
- 7
Baghsorkhi SS,
Delahaye M,
Patel SJ,
Gropp WD,
Hwu WW. An adaptive performance modeling tool for GPU architectures. In PPoPP. ACM: New York, NY, US, 2010; 105–114.
10.1145/1693453.1693470 Google Scholar
- 8 Harris M. The parallel prefix sum (scan) with CUDA. Technical Report Initial release on February 14, 2007, NVIDIA, 2008.
- 9
Kerr A,
Diamos GF,
Yalamanchili S. A characterization and analysis of PTX kernels. In IISWC. IEEE: Los Alamitos, CA, USA, 2009; 3–12.
10.1109/IISWC.2009.5306801 Google Scholar
- 10 Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW, Lee S-H, Skadron K. Rodinia: a benchmark suite for heterogeneous computing. In IISWC. IEEE: Los Alamitos, CA, USA, 2009; 44–54.
- 11 Cederman D, Tsigas P. GPU-quicksort: a practical quicksort algorithm for graphics processors. Journal of Experimental Algorithmics 2009; 14(1): 4–24.
- 12 Coutinho B, Sampaio D, Pereira FMQ, Wagner M, Jr. Performance debugging of GPGPU applications with the divergence map. In SBAC-PAD. IEEE: Los Alamitos, CA, USA, 2010; 33–40.
- 13
Choi JW,
Singh A,
Vuduc RW. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In PPoPP. ACM: New York, NY, US, 2010; 115–126.
10.1145/1693453.1693471 Google Scholar
- 14
Sandes EFO,
de Melo ACMA. Cudalign: using GPU to accelerate the comparison of megabase genomic sequences. In PPoPP. ACM: New York, NY, US, 2010; 137–146.
10.1145/1693453.1693473 Google Scholar
- 15
Lee S,
Min S-J,
Eigenmann R. Openmp to GPGPU: a compiler framework for automatic translation and optimization. In PPoPP. ACM: New York, NY, US, 2009; 101–110.
10.1145/1594835.1504194 Google Scholar
- 16 Fung WWL, Sham I, Yuan G, Aamodt TM. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO. IEEE: Los Alamitos, CA, USA, 2007; 407–420.
- 17 Graham SL, Kessler PB, McKusick MK. GPROF: a call graph execution profiler (with retrospective). In Best of PLDI. ACM: New York, NY, US, 1982; 49–57.
- 18 Chang PP, Mahlke SA, Hwu WW. Using profile information to assist classic code optimizations. Software Practice and Experience 1991; 21(12): 1301–1321.
- 19
Nethercote N,
Seward J. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI. ACM, 2007; 89–100.
10.1145/1250734.1250746 Google Scholar
- 20 Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B. Simics: a full system simulation platform. Computer 2002; 35: 50–58.
- 21 Srivastava A, Eustace A. Atom: a system for building customized program analysis tools. In PLDI. ACM, New York, NY, US, 1994; 196–205.
- 22 Eyerman S, Eeckhout L. Per-thread cycle accounting in SMT processors. ASPLOS 2009; 44(3): 133–144.
- 23 Ji M, Felten EW, Li K. Performance measurements for multithreaded programs. In SIGMETRICS. ACM: New York, NY, US, 1998; 161–170.
- 24 Miller BP, Callaghan MD, Cargille JM, Hollingsworth JK, Irvin RB, Karavanic KL, Kunchithapadam K, Newhall T. The paradyn parallel performance measurement tool. Computer 1995; 28: 37–46.
- 25 Tallent NR, Mellor-Crummey JM. Effective performance measurement and analysis of multithreaded applications. PPOPP 2009; 44(4): 229–240.
- 26 Xu Z, Miller BP, Naim O. Dynamic instrumentation of threaded applications. In PPoPP. ACM: New York, NY, US, 1999; 49–59.
- 27 Boyer M, Skadron K, Weimer W. Automated dynamic analysis of CUDA program. In Third Workshop on Software Tools for MultiCore Systems, 2008.
- 28
Coutinho B,
Sampaio D,
Pereira FMQ,
Wagner M Jr. Divergence analysis and optimizations. In PACT. IEEE: Los Alamitos, CA, USA, 2011; 320–329.
10.1109/PACT.2011.63 Google Scholar
- 29 Karrenberg R, Hack S. Whole-function vectorization. In CGO. IEEE: Los Alamitos, CA, USA, 2011; 141–150.
- 30 Stratton JA, Grover V, Marathe J, Aarts B, Murphy M, Hu Z, Hwu WW. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In CGO. IEEE: Los Alamitos, CA, USA, 2010; 111–119.
- 31
Batcher KE. Sorting networks and their applications. In AFIPS. ACM: New York, NY, US, 1968; 307–314.
10.1145/1468075.1468121 Google Scholar
- 32
Han TD,
Abdelrahman TS. Reducing branch divergence in GPU programs. In GPGPU-4. ACM: New York, NY, US, 2011; 3:1–3:8.
10.1145/1964179.1964184 Google Scholar
- 33 Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147(1): 195–197.
- 34
Zhang EZ,
Jiang Y,
Guo Z,
Tian K,
Shen X. On-the-fly elimination of dynamic irregularities for GPU computing. In ASPLOS. ACM: New York, NY, US, 2011; 369–380.
10.1145/1950365.1950408 Google Scholar
- 35 Collange S, Defour D, Zhang Y. Dynamic detection of uniform and affine vectors in GPGPU computations. In HPPC. Springer: Heidelberg, 2009; 46–55.
- 36 Yu Y, Acton ST. Speckle reducing anisotropic diffusion. IEEE Transactions on Image Processing 2002; 11(11): 1260–1270.
- 37
Mytkowicz T,
Diwan A,
Hauswirth M,
Sweeney PF. Evaluating the accuracy of java profilers. In PLDI. ACM: New York, NY, US, 2010; 187–197.
10.1145/1806596.1806618 Google Scholar
- 38 Jain R. The Art of Computer Systems Performance Analysis—Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley professional computing. John Wiley & Sons, Inc.: Hoboken, New Jersey, US, 1991.