The increasing programmability and the high computational power of graphics processing units make them attractive to general purpose programming. However, taking full benefit of this execution environment is a challenging task. One of these challenges stems from divergences, a phenomenon that occurs when threads that execute in lock-step are forced to take different program paths because of branches in the code. In face of divergences, some threads will have to wait, idly, while their diverging siblings execute. Optimizing the code to avoid divergences is difficult because this task demands a deep understanding of programs that might be large and convoluted. To facilitate the detection of divergences, this paper introduces the divergence map, a data structure that indicates the location and the volume of divergences in a program. We build this map via dynamic profiling techniques, which we have implemented on top of an open source Parallel Thread Execution compiler. To illustrate the importance of the divergence map, we have used it to pinpoint the core regions that must be optimized in well-known public applications. By hand optimizing some applications, we have added 9–11% speedups onto kernels that have already gone through the sieve of many programmers. Copyright © 2012 John Wiley & Sons, Ltd.

REFERENCES

1 Garland M. Parallel computing experiences with CUDA. IEEE Micro 2008; 28: 13–27.
10.1109/MM.2008.57
Web of Science® Google Scholar
2 Nickolls J, Dally WJ. The GPU computing era. IEEE Micro 2010; 30: 56–69.
10.1109/MM.2010.41
Web of Science® Google Scholar
3 Nickolls J, Kirk D. Graphics and Computing GPUs. Computer Organization and Design, (Patterson and Hennessy), 4th ed., chapter A. Elsevier: Amsterdam, Netherlands, 2009. pages A.1–A.77.
Google Scholar
4 Ryoo S, Rodrigues CI, Baghsorkhi SS, Stone SS, Kirk DB, Hwu WW. Optimization principles and application performance evaluation of a multithreaded GPU using cuda. In PPoPP, ACM: New York, NY, US, 2008; 73–82.
10.1145/1345206.1345220
Web of Science® Google Scholar
5 Seiler L, Carmean D, Sprangle E, Forsyth T, Abrash M, Dubey P, Junkins S, Lake A, Sugerman J, Cavin R, Espasa R, Grochowski E, Juan T, Hanrahan P. Larrabee: a many-core x86 architecture for visual computing. ACM Transactions on Graphics 2008; 27(3): 1–15.
10.1145/1360612.1360617
Web of Science® Google Scholar
6 Saha B, Zhou X, Chen H, Gao Y, Yan S, Rajagopalan M, Fang J, Zhang P, Ronen R, Mendelson A. Programming model for a heterogeneous x86 platform. In PLDI. ACM: New York, NY, US, 2009. 431–440.
10.1145/1542476.1542525
Web of Science® Google Scholar
7 Baghsorkhi SS, Delahaye M, Patel SJ, Gropp WD, Hwu WW. An adaptive performance modeling tool for GPU architectures. In PPoPP. ACM: New York, NY, US, 2010; 105–114.
10.1145/1693453.1693470
Google Scholar
8 Harris M. The parallel prefix sum (scan) with CUDA. Technical Report Initial release on February 14, 2007, NVIDIA, 2008.
Google Scholar
9 Kerr A, Diamos GF, Yalamanchili S. A characterization and analysis of PTX kernels. In IISWC. IEEE: Los Alamitos, CA, USA, 2009; 3–12.
10.1109/IISWC.2009.5306801
Google Scholar
10 Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW, Lee S-H, Skadron K. Rodinia: a benchmark suite for heterogeneous computing. In IISWC. IEEE: Los Alamitos, CA, USA, 2009; 44–54.
10.1109/IISWC.2009.5306797
Web of Science® Google Scholar
11 Cederman D, Tsigas P. GPU-quicksort: a practical quicksort algorithm for graphics processors. Journal of Experimental Algorithmics 2009; 14(1): 4–24.
Google Scholar
12 Coutinho B, Sampaio D, Pereira FMQ, Wagner M, Jr. Performance debugging of GPGPU applications with the divergence map. In SBAC-PAD. IEEE: Los Alamitos, CA, USA, 2010; 33–40.
10.1109/SBAC-PAD.2010.38
Web of Science® Google Scholar
13 Choi JW, Singh A, Vuduc RW. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In PPoPP. ACM: New York, NY, US, 2010; 115–126.
10.1145/1693453.1693471
Google Scholar
14 Sandes EFO, de Melo ACMA. Cudalign: using GPU to accelerate the comparison of megabase genomic sequences. In PPoPP. ACM: New York, NY, US, 2010; 137–146.
10.1145/1693453.1693473
Google Scholar
15 Lee S, Min S-J, Eigenmann R. Openmp to GPGPU: a compiler framework for automatic translation and optimization. In PPoPP. ACM: New York, NY, US, 2009; 101–110.
10.1145/1594835.1504194
Google Scholar
16 Fung WWL, Sham I, Yuan G, Aamodt TM. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO. IEEE: Los Alamitos, CA, USA, 2007; 407–420.
10.1109/MICRO.2007.30
Web of Science® Google Scholar
17 Graham SL, Kessler PB, McKusick MK. GPROF: a call graph execution profiler (with retrospective). In Best of PLDI. ACM: New York, NY, US, 1982; 49–57.
10.1145/800230.806987
Web of Science® Google Scholar
18 Chang PP, Mahlke SA, Hwu WW. Using profile information to assist classic code optimizations. Software Practice and Experience 1991; 21(12): 1301–1321.
10.1002/spe.4380211204
Web of Science® Google Scholar
19 Nethercote N, Seward J. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI. ACM, 2007; 89–100.
10.1145/1250734.1250746
Google Scholar
20 Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B. Simics: a full system simulation platform. Computer 2002; 35: 50–58.
10.1109/2.982916
Web of Science® Google Scholar
21 Srivastava A, Eustace A. Atom: a system for building customized program analysis tools. In PLDI. ACM, New York, NY, US, 1994; 196–205.
10.1145/178243.178260
Web of Science® Google Scholar
22 Eyerman S, Eeckhout L. Per-thread cycle accounting in SMT processors. ASPLOS 2009; 44(3): 133–144.
10.1145/1508244.1508260
Web of Science® Google Scholar
23 Ji M, Felten EW, Li K. Performance measurements for multithreaded programs. In SIGMETRICS. ACM: New York, NY, US, 1998; 161–170.
10.1145/277851.277900
Web of Science® Google Scholar
24 Miller BP, Callaghan MD, Cargille JM, Hollingsworth JK, Irvin RB, Karavanic KL, Kunchithapadam K, Newhall T. The paradyn parallel performance measurement tool. Computer 1995; 28: 37–46.
10.1109/2.471178
Web of Science® Google Scholar
25 Tallent NR, Mellor-Crummey JM. Effective performance measurement and analysis of multithreaded applications. PPOPP 2009; 44(4): 229–240.
Web of Science® Google Scholar
26 Xu Z, Miller BP, Naim O. Dynamic instrumentation of threaded applications. In PPoPP. ACM: New York, NY, US, 1999; 49–59.
Google Scholar
27 Boyer M, Skadron K, Weimer W. Automated dynamic analysis of CUDA program. In Third Workshop on Software Tools for MultiCore Systems, 2008.
Google Scholar
28 Coutinho B, Sampaio D, Pereira FMQ, Wagner M Jr. Divergence analysis and optimizations. In PACT. IEEE: Los Alamitos, CA, USA, 2011; 320–329.
10.1109/PACT.2011.63
Google Scholar
29 Karrenberg R, Hack S. Whole-function vectorization. In CGO. IEEE: Los Alamitos, CA, USA, 2011; 141–150.
10.1109/CGO.2011.5764682
Web of Science® Google Scholar
30 Stratton JA, Grover V, Marathe J, Aarts B, Murphy M, Hu Z, Hwu WW. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In CGO. IEEE: Los Alamitos, CA, USA, 2010; 111–119.
10.1145/1772954.1772971
Web of Science® Google Scholar
31 Batcher KE. Sorting networks and their applications. In AFIPS. ACM: New York, NY, US, 1968; 307–314.
10.1145/1468075.1468121
Google Scholar
32 Han TD, Abdelrahman TS. Reducing branch divergence in GPU programs. In GPGPU-4. ACM: New York, NY, US, 2011; 3:1–3:8.
10.1145/1964179.1964184
Google Scholar
33 Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147(1): 195–197.
10.1016/0022-2836(81)90087-5
CAS PubMed Web of Science® Google Scholar
34 Zhang EZ, Jiang Y, Guo Z, Tian K, Shen X. On-the-fly elimination of dynamic irregularities for GPU computing. In ASPLOS. ACM: New York, NY, US, 2011; 369–380.
10.1145/1950365.1950408
Google Scholar
35 Collange S, Defour D, Zhang Y. Dynamic detection of uniform and affine vectors in GPGPU computations. In HPPC. Springer: Heidelberg, 2009; 46–55.
Google Scholar
36 Yu Y, Acton ST. Speckle reducing anisotropic diffusion. IEEE Transactions on Image Processing 2002; 11(11): 1260–1270.
10.1109/TIP.2002.804276
PubMed Web of Science® Google Scholar
37 Mytkowicz T, Diwan A, Hauswirth M, Sweeney PF. Evaluating the accuracy of java profilers. In PLDI. ACM: New York, NY, US, 2010; 187–197.
10.1145/1806596.1806618
Google Scholar
38 Jain R. The Art of Computer Systems Performance Analysis—Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley professional computing. John Wiley & Sons, Inc.: Hoboken, New Jersey, US, 1991.
Google Scholar

Citing Literature

Volume25, Issue6

Special Issue:Latest Trends in Computer Architectures and Parallel and Distributed Technologies

25 April 2013

Pages 775-789

Profiling divergences in GPU applications

SUMMARY

REFERENCES

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

Profiling divergences in GPU applications

SUMMARY

REFERENCES

Citing Literature

References

Related

Information