A survey of recent developments in parallel implementations of Gaussian elimination
Simplice Donfack
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorJack Dongarra
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorMathieu Faverge
IPB ENSEIRB-Matmeca, Inria Bordeaux Sud-Ouest, Bordeaux, France
Search for more papers by this authorMark Gates
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorCorresponding Author
Jakub Kurzak
EECS Department, University of Tennessee, Knoxville, TN, USA
Correspondence to: Jakub Kurzak, EECS Department, University of Tennessee, Knoxville, TN, USA.
E-mail: [email protected]
Search for more papers by this authorPiotr Luszczek
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorIchitaro Yamazaki
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorSimplice Donfack
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorJack Dongarra
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorMathieu Faverge
IPB ENSEIRB-Matmeca, Inria Bordeaux Sud-Ouest, Bordeaux, France
Search for more papers by this authorMark Gates
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorCorresponding Author
Jakub Kurzak
EECS Department, University of Tennessee, Knoxville, TN, USA
Correspondence to: Jakub Kurzak, EECS Department, University of Tennessee, Knoxville, TN, USA.
E-mail: [email protected]
Search for more papers by this authorPiotr Luszczek
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorIchitaro Yamazaki
EECS Department, University of Tennessee, Knoxville, TN, USA
Search for more papers by this authorSummary
Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
References
- 1 Grcar JF. Mathematicians of Gaussian elimination. Notices of the AMS June/July 2011; 58(6): 782–792.
- 2 Jaeger EF, Harvey RW, Berry LA, Myra JR, Dumont RJ, Philips CK, Smithe DN, Barrett RF, Batchelor DB, Bonoli PT, Carter MD, D'Azevedo EF, D'ippolito DA, Moore RD, Wright JC. Global-wave solutions with self-consistent velocity distributions in ion cyclotron heated plasmas. Nuclear Fusion 2006; 46(7): S397–S408.
- 3 Quaranta E, Drikakis D. Noise radiation from a ducted rotor in a swirling-translating flow. Journal of Fluid Mechanics 2009; 641: 463.
- 4 Bendali A, Boubendir Y, Fares M. A feti-like domain decomposition method for coupling finite elements and boundary elements in large-size problems of acoustic scattering. Computers & Structures 2007; 85(9): 526–535.
- 5 Zhang Y, Taylor M, Sarkar T, Moon H, Yuan M. Solving large complex problems using a higher-order basis: parallel in-core and out-of-core integral-equation solvers. IEEE Antennas and Propagation Magazine 2008; 50(4): 13–30.
- 6 Barrett RF, Chan THF, D'Azevedo EF, Jaeger EF, Wong K, Wong RY. Complex version of high performance computing LINPACK benchmark (HPL). Concurrency and Computation: Practice and Experience April 10 2010; 22(5): 573–587.
- 7 Edelman A. Large dense numerical linear algebra in 1993: the parallel computing influence. International Journal of High Performance Computing Applications 1993; 7(2): 113–128.
- 8
Harrington R. Origin and development of the method of moments for field computation. IEEE Antennas and Propagation Magazine June 1990; 32: 31–35.
10.1109/74.80522 Google Scholar
- 9 Hess JL. Panel methods in computational fluid dynamics. Annual Reviews of Fluid Mechanics 1990; 22: 255–274.
- 10 Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
- 11
Agullo E,
Hadri B,
Ltaief H,
Dongarrra J. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM: New York, NY, USA, 2009; 1–12.
10.1145/1654059.1654080 Google Scholar
- 12 Wilkinson JH. Error analysis of direct methods of matrix inversion. Journal of the ACM 1961; 8(3): 281–330.
- 13 Trefethen L, Schreiber R. Average case analysis of Gaussian elimination. SIAM Journal on Matrix Analysis and Applicatiions 1990; 11(3): 335–360.
- 14 Yeung MC, Chan TF. Probabilistic analysis of Gaussian elimination without pivoting. Technical Report CAM95-29, Department of Mathematics, University of California, Los Angeles, 1995.
- 15
Dongarra J,
Eijkhout V,
Luszczek P. Recursive approach in sparse matrix LU factorization. Science Program January 2001; 9: 51–60.
10.1155/2001/569670 Google Scholar
- 16 Gustavson FG. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 1997; 41(6): 737–756.
- 17 Dongarra J, Faverge M, Ltaief H, Luszczek P. Exploiting fine-grain parallelism in recursive LU factorization. Parco 2011 – International Conference on Parallel Computing, Ghent, Belgium, August 30-September 2 2011.
- 18 Dongarra J, Faverge M, Ltaief H, Luszczek P. Exploiting fine-grain parallelism in recursive LU factorization. Advances in Parallel Computing, Special Issue 2012; 22: 429–436.
- 19
Dongarra J,
Faverge M,
Ltaief H,
Luszczek P. High performance matrix inversion based on LU factorization for multicore architectures. In Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, MTAGS '11. ACM: New York, NY, USA, 2011; 33–42.
10.1145/2132876.2132885 Google Scholar
- 20 Luszczek P, Dongarra J. Anatomy of a globally recursive embedded LINPACK benchmark. Proceedings of 2012 IEEE High Performance Extreme Computing Conference (HPEC 2012), Westin Hotel, Waltham, Massachusetts, September 10-12 2012. IEEE Catalog Number: CFP12HPE-CDR, ISBN: 978-1-4673-1574-6.
- 21
Castaldo AM,
Whaley RC. Scaling LAPACK panel operations using parallel cache assignment. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP'10. ACM: Bangalore, India, January 2010. (submitted to ACM TOMS).
10.1145/1693453.1693484 Google Scholar
- 22 Buttari A, Langou J, Kurzak J, Dongarra JJ. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel in Computing, Systems and Application 2009; 35: 38–53.
- 23 Joffrain T, Quintana-Orti ES, van de Geijn RA. Rapid development of high-performance out-of-core solvers. In Applied Parallel Computing. State of the Art in Scientific Computing, vol. 3732, J Dongarra, K Madsen, J Waśniewski (eds)., Lecture Notes in Computer Science. Springer: Berlin Heidelberg, 2006; 413–422.
- 24 Sorensen DC. Analysis of pairwise pivoting in Gaussian elimination. IEEE Transactions on Computers March 1985; C-34(3): 274–278.
- 25 Yip EL. FORTRAN subroutines for out-of-core solutions of large complex linear systems. Technical Report CR-159142, NASA, 1979.
- 26 Agullo E, Augonnet C, Dongarra J, Faverge M, Langou J, Ltaief H, Tomov S. LU factorization for accelerator-based systems. Proceedings of the 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA'11), December 2010; 217–224. Best Paper award.
- 27 Quintana-Ortí G, Quintana-Ortí ES, Geijn Robert AVD, Zee FGV, Chan E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software July 2009; 36: 14:1–14:26.
- 28 Demmel J, Grigori L, Hoemmen M, Langou J. Implementing communication-optimal parallel and sequential QR factorizations. Arxiv preprint arXiv:0809.2407 2008.
- 29
Grigori L,
Demmel JW,
Xiang H. Communication avoiding Gaussian elimination. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008; 29.
10.1109/SC.2008.5214287 Google Scholar
- 30
Donfack S,
Grigori L,
Gupta AK. Adapting communication-avoiding LU and QR factorizations to multicore architectures. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 2010; 1–10.
10.1109/IPDPS.2010.5470348 Google Scholar
- 31 Parker DS. A randomizing butterfly transformation useful in block matrix computations. Technical Report CSD-950024, Computer Science Department, University of California, 1995.
- 32 Baboulin M, Dongarra J, Herrmann J, Tomov S. Accelerating linear system solutions using randomization techniques. ACM Transactions on Mathematical Software 2013; 39: 8:1–8:13.
- 33 Khabou A, Demmel J, Grigori L, Gu M. Lu factorization with panel rank revealing pivoting and its communication avoiding version. Technical Report UCB/EECS-2012-15, EECS Department, University of California: Berkeley, 2012.
- 34 Grigori L, Demmel JW, Xiang H. CALU: A communication optimal LU factorization algorithm. SIAM Journal on Matrix Analysis and Applications 2011; 32: 1317–1350.
- 35
Demmel JW. Applied Numerical Linear Algebra. SIAM, 1997.
10.1137/1.9781611971446 Google Scholar
- 36 Wilkinson JH. The Algebraic Eigenvalue Problem. Oxford University Press, 1965.
- 37 Gustavson FG. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP WG 2.5 Working Conference on Software Architectures for Scientific Computing Applications. Kluwer Academic Publishers: Ottawa, Canada, October 2-4 2000.
- 38 Gustavson FG, Karlsson L, Kågström B. Parallel and cache-efficient in-place matrix storage format conversion. ACM Transactions on Mathematical Software 2012; 38(3): article 17.
- 39 Buttari A, Langou J, Kurzak J, Dongarra JJ. Parallel tiled QR factorization for multicore architectures. Concurrency Computation: Practice and Experience 2008; 20(13): 1573–1590.
- 40 Kurzak J, Ltaief H, Dongarra JJ, Badia RM. Scheduling dense linear algebra operations on multicore processors. Concurrency Computation: Practice and Experience 2009; 21(1): 15–44.
- 41
Haidar A,
Ltaief H,
Dongarra J. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC '11. ACM: New York, NY, USA, 2011; 8:1–8:11.
10.1145/2063384.2063394 Google Scholar
- 42 Haidar A, Ltaief H, Luszczek P, Dongarra J. A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society: Shanghai, China, May 21-25 2012; 25–35.
- 43 Ltaief H, Kurzak J, Dongarra J. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems April 2010; 21(4).
- 44 Luszczek P, Ltaief H, Dongarra J. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society: Anchorage, Alaska, USA, Unknown Month May 16; 944–955.
- 45 Badia RM, Herrero JR, Labarta JS, Pérez JM, Quintana-Ortí ES, Quintana-Ortí G. Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency and Computation: Practice and Experience 2009; 21(18): 2438–2456.
- 46 Augonnet C, Thibault S, Namyst R, Wacrenier P. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency Computation: Practice and Experience 2011; 23(2): 187–198.
- 47 Chan E, Van Zee FG, Quintana-Orti ES, Quintana-Orti G, Van De Geijn R. Satisfying your dependencies with SuperMatrix. In 2007 IEEE International Conference on Cluster Computing. IEEE, 2007; 91–99.
- 48 YarKhan A, Kurzak J, Dongarra J. QUARK users’ guide: QUeueing And Runtime for Kernels. Technical Report ICL-UT-11-02, Innovative Computing Laboratory, University of Tennessee, 2011.
- 49 Broquedis F, Clet-Ortega J, Moreaud S, Furmento N, Goglin B, Mercier G, Thibault S, Namyst R. hwloc: a generic framework for managing hardware affinities in HPC applications. In PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing. IEEE: Pisa, Italie, February 2010.