Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. Copyright © 2014 John Wiley & Sons, Ltd.

References

1 Grcar JF. Mathematicians of Gaussian elimination. Notices of the AMS June/July 2011; 58(6): 782–792.
Google Scholar
2 Jaeger EF, Harvey RW, Berry LA, Myra JR, Dumont RJ, Philips CK, Smithe DN, Barrett RF, Batchelor DB, Bonoli PT, Carter MD, D'Azevedo EF, D'ippolito DA, Moore RD, Wright JC. Global-wave solutions with self-consistent velocity distributions in ion cyclotron heated plasmas. Nuclear Fusion 2006; 46(7): S397–S408.
10.1088/0029-5515/46/7/S02
CAS Web of Science® Google Scholar
3 Quaranta E, Drikakis D. Noise radiation from a ducted rotor in a swirling-translating flow. Journal of Fluid Mechanics 2009; 641: 463.
10.1017/S0022112009991972
Web of Science® Google Scholar
4 Bendali A, Boubendir Y, Fares M. A feti-like domain decomposition method for coupling finite elements and boundary elements in large-size problems of acoustic scattering. Computers & Structures 2007; 85(9): 526–535.
10.1016/j.compstruc.2006.08.029
Web of Science® Google Scholar
5 Zhang Y, Taylor M, Sarkar T, Moon H, Yuan M. Solving large complex problems using a higher-order basis: parallel in-core and out-of-core integral-equation solvers. IEEE Antennas and Propagation Magazine 2008; 50(4): 13–30.
10.1109/MAP.2008.4653660
PubMed Web of Science® Google Scholar
6 Barrett RF, Chan THF, D'Azevedo EF, Jaeger EF, Wong K, Wong RY. Complex version of high performance computing LINPACK benchmark (HPL). Concurrency and Computation: Practice and Experience April 10 2010; 22(5): 573–587.
10.1002/cpe.1476
Web of Science® Google Scholar
7 Edelman A. Large dense numerical linear algebra in 1993: the parallel computing influence. International Journal of High Performance Computing Applications 1993; 7(2): 113–128.
10.1177/109434209300700203
Web of Science® Google Scholar
8 Harrington R. Origin and development of the method of moments for field computation. IEEE Antennas and Propagation Magazine June 1990; 32: 31–35.
10.1109/74.80522
Google Scholar
9 Hess JL. Panel methods in computational fluid dynamics. Annual Reviews of Fluid Mechanics 1990; 22: 255–274.
10.1146/annurev.fl.22.010190.001351
Web of Science® Google Scholar
10 Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
Google Scholar
11 Agullo E, Hadri B, Ltaief H, Dongarrra J. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM: New York, NY, USA, 2009; 1–12.
10.1145/1654059.1654080
Google Scholar
12 Wilkinson JH. Error analysis of direct methods of matrix inversion. Journal of the ACM 1961; 8(3): 281–330.
10.1145/321075.321076
Web of Science® Google Scholar
13 Trefethen L, Schreiber R. Average case analysis of Gaussian elimination. SIAM Journal on Matrix Analysis and Applicatiions 1990; 11(3): 335–360.
10.1137/0611023
Web of Science® Google Scholar
14 Yeung MC, Chan TF. Probabilistic analysis of Gaussian elimination without pivoting. Technical Report CAM95-29, Department of Mathematics, University of California, Los Angeles, 1995.
Google Scholar
15 Dongarra J, Eijkhout V, Luszczek P. Recursive approach in sparse matrix LU factorization. Science Program January 2001; 9: 51–60.
10.1155/2001/569670
Google Scholar
16 Gustavson FG. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development 1997; 41(6): 737–756.
10.1147/rd.416.0737
Web of Science® Google Scholar
17 Dongarra J, Faverge M, Ltaief H, Luszczek P. Exploiting fine-grain parallelism in recursive LU factorization. Parco 2011 – International Conference on Parallel Computing, Ghent, Belgium, August 30-September 2 2011.
Google Scholar
18 Dongarra J, Faverge M, Ltaief H, Luszczek P. Exploiting fine-grain parallelism in recursive LU factorization. Advances in Parallel Computing, Special Issue 2012; 22: 429–436.
Web of Science® Google Scholar
19 Dongarra J, Faverge M, Ltaief H, Luszczek P. High performance matrix inversion based on LU factorization for multicore architectures. In Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, MTAGS '11. ACM: New York, NY, USA, 2011; 33–42.
10.1145/2132876.2132885
Google Scholar
20 Luszczek P, Dongarra J. Anatomy of a globally recursive embedded LINPACK benchmark. Proceedings of 2012 IEEE High Performance Extreme Computing Conference (HPEC 2012), Westin Hotel, Waltham, Massachusetts, September 10-12 2012. IEEE Catalog Number: CFP12HPE-CDR, ISBN: 978-1-4673-1574-6.
Google Scholar
21 Castaldo AM, Whaley RC. Scaling LAPACK panel operations using parallel cache assignment. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP'10. ACM: Bangalore, India, January 2010. (submitted to ACM TOMS).
10.1145/1693453.1693484
Google Scholar
22 Buttari A, Langou J, Kurzak J, Dongarra JJ. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel in Computing, Systems and Application 2009; 35: 38–53.
10.1016/j.parco.2008.10.002
Web of Science® Google Scholar
23 Joffrain T, Quintana-Orti ES, van de Geijn RA. Rapid development of high-performance out-of-core solvers. In Applied Parallel Computing. State of the Art in Scientific Computing, vol. 3732, J Dongarra, K Madsen, J Waśniewski (eds)., Lecture Notes in Computer Science. Springer: Berlin Heidelberg, 2006; 413–422.
10.1007/11558958_49
Web of Science® Google Scholar
24 Sorensen DC. Analysis of pairwise pivoting in Gaussian elimination. IEEE Transactions on Computers March 1985; C-34(3): 274–278.
10.1109/TC.1985.1676570
Web of Science® Google Scholar
25 Yip EL. FORTRAN subroutines for out-of-core solutions of large complex linear systems. Technical Report CR-159142, NASA, 1979.
Google Scholar
26 Agullo E, Augonnet C, Dongarra J, Faverge M, Langou J, Ltaief H, Tomov S. LU factorization for accelerator-based systems. Proceedings of the 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA'11), December 2010; 217–224. Best Paper award.
Google Scholar
27 Quintana-Ortí G, Quintana-Ortí ES, Geijn Robert AVD, Zee FGV, Chan E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software July 2009; 36: 14:1–14:26.
10.1145/1527286.1527288
Web of Science® Google Scholar
28 Demmel J, Grigori L, Hoemmen M, Langou J. Implementing communication-optimal parallel and sequential QR factorizations. Arxiv preprint arXiv:0809.2407 2008.
Google Scholar
29 Grigori L, Demmel JW, Xiang H. Communication avoiding Gaussian elimination. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008; 29.
10.1109/SC.2008.5214287
Google Scholar
30 Donfack S, Grigori L, Gupta AK. Adapting communication-avoiding LU and QR factorizations to multicore architectures. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 2010; 1–10.
10.1109/IPDPS.2010.5470348
Google Scholar
31 Parker DS. A randomizing butterfly transformation useful in block matrix computations. Technical Report CSD-950024, Computer Science Department, University of California, 1995.
Google Scholar
32 Baboulin M, Dongarra J, Herrmann J, Tomov S. Accelerating linear system solutions using randomization techniques. ACM Transactions on Mathematical Software 2013; 39: 8:1–8:13.
10.1145/2427023.2427025
Web of Science® Google Scholar
33 Khabou A, Demmel J, Grigori L, Gu M. Lu factorization with panel rank revealing pivoting and its communication avoiding version. Technical Report UCB/EECS-2012-15, EECS Department, University of California: Berkeley, 2012.
Google Scholar
34 Grigori L, Demmel JW, Xiang H. CALU: A communication optimal LU factorization algorithm. SIAM Journal on Matrix Analysis and Applications 2011; 32: 1317–1350.
10.1137/100788926
Web of Science® Google Scholar
35 Demmel JW. Applied Numerical Linear Algebra. SIAM, 1997.
10.1137/1.9781611971446
Google Scholar
36 Wilkinson JH. The Algebraic Eigenvalue Problem. Oxford University Press, 1965.
Google Scholar
37 Gustavson FG. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP WG 2.5 Working Conference on Software Architectures for Scientific Computing Applications. Kluwer Academic Publishers: Ottawa, Canada, October 2-4 2000.
Google Scholar
38 Gustavson FG, Karlsson L, Kågström B. Parallel and cache-efficient in-place matrix storage format conversion. ACM Transactions on Mathematical Software 2012; 38(3): article 17.
10.1145/2168773.2168775
Web of Science® Google Scholar
39 Buttari A, Langou J, Kurzak J, Dongarra JJ. Parallel tiled QR factorization for multicore architectures. Concurrency Computation: Practice and Experience 2008; 20(13): 1573–1590.
10.1002/cpe.1301
Web of Science® Google Scholar
40 Kurzak J, Ltaief H, Dongarra JJ, Badia RM. Scheduling dense linear algebra operations on multicore processors. Concurrency Computation: Practice and Experience 2009; 21(1): 15–44.
Google Scholar
41 Haidar A, Ltaief H, Dongarra J. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proceedings of SC '11. ACM: New York, NY, USA, 2011; 8:1–8:11.
10.1145/2063384.2063394
Google Scholar
42 Haidar A, Ltaief H, Luszczek P, Dongarra J. A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society: Shanghai, China, May 21-25 2012; 25–35.
10.1109/IPDPS.2012.13
Web of Science® Google Scholar
43 Ltaief H, Kurzak J, Dongarra J. Parallel band two-sided matrix bidiagonalization for multicore architectures. IEEE Transactions on Parallel and Distributed Systems April 2010; 21(4).
10.1109/TPDS.2009.79
Web of Science® Google Scholar
44 Luszczek P, Ltaief H, Dongarra J. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society: Anchorage, Alaska, USA, Unknown Month May 16; 944–955.
Google Scholar
45 Badia RM, Herrero JR, Labarta JS, Pérez JM, Quintana-Ortí ES, Quintana-Ortí G. Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency and Computation: Practice and Experience 2009; 21(18): 2438–2456.
10.1002/cpe.1463
Web of Science® Google Scholar
46 Augonnet C, Thibault S, Namyst R, Wacrenier P. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency Computation: Practice and Experience 2011; 23(2): 187–198.
10.1002/cpe.1631
Web of Science® Google Scholar
47 Chan E, Van Zee FG, Quintana-Orti ES, Quintana-Orti G, Van De Geijn R. Satisfying your dependencies with SuperMatrix. In 2007 IEEE International Conference on Cluster Computing. IEEE, 2007; 91–99.
10.1109/CLUSTR.2007.4629221
Web of Science® Google Scholar
48 YarKhan A, Kurzak J, Dongarra J. QUARK users’ guide: QUeueing And Runtime for Kernels. Technical Report ICL-UT-11-02, Innovative Computing Laboratory, University of Tennessee, 2011.
Google Scholar
49 Broquedis F, Clet-Ortega J, Moreaud S, Furmento N, Goglin B, Mercier G, Thibault S, Namyst R. hwloc: a generic framework for managing hardware affinities in HPC applications. In PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing. IEEE: Pisa, Italie, February 2010.
10.1109/PDP.2010.67
Web of Science® Google Scholar

Citing Literature

Volume27, Issue5

10 April 2015

Pages 1292-1309

A survey of recent developments in parallel implementations of Gaussian elimination

Summary

References

Citing Literature

References

Information

About Wiley Online Library

Help & Support

Opportunities

Connect with Wiley

A survey of recent developments in parallel implementations of Gaussian elimination

Summary

References

Citing Literature

References

Related

Information