Volume 27, Issue 5 pp. 1292-1309
Research Article

A survey of recent developments in parallel implementations of Gaussian elimination

Simplice Donfack, EECS Department, University of Tennessee, Knoxville, TN, USA

Jack Dongarra, EECS Department, University of Tennessee, Knoxville, TN, USA

Mathieu Faverge, IPB ENSEIRB-Matmeca, Inria Bordeaux Sud-Ouest, Bordeaux, France

Mark Gates, EECS Department, University of Tennessee, Knoxville, TN, USA

Jakub Kurzak (corresponding author), EECS Department, University of Tennessee, Knoxville, TN, USA. E-mail: [email protected]

Piotr Luszczek, EECS Department, University of Tennessee, Knoxville, TN, USA

Ichitaro Yamazaki, EECS Department, University of Tennessee, Knoxville, TN, USA
First published: 02 June 2014
Citations: 22

Summary

Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In recent years, the algorithm has received considerable attention in efforts to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared-memory architectures. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented, and performance and numerical accuracy are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
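To make the surveyed ideas concrete, the following is a minimal, illustrative sketch (not the authors' tiled, runtime-scheduled implementation) of Gaussian elimination with partial pivoting on a small dense system, followed by one step of iterative refinement of the kind the summary describes. Function names and the example matrix are invented for illustration.

```python
def lu_partial_pivot(A):
    """In-place-style LU factorization with partial pivoting.
    Returns (LU, piv), where LU packs the L and U factors and
    piv records the row permutation."""
    n = len(A)
    LU = [row[:] for row in A]
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            piv[k], piv[p] = piv[p], piv[k]
        # Eliminate entries below the pivot.
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, piv

def lu_solve(LU, piv, b):
    """Solve A x = b given the pivoted LU factors of A."""
    n = len(LU)
    x = [b[p] for p in piv]                 # apply permutation: P b
    for i in range(n):                      # forward solve  L y = P b
        x[i] -= sum(LU[i][j] * x[j] for j in range(i))
    for i in reversed(range(n)):            # backward solve U x = y
        x[i] = (x[i] - sum(LU[i][j] * x[j]
                           for j in range(i + 1, n))) / LU[i][i]
    return x

def refine(A, LU, piv, b, x):
    """One step of iterative refinement: r = b - A x, solve A d = r,
    and return the corrected solution x + d."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x))
         for row, bi in zip(A, b)]
    d = lu_solve(LU, piv, r)
    return [xi + di for xi, di in zip(x, d)]

A = [[2.0, 1.0, 1.0],
     [4.0, -6.0, 0.0],
     [-2.0, 7.0, 2.0]]
b = [5.0, -2.0, 9.0]
LU, piv = lu_partial_pivot(A)
x = lu_solve(LU, piv, b)
x = refine(A, LU, piv, b, x)
print(x)  # approximately [1.0, 1.0, 2.0]
```

The surveyed alternatives differ mainly in the pivot-selection loop: incremental and tournament pivoting restrict or restructure the column search to expose parallelism, while the Partial Random Butterfly Transformation preconditions the matrix so that the search (and its synchronization cost) can be dropped entirely, with iterative refinement compensating for any lost accuracy.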
