Volume 2012, Issue 1, 872901
Research Article
Open Access

Parallel Rayleigh Quotient Optimization with FSAI-Based Preconditioning

Luca Bergamaschi (corresponding author), Angeles Martínez, and Giorgio Pini

Department of Mathematical Methods and Models for Scientific Applications, University of Padova, Via Trieste 63, 35121 Padova, Italy

First published: 29 April 2012
Academic Editor: Massimiliano Ferronato

Abstract

The present paper describes a parallel preconditioned algorithm for the solution of partial eigenvalue problems for large, sparse, symmetric matrices on parallel computers. Namely, we consider the Deflation-Accelerated Conjugate Gradient (DACG) algorithm accelerated by factorized-sparse-approximate-inverse- (FSAI-) type preconditioners. We present an enhanced parallel implementation of the FSAI preconditioner and make use of the recently developed Block FSAI-IC preconditioner, which combines the FSAI and the Block Jacobi-IC preconditioners. Results on matrices of large size arising from finite element discretization of geomechanical models reveal that DACG accelerated by this type of preconditioner is competitive with the publicly available parallel hypre package, especially in the computation of a few of the leftmost eigenpairs. The parallel DACG code accelerated by FSAI is written in MPI-Fortran 90 and exhibits good scalability up to one thousand processors.

1. Introduction

The computation by iterative methods of the s leftmost eigenpairs of the generalized eigenproblem
Au = λBu,  (1.1)
where A, B ∈ ℝ^{n×n} are large sparse symmetric positive definite (SPD) matrices, is an important and difficult task in many applications. It has become increasingly widespread owing to the development in the last twenty years of robust and computationally efficient schemes and corresponding software packages. Among the most well-known approaches for the important class of SPD matrices are the implicitly restarted Arnoldi method (equivalent to the Lanczos technique for this type of matrix) [1, 2], the Jacobi-Davidson (JD) algorithm [3], and schemes based on preconditioned conjugate gradient minimization of the Rayleigh quotient [4, 5].
The basic idea of the latter is to minimize the Rayleigh quotient
q(x) = (x^T A x)/(x^T B x)  (1.2)
in a subspace orthogonal to the previously computed eigenvectors via a preconditioned CG-like procedure. Among the different variants of this technique we chose the Deflation-Accelerated Conjugate Gradient (DACG) scheme [4, 6], which has been shown to be competitive with the Jacobi-Davidson method and with the PARPACK package [7]. As in any other approach, for our DACG method the choice of the preconditioning technique is a key factor to accelerate, and in some cases even to allow for, convergence. To accelerate DACG in a parallel environment we selected the Factorized Sparse Approximate Inverse (FSAI) preconditioner introduced in [8]. We have developed a parallel implementation of this algorithm which has displayed excellent performance in both the setup phase and the application phase within a Krylov subspace solver [9–11]. The effectiveness of the FSAI preconditioner in the acceleration of DACG is compared with that of the Block FSAI-IC preconditioner, recently developed in [12], which combines the FSAI and Block Jacobi-IC preconditioners and has obtained good results on a small number of processors for the solution of SPD linear systems and of large eigenproblems [13]. We used the resulting parallel codes to compute a few of the leftmost eigenpairs of a set of test matrices of large size arising from finite element discretization of geomechanical models. The reported results show that DACG preconditioned with either FSAI or BFSAI is a scalable and robust algorithm for the partial solution of SPD eigenproblems. The parallel performance of DACG is also compared with that of the publicly available parallel package hypre [14], which implements a number of preconditioners that can be used in combination with the Locally Optimal Block PCG (LOBPCG) iterative eigensolver [15]. The results presented in this paper show that the parallel DACG code accelerated by FSAI exhibits good scalability up to one thousand processors and displays performance comparable to hypre, especially when a small number of eigenpairs is sought.

The outline of the paper is as follows: in Section 2 we describe the DACG algorithm; in Sections 3 and 4 we recall the definition and properties of the FSAI and BFSAI preconditioners, respectively. Section 5 contains the numerical results obtained with the proposed algorithms in the eigensolution of very large SPD matrices with size up to almost 7 million unknowns and 3 × 10^8 nonzeros. A comparison with the hypre eigensolver code is also included. Section 6 ends the paper with some conclusions.

2. The DACG Iterative Eigensolver and Implementation

The DACG algorithm sequentially computes the eigenpairs, starting from the leftmost one (λ1, u1). To evaluate the jth eigenpair, j > 1, DACG minimizes the Rayleigh Quotient (RQ) in a subspace orthogonal to the j − 1 eigenvectors previously computed. More precisely, DACG minimizes the Rayleigh Quotient
q(z) = (z^T A z)/(z^T B z),  (2.1)
where
z = x − U_j (U_j^T B x),  U_j = [u_1, …, u_{j−1}].  (2.2)
The first eigenpair (λ1, u1) is obtained by minimization of (2.1) with z = x (U_1 = [ ]). Denoting by M the preconditioning matrix, that is, M ≈ A^{−1}, the s leftmost eigenpairs are computed by the conjugate gradient procedure [6] described in Algorithm 1.
    Algorithm 1: DACG Algorithm.
  • Choose tolerance tol; set U = [ ] (empty).
  • DO j = 1, s
  •   (1) Choose x_0 such that U^T B x_0 = 0; set k = 0, β_0 = 0;
  •   (2) γ = x_0^T A x_0, η = x_0^T B x_0, q_0 ≡ q(x_0) = γ/η;
  •   (3) REPEAT
  •     (3.1) g_k ≡ ∇q(x_k) = (2/η) r_k, with r_k = A x_k − q_k B x_k;
  •     (3.2) g̃_k = M g_k;
  •     (3.3) β_k = (g_k^T g̃_k)/(g_{k−1}^T g̃_{k−1}) if k > 0;
  •     (3.4) p̃_k = g̃_k + β_k p_{k−1} (p̃_0 = g̃_0);
  •     (3.5) p_k = p̃_k − U (U^T B p̃_k);
  •     (3.6) a = p_k^T B x_k, b = p_k^T B p_k, c = p_k^T A x_k, d = p_k^T A p_k;
  •     (3.7) x_{k+1} = x_k + α_k p_k, with
  •        α_k = (ηd − γb − √Δ)/(2(bc − ad)),
  •        Δ = (ηd − γb)^2 − 4(bc − ad)(γa − ηc);
  •     (3.8) γ = γ + 2 α_k c + α_k^2 d;
  •     (3.9) η = η + 2 α_k a + α_k^2 b;
  •     (3.10) q_{k+1} ≡ q(x_{k+1}) = γ/η;
  •     (3.11) k = k + 1;
  •     (3.12) normalize: x_k = x_k/√η, γ = γ/η, η = 1;
  •   UNTIL |q_{k+1} − q_k|/q_{k+1} < tol;
  •   (4) λ_j = q_{k+1}, u_j = x_{k+1}/√(x_{k+1}^T B x_{k+1}), U = [U, u_j].
  • END DO

The schemes relying on Rayleigh quotient optimization are quite attractive for parallel computations; however, preconditioning is an essential feature to ensure practical convergence. When seeking an eigenpair (λj, uj), it can be proved that the number of iterations is proportional to the square root of the condition number κ(Hj) of the Hessian of the Rayleigh quotient at the stationary point uj [4]. It turns out that Hj is similar to (A − λjI)M, which is not SPD. However, Hj operates on the space orthogonal to the previously computed eigenvectors, so that the only relevant eigenvalues are the positive ones. In the non-preconditioned case (i.e., M = I) we would have
κ(Hj) = (λ_n − λ_j)/(λ_{j+1} − λ_j),  (2.3)
while in the ideal case M ≡ A^{−1} we have
ξ_j = (λ_{j+1}/λ_n) · (λ_n − λ_j)/(λ_{j+1} − λ_j).  (2.4)
Therefore, even though A^{−1} is not the optimal preconditioner for A − λ_jI, if M is a good preconditioner of A, then the condition number κ(Hj) will approach ξ_j.
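To make the structure of Algorithm 1 concrete, the following is a minimal serial NumPy sketch of DACG-style Rayleigh quotient minimization with deflation. It is illustrative only: the Fletcher-Reeves-type β, the Jacobi preconditioner, and the dense 1D Laplacian test problem are assumptions made for the example and do not reproduce the authors' MPI-Fortran 90 implementation.

```python
import numpy as np

def dacg(A, B, M, s, tol=1e-10, maxit=10000):
    """Compute the s leftmost eigenpairs of A u = lambda B u (A, B SPD).
    M is a callable applying the preconditioner (an approximation of A^{-1})."""
    n = A.shape[0]
    rng = np.random.default_rng(0)
    U = np.zeros((n, 0))                        # B-orthonormal converged eigenvectors
    eigs = []
    for j in range(s):
        x = rng.standard_normal(n)
        x -= U @ (U.T @ (B @ x))                # deflate the initial guess: U^T B x = 0
        q = (x @ (A @ x)) / (x @ (B @ x))
        p = np.zeros(n)
        gtg_old = None
        for k in range(maxit):
            eta = x @ (B @ x)
            g = (2.0 / eta) * (A @ x - q * (B @ x))   # gradient of the Rayleigh quotient
            gt = M(g)                                 # preconditioned gradient
            gtg = g @ gt
            beta = 0.0 if gtg_old is None else gtg / gtg_old
            gtg_old = gtg
            p = gt + beta * p
            p -= U @ (U.T @ (B @ p))            # keep the search direction deflated
            # exact line search along p: minimize q(x + alpha * p)
            gamma = x @ (A @ x)
            a, b = p @ (B @ x), p @ (B @ p)
            c, d = p @ (A @ x), p @ (A @ p)
            delta = (eta * d - gamma * b) ** 2 - 4.0 * (b * c - a * d) * (gamma * a - eta * c)
            alpha = (eta * d - gamma * b - np.sqrt(max(delta, 0.0))) / (2.0 * (b * c - a * d))
            x = x + alpha * p
            q_new = (x @ (A @ x)) / (x @ (B @ x))
            converged = abs(q_new - q) / q_new < tol
            q = q_new
            if converged:
                break
        x /= np.sqrt(x @ (B @ x))               # B-normalize the converged eigenvector
        U = np.column_stack([U, x])
        eigs.append(q)
    return np.array(eigs), U

# quick check on a dense 1D Laplacian (eigenvalues 2 - 2*cos(k*pi/(n+1)))
n = 200
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
B = np.eye(n)
M = lambda v: 0.5 * v                           # Jacobi preconditioning as a crude stand-in for FSAI
lam, V = dacg(A, B, M, s=3)
print(lam)
```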

3. The FSAI Preconditioner

The FSAI preconditioner, initially proposed in [8, 16], was later developed and implemented in parallel by Bergamaschi and Martínez in [9]. Here we only briefly recall its main features. Given an SPD matrix A, the FSAI preconditioner approximately factorizes the inverse of A as a product of two sparse triangular matrices,
M = W^T W ≈ A^{−1},  (3.1)
where W is a sparse lower triangular matrix approximating L^{−1}, with L the exact lower Cholesky factor of A. The choice of nonzeros in W is based on a sparsity pattern which in our work may be the same as that of Ã^d, where Ã is the result of prefiltration [10] of A, that is, of dropping all elements of A below a threshold parameter δ. The entries of W are computed by minimizing the Frobenius norm of I − WL, without forming the matrix L explicitly. The computed W is then sparsified by dropping all elements below a second tolerance parameter ε. The final FSAI preconditioner therefore depends on three parameters: the prefiltration threshold δ; the power d of A generating the sparsity pattern (we allow d ∈ {1, 2, 4} in our experiments); and the postfiltration threshold ε.
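The row-by-row construction just described can be sketched in a few lines of serial SciPy code. The prefiltration and postfiltration criteria used below (dropping relative to the diagonal scaling of A and to the diagonal entry of each row of W, respectively) and the dense per-row solves are illustrative assumptions; they are not meant to reproduce the parallel Fortran 90 implementation discussed in the next subsection.

```python
import numpy as np
import scipy.sparse as sp

def fsai(A, delta=0.1, d=2, eps=0.1):
    """Sparse lower triangular W with W^T W approximating inv(A), cf. (3.1)."""
    Acsr = sp.csr_matrix(A)
    n = Acsr.shape[0]
    # prefiltration: drop entries of A that are small relative to the diagonal (assumed criterion)
    Dd = np.sqrt(Acsr.diagonal())
    Atil = Acsr.tocoo()
    keep = np.abs(Atil.data) >= delta * Dd[Atil.row] * Dd[Atil.col]
    # sparsity pattern of W: lower triangle of the pattern of Atil^d (diagonal always included)
    P = sp.csr_matrix((np.ones(int(keep.sum())), (Atil.row[keep], Atil.col[keep])), shape=(n, n))
    P = (P + sp.eye(n, format="csr")).sign()
    Pd = P.copy()
    for _ in range(d - 1):
        Pd = (Pd @ P).sign()
    Pd = sp.tril(Pd).tocsr()
    Adense = Acsr.toarray()                 # dense only for this small illustrative code
    rows, cols, vals = [], [], []
    for i in range(n):
        J = np.sort(Pd.indices[Pd.indptr[i]:Pd.indptr[i + 1]])
        m = int(np.searchsorted(J, i))
        y = np.linalg.solve(Adense[np.ix_(J, J)], np.eye(len(J))[m])
        w = y / np.sqrt(y[m])               # scaling gives diag(W A W^T) = 1
        big = np.abs(w) >= eps * np.abs(w[m])   # postfiltration (diagonal entry kept)
        rows += [i] * int(big.sum()); cols += list(J[big]); vals += list(w[big])
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

# usage: the preconditioner of (3.1) is applied as M v = W^T (W v)
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format="csr")
W = fsai(A, delta=0.0, d=2, eps=0.0)
M = lambda v: W.T @ (W @ v)
print(M(np.ones(100))[:5])
```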

3.1. Parallel Implementation of FSAI-DACG

We have developed a parallel code, written in Fortran 90, which exploits the MPI library for exchanging data among the processors. We use a block row distribution of all matrices (A, W, and W^T), that is, complete rows are assigned to different processors. All these matrices are stored in static data structures in CSR format.

Regarding the preconditioner computation, we stress that each row i of the matrix W of the FSAI preconditioner is computed independently of the others, by solving a small SPD dense linear system of size n_i, the number of nonzeros allowed in row i of W. Some of the rows of A which contribute to form this linear system may not be local to the processor owning row i and must be received from other processors. To this aim we implemented a routine called get_extra_rows which carries out all the row exchanges among the processors before the computation of W starts, so that the computation then proceeds entirely in parallel. Since the number of nonlocal rows needed by each processor is relatively small, we chose to temporarily replicate these rows in auxiliary data structures. Once W is obtained, a parallel transposition routine provides every processor with its part of W^T.

The DACG iterative solver is essentially based on scalar products and matrix-vector products. We make use of an optimized parallel matrix-vector product developed in [17], which has shown its effectiveness up to 1024 processors.
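The toy Python fragment below illustrates, in serial form, the bookkeeping that precedes the row exchange described above: under a block row distribution, a process owning rows [lo, hi) scans the pattern rows it has to build and lists the row indices it would have to receive from other processes. The helper name extra_rows_needed is hypothetical and only mimics the analysis behind the get_extra_rows routine; the actual code performs the exchange with MPI.

```python
import scipy.sparse as sp

def extra_rows_needed(pattern, lo, hi):
    """Row indices outside [lo, hi) referenced by the locally owned pattern rows."""
    P = sp.csr_matrix(pattern)
    needed = set()
    for i in range(lo, hi):
        J = P.indices[P.indptr[i]:P.indptr[i + 1]]
        needed.update(int(j) for j in J if j < lo or j >= hi)
    return sorted(needed)

# toy example: the lower triangular pattern of a 1D Laplacian split among 4 "processes"
n, p = 16, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
block = n // p
for rank in range(p):
    lo, hi = rank * block, (rank + 1) * block
    print(rank, extra_rows_needed(sp.tril(A), lo, hi))
```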

4. Block FSAI-IC Preconditioning

The Block FSAI-IC preconditioner, BFSAI-IC in the following, is a recent development for the parallel solution of symmetric positive definite (SPD) linear systems.

Let 𝒮_L and 𝒮_BD be a sparse lower triangular and a dense block diagonal nonzero pattern, respectively, for an n × n matrix. Even though not strictly necessary, for the sake of simplicity assume that 𝒮_BD consists of n_b diagonal blocks of equal size m = n/n_b, and let D ∈ ℝ^{n×n} be an arbitrary full-rank matrix with nonzero pattern 𝒮_BD.

Consider the set of lower block triangular matrices F with a prescribed nonzero pattern 𝒮_BL and minimize over this set the Frobenius norm
‖D − FL‖_F,  (4.1)
where L is the exact lower Cholesky factor of an SPD matrix A. A matrix F satisfying the minimality condition (4.1) for a given D is the lower block triangular factor of BFSAI-IC. Recalling the definition of the classical FSAI preconditioner, it can be noticed that BFSAI-IC is a block generalization of the FSAI concept.

The differentiation of (4.1) with respect to the unknown entries [F]_ij, (i, j) ∈ 𝒮_BL, yields n independent dense subsystems which, as in the standard FSAI case, do not require the explicit knowledge of L. The effect of applying F to A is to concentrate the largest entries of the preconditioned matrix FAF^T into n_b diagonal blocks. However, as D is arbitrary, it is still not ensured that FAF^T is better than A in an iterative method, so it is necessary to precondition FAF^T again. As FAF^T resembles a block diagonal matrix, an efficient technique relies on using a block diagonal matrix which collects an approximation of the inverse of each diagonal block of FAF^T.

It is easy to show that with SPD matrices F is guaranteed to exist and FAF^T is SPD, too [12]. Using an IC decomposition with partial fill-in for each diagonal block of FAF^T and collecting the lower IC factors in the block diagonal matrix J, the resulting preconditioned matrix reads
J^{−1} F A F^T J^{−T},  (4.2)
with the final preconditioner
M = F^T J^{−T} J^{−1} F.  (4.3)
M in (4.3) is the BFSAI-IC preconditioner of A.
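As a concrete illustration of (4.3), the dense sketch below applies a preconditioner of the form M = F^T J^{−T} J^{−1} F to a vector, with J collecting a lower triangular factor of each diagonal block of FAF^T. The exact Cholesky factors used here stand in for the IC factors with partial fill-in, and the degenerate choice F = I, which reduces M to the Block Jacobi-IC preconditioner mentioned above, is only meant to keep the example self-contained.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def build_block_factor(FAFt, nb):
    """Block diagonal lower factor J with J_i J_i^T equal to the i-th diagonal block of FAF^T."""
    n = FAFt.shape[0]
    m = n // nb
    J = np.zeros_like(FAFt)
    for i in range(nb):
        s = slice(i * m, (i + 1) * m)
        J[s, s] = cholesky(FAFt[s, s], lower=True)   # IC with partial fill-in in the real code
    return J

def apply_bfsai(F, J, r):
    """Return M r with M = F^T J^{-T} J^{-1} F, cf. (4.3)."""
    w = F @ r
    y = solve_triangular(J, w, lower=True)
    z = solve_triangular(J.T, y, lower=False)
    return F.T @ z

# degenerate check: with F = I the preconditioner reduces to Block Jacobi-IC
n, nb = 12, 3
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
F = np.eye(n)
J = build_block_factor(F @ A @ F.T, nb)
print(apply_bfsai(F, J, np.ones(n)))
```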

For its computation, BFSAI-IC needs the selection of n_b and of the pattern 𝒮_BL. The basic requirement for the number of blocks n_b is to be larger than or equal to the number of computing cores p. From a practical viewpoint, however, the most efficient choice in terms of both wall clock time and iteration count is to keep the blocks as large as possible, thus implying n_b = p. Hence, n_b is by default set equal to p. By contrast, the choice of 𝒮_BL is theoretically more challenging and still not completely clear. A widely accepted option for other approximate inverses, such as FSAI or SPAI, is to select the nonzero pattern of A^d for small values of d on the basis of the Neumann series expansion of A^{−1}. Using a similar approach, in the BFSAI construction we select 𝒮_BL as the lower block triangular pattern of A^d. As the nonzeros located in the diagonal blocks are not used for the computation of F, a larger value of d, say 3 or 4, can still be used.
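A possible SciPy rendering of this pattern choice is sketched below: it forms the nonzero pattern of A^d and keeps only the entries lying in blocks strictly below the n_b diagonal blocks. The function illustrates the selection rule only, under the assumption of equal block sizes, and does not reflect the data structures used in the BFSAI-IC code.

```python
import numpy as np
import scipy.sparse as sp

def lower_block_pattern(A, d, nb):
    """Nonzero pattern of A^d restricted to the blocks strictly below the diagonal blocks."""
    n = A.shape[0]
    m = n // nb                                  # equal block size assumed for simplicity
    P = abs(sp.csr_matrix(A)).sign()             # 0/1 pattern of A
    Pd = P.copy()
    for _ in range(d - 1):
        Pd = (Pd @ P).sign()                     # 0/1 pattern of A^k
    Pd = Pd.tocoo()
    keep = (Pd.row // m) > (Pd.col // m)         # block-row index strictly larger than block-column index
    return sp.coo_matrix((np.ones(int(keep.sum())),
                          (Pd.row[keep], Pd.col[keep])), shape=(n, n)).tocsr()

# example: pattern for d = 3 and nb = 4 blocks on a small symmetric test matrix
A = sp.random(24, 24, density=0.15, format="csr", random_state=0)
A = A + A.T + 24 * sp.eye(24)
S_BL = lower_block_pattern(A, d=3, nb=4)
print(S_BL.nnz)
```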

Though theoretically not necessary, three additional user-specified parameters are worth introducing in order to better control the memory occupation and the BFSAI-IC density:
  • (1)

    ε is a postfiltration parameter that allows for dropping the smallest entries of F. In particular, [F]_ij is neglected if |[F]_ij| < ε‖f_i‖_2, where f_i is the ith row of F;

  • (2)

    ρ_B is a parameter that controls the fill-in of the diagonal blocks of FAF^T and determines the maximum allowable number of nonzeros retained in each row of each block in addition to the corresponding entries of A. Quite obviously, only the ρ_B largest entries are retained;

  • (3)

    ρ_L is a parameter that controls the fill-in of each IC factor, denoting the maximum allowable number of nonzeros for each row of the factor in addition to the corresponding entries of the related diagonal block.

An OpenMP implementation of the algorithms above is available in [18].

5. Numerical Results

In this section we examine the performance of the parallel DACG preconditioned by both FSAI and BFSAI in the partial solution of four large-size sparse eigenproblems. The test cases, which we briefly describe below, are taken from different real engineering mechanical applications. In detail, they are as follows.
  • (i)

    FAULT-639 is obtained from a structural problem discretizing a faulted gas reservoir with tetrahedral finite elements and triangular interface elements [19]. The interface elements are used with a penalty formulation to simulate the behavior of the faults. The problem arises from a 3D discretization with three displacement unknowns associated with each node of the grid.

  • (ii)

    PO-878 arises in the simulation of the consolidation of a real gas reservoir of the Po Valley, Italy, used for underground gas storage purposes (for details, see [20]).

  • (iii)

    GEO-1438 is obtained from a geomechanical problem discretizing a region of the earth crust subject to underground deformation. The computational domain is a box with an areal extent of 50 × 50 km and a depth of 10 km, consisting of regularly shaped tetrahedral finite elements. The problem arises from a 3D discretization with three displacement unknowns associated with each node of the grid [21].

  • (iv)

    CUBE-6091 arises from the equilibrium of a concrete cube discretized by a regular unstructured tetrahedral grid.

Matrices FAULT-639 and GEO-1438 are publicly available in the University of Florida Sparse Matrix Collection at http://www.cise.ufl.edu/research/sparse/matrices/.

In Table 1 we report sizes and nonzeros of the four matrices together with three of the most significant eigenvalues for each problem.

Table 1. Size, number of nonzeros, and three representative eigenvalues of the test matrices.
Matrix Size Nonzeros λ1 λ10 λN
FAULT-639 638,802 28,614,564 6.99 · 10^6 1.73 · 10^7 2.52 · 10^16
PO-878 878,355 38,847,915 1.46 · 10^6 4.45 · 10^6 5.42 · 10^15
GEO-1438 1,437,960 63,156,690 7.81 · 10^5 1.32 · 10^6 1.11 · 10^13
CUBE-6091 6,091,008 270,800,586 1.82 · 10^1 3.84 · 10^2 1.05 · 10^7

The computational performance of FSAI is compared with that obtained using BFSAI as implemented in [12]. The comparison is carried out by evaluating the number of iterations n_iter needed to converge to the same tolerance, the wall clock times in seconds T_prec and T_iter for the preconditioner computation and for the eigensolver to converge, respectively, and the total time T_tot = T_prec + T_iter. All tests are performed on the IBM SP6/5376 cluster at the CINECA Centre for High Performance Computing, equipped with IBM Power6 processors at 4.7 GHz and featuring 168 nodes, 5376 computing cores, and 21 Tbyte of internal network RAM. The FSAI-DACG code is written in Fortran 90 and compiled with the -O4 -q64 -qarch=pwr6 -qtune=pwr6 -qnoipa -qstrict -bmaxdata:0x70000000 options. For the BFSAI-IC code only an OpenMP implementation is presently available.

To study parallel performance we use a strong scaling measure, observing how the CPU times vary with the number of processors for a fixed total problem size. Denote by T_p the total CPU elapsed time in seconds on p processors. We introduce a relative measure of the parallel efficiency achieved by the code, the pseudo speedup S_p^(p̄) computed with respect to the smallest number of processors p̄ used to solve a given problem, and the corresponding efficiency E_p^(p̄):
S_p^(p̄) = (p̄ T_p̄)/T_p,   E_p^(p̄) = S_p^(p̄)/p = (p̄ T_p̄)/(p T_p).  (5.1)
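As a small sanity check of (5.1), the snippet below recomputes the pseudo speedup and efficiency from two of the GEO-1438 total times reported in Table 2 (T_4 = 991.7 s and T_128 = 33.8 s); up to rounding of the published timings it reproduces the tabulated values S_p ≈ 117 and E_p ≈ 0.92.

```python
def pseudo_speedup(T_ref, p_ref, T_p, p):
    """Pseudo speedup and efficiency of (5.1), relative to the smallest processor count p_ref."""
    S = p_ref * T_ref / T_p
    return S, S / p

print(pseudo_speedup(991.7, 4, 33.8, 128))   # ~ (117.4, 0.92)
```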

5.1. FSAI-DACG Results

In this section we report the results of our FSAI-DACG implementation in the computation of the 10 leftmost eigenpairs of the 4 test problems. We used the exit test described in the DACG algorithm (see Algorithm 1) with tol = 10^{-10}. The results are summarized in Table 2. As FSAI parameters we chose δ = 0.1, d = 4, and ε = 0.1 for all the test matrices. This combination of parameters produces, on average, the best (or close to the best) performance of the iterative procedure. Note that, for a fixed problem, the number of iterations does not change with the number of processors. The scalability of the code is very satisfactory in both the setup stage (preconditioner computation) and the iterative phase.

Table 2. Number of iterations, timings, and scalability indices for FSAI-DACG in the computation of the 10 leftmost eigenpairs of the four test problems.
p iter T_prec T_iter T_tot S_p E_p
FAULT-639 4 4448 25.9 261.4 287.3
8 4448 13.2 139.0 152.2 7.6 0.94
16 4448 6.6 69.4 76.0 15.1 0.95
32 4448 4.0 28.2 32.2 35.7 1.11
64 4448 1.9 15.5 17.4 66.1 1.03
128 4448 1.1 9.4 10.5 109.0 0.85
  
PO-878 4 5876 48.1 722.5 770.6
8 5876 25.2 399.8 425.0 7.3 0.91
16 5876 11.4 130.2 141.6 21.8 1.36
32 5876 6.8 65.8 72.5 42.5 1.33
64 5876 4.1 30.1 34.1 90.3 1.41
128 5876 1.9 19.1 21.0 146.8 1.15
  
GEO-1438 4 6216 90.3 901.5 991.7
8 6216 47.5 478.9 526.4 7.5 0.94
16 6216 24.7 239.4 264.1 15.0 0.94
32 6216 13.6 121.0 134.6 29.5 0.92
64 6216 8.2 60.9 69.1 57.4 0.90
128 6216 4.2 29.5 33.8 117.5 0.92
256 6216 2.3 19.1 21.4 185.4 0.72
  
CUBE-6091 16 15796 121.5 2624.8 2746.2
32 15796 62.2 1343.8 1406.0 31.3 0.98
64 15796 32.5 737.0 769.5 57.1 0.89
128 15796 17.3 388.4 405.7 108.3 0.85
256 15796 9.1 183.9 192.9 227.8 0.89
512 15796 5.7 106.0 111.7 393.5 0.77
1024 15796 3.8 76.6 80.4 546.6 0.53

5.2. BFSAI-IC-DACG Results

We present in this section the results of DACG accelerated by the BFSAI-IC preconditioner for the approximation of the s = 10 leftmost eigenpairs of the matrices described above.

Table 3 provides the iteration count and total CPU time for BFSAI-DACG with different combinations of the parameters needed to construct the BFSAI-IC preconditioner for matrix PO-878, using from 2 to 8 processors. It can be seen from Table 3 that the assessment of the optimal parameters ε, ρ_B, and ρ_L is not an easy task, since the number of iterations may vary considerably with the number of processors. In this case we chose the combination of parameters producing the second smallest total time with p = 2, 4, 8 processors. After intensive testing on all the test problems, we similarly selected the "optimal" values used in the numerical experiments reported in Table 4:
  • (i)

    FAULT-639: d = 3, ε = 0.05, ρB = 10, ρL = 60,

  • (ii)

    PO-878: d = 3, ε = 0.05, ρB = 10, ρL = 10,

  • (iii)

    GEO-1438: d = 3, ε = 0.05, ρB = 10, ρL = 50,

  • (iv)

    CUBE-6091: d = 2, ε = 0.01, ρB = 0, ρL = 10.

Table 3. Performance of BFSAI-DACG for matrix PO-878 with 2 to 8 processors and different parameter values.
d ρB ε ρL p = 2 p = 4 p = 8
iter Ttot iter Ttot iter Ttot
2 10 0.01 10 2333 385.76 2877 286.24 3753 273.77
2 10 0.05 10 2345 415.81 2803 245.42 3815 142.93
2 10 0.05 20 2186 370.86 2921 276.41 3445 257.18
2 10 0.00 10 2328 445.16 2880 241.23 3392 269.41
2 20 0.05 10 2340 418.20 2918 224.32 3720 253.98
3 10 0.05 10 2122 375.17 2638 228.39 3366 149.59
3 10 0.05 20 1946 433.04 2560 304.43 3254 263.51
3 10 0.05 30 1822 411.00 2481 321.30 3176 179.67
3 10 0.05 40 1729 439.47 2528 346.82 3019 188.13
4 10 0.05 10 2035 499.45 2469 350.03 3057 280.31
Table 4. Number of iterations for BFSAI-DACG in the computations of the 10 leftmost eigenpairs.
Matrix nb
2 4 8 16 32 64 128 256 512 1024
FAULT-639 1357 1434 1594 2002 3053 3336 3553
PO-878 2122 2638 3366 4157 4828 5154 5373
GEO-1438 1458 1797 2113 2778 3947 4647 4850 4996
CUBE-6091 5857 6557 7746 8608 9443 9996 10189 9965

The user-specified parameters for BFSAI-IC given above show that it is important to build a dense preconditioner based on the lower nonzero pattern of A^3 (except for CUBE-6091, which is built on a regular discretization), with the aim of decreasing the number of DACG iterations. In any case, the cost of computing such a dense preconditioner turns out to be almost negligible with respect to the wall clock time needed to iterate to convergence.

We recall that, presently, the BFSAI-IC code is implemented in OpenMP only, so the results in terms of CPU time are significant only for p ≤ 8. For this reason the iteration numbers reported in Table 4 are obtained with an increasing number of blocks nb and with p = 8 processors. These iteration counts account for a potential implementation of BFSAI-DACG under the MPI (or hybrid OpenMP-MPI) environment, as the number of iterations depends only on the number of blocks, irrespective of the number of processors.

The only meaningful comparison between FSAI-DACG and BFSAI-DACG can therefore be carried out in terms of iteration numbers, which are smaller for BFSAI-DACG when a small number of processors is used. The gap between FSAI and BFSAI iterations narrows as the number of processors increases.

5.3. Comparison with the LOBPCG Eigensolver Provided by hypre

In order to validate the effectiveness of the proposed preconditioned DACG algorithm against already available public parallel eigensolvers, the results given in Tables 2 and 4 are compared with those obtained with the schemes implemented in the hypre software package [14]. The Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method [15] is tested with the different preconditioners developed in the hypre project, that is, algebraic multigrid (AMG), diagonal scaling (DS), approximate inverse (ParaSails), additive Schwarz (Schwarz), and incomplete LU (Euclid). The hypre preconditioned CG is used for the inner iterations within LOBPCG. For details on the implementation of the LOBPCG algorithm see, for instance, [22]. The selected preconditioner, ParaSails, is in turn based on the FSAI preconditioner, so that the different FSAI-DACG and ParaSails-LOBPCG performances should be ascribed mainly to the different eigensolvers rather than to the preconditioners.

We first carried out a preliminary set of runs with the aim of assessing the optimal value of the block size parameter bl, that is, the size of the subspace in which the eigenvectors are sought. Obviously it must be bl ≥ s = 10. We fixed the number of processors at 16 and obtained the results summarized in Table 5 for different values of bl ∈ [10, 15]. We found that only in problem CUBE-6091 does a value of bl larger than 10, namely bl = 12, yield an improvement in the CPU time. Note that we also made this comparison with different numbers of processors and obtained analogous results.

Table 5. Iterations and CPU time for the iterative solver of LOBPCG-hypre preconditioned by ParaSails with different values of bl and p = 16 processors.
Matrix bl = 10 bl = 11 bl = 12 bl = 15
iter Titer iter Titer iter Titer iter Titer
FAULT-639 156 79.5 157 85.3 157 96.1 160 128.1
PO-878 45 117.0 41 131.6 38 151.3 35 192.6
GEO-1438 23 123.7 72 173.7 30 152.5 121 291.1
CUBE-6091 101 1670.5 143 2414.0 38 1536.7 35 1680.9

Table 6 presents the number of iterations and timings obtained with the LOBPCG algorithm in the hypre package. The LOBPCG wall clock time is obtained with the preconditioner allowing for the best performance on the specific problem at hand, that is, ParaSails for all the problems. Using AMG as the preconditioner did not allow for convergence in three cases out of four, the only exception being the FAULT-639 problem, for which the CPU timings were however much larger than with ParaSails.

Table 6. Number of iterations, timings, and scalability of LOBPCG-hypre preconditioned by ParaSails.
p iter T_prec T_iter T_tot S_p E_p
FAULT-639 (pcgitr = 5)
4 155 2.5 331.2 333.7
8 156 1.3 167.6 168.9 7.9 0.99
16 156 0.8 79.5 80.3 16.6 1.04
32 150 0.5 38.8 39.3 34.0 1.06
64 145 0.3 22.2 22.5 59.4 0.93
128 157 0.1 14.8 14.9 89.7 0.70
  
PO-878 (pcgitr = 30)
4 45 3.3 438.4 441.7
8 50 1.3 232.3 234.0 7.6 0.94
16 45 1.0 117.0 118.0 15.0 0.94
32 45 0.7 63.2 63.9 27.6 0.86
64 47 0.4 34.4 34.8 50.8 0.79
128 41 0.3 19.44 19.74 89.5 0.70
  
GEO-1438 (pcgitr = 30)
4 26 7.7 478.0 485.7
8 22 4.0 256.8 260.8 7.5 0.93
16 23 2.1 123.7 125.8 15.4 0.96
32 28 1.2 73.1 74.3 26.2 0.82
64 23 0.8 35.5 36.3 53.5 0.84
128 25 0.5 20.3 20.8 93.2 0.73
256 26 0.3 12.9 13.2 147.2 0.57
  
CUBE-6091 16 38 9.2 1536.7 1545.9
32 36 4.7 807.5 812.2 30.5 0.95
64 38 3.2 408.2 411.4 60.1 0.94
128 41 1.6 251.4 253.0 97.8 0.76
256 35 0.9 105.9 106.8 231.6 0.90
512 39 0.6 65.3 65.9 375.3 0.73
1024 37 0.3 37.7 38.0 650.9 0.64

All matrices had to be preliminarily scaled by their maximum coefficient in order to allow for convergence. To make the comparison meaningful, the outer iterations of the different methods are stopped when the average relative error measure of the computed leftmost eigenpairs falls below 10^{-10}, so as to obtain an accuracy comparable to that of the other codes. We also report in Table 6 the number of inner preconditioned CG iterations (pcgitr).

To better compare our FSAI-DACG with the LOBPCG method, we plot in Figure 1 the total CPU time versus the number of processors for the two codes. FSAI-DACG and LOBPCG provide very similar scalability, with the latter code performing slightly better on average. On the FAULT-639 problem, DACG proves faster than LOBPCG, irrespective of the number of processors employed.

Figure 1: Comparison between FSAI-DACG and LOBPCG-hypre in terms of total CPU time for different numbers of processors (one panel per test problem).

Finally, we have carried out a comparison of the two eigensolvers in the computation of the leftmost eigenpair only. Unlike LOBPCG, which approximates all the selected eigenpairs simultaneously, DACG computes the selected eigenpairs sequentially. For this reason, DACG should in principle be the better choice when just one eigenpair is sought. We investigate this feature, and the results are summarized in Table 7, which reports the total CPU time and iteration count needed by LOBPCG and FSAI-DACG to compute the leftmost eigenpair with 16 processors. For the LOBPCG code we report only the number of outer iterations.

Table 7. Performance of LOBPCG-hypre with ParaSails and 16 processors in the computation of the smallest eigenvalue using bl = 1 and bl = 2, compared with FSAI-DACG.
LOBPCG, bl = 1 LOBPCG, bl = 2 FSAI-DACG
iter Ttot iter Ttot iter Ttot
FAULT-639 144 10.1 132 17.5 1030 15.4
PO-878 99 43.2 34 29.1 993 20.4
GEO-1438 55 40.2 26 37.3 754 27.0
CUBE-6091 144 5218.8 58 522.4 3257 561.1
The parameters used to construct the FSAI preconditioner for these experiments are as follows:
  • (1)

    FAULT-639. δ = 0.1, d = 2, ε = 0.05,

  • (2)

    PO-878. δ = 0.2, d = 4, ε = 0.1,

  • (3)

    GEO-1438. δ = 0.1, d = 2, ε = 0.1,

  • (4)

    CUBE-6091. δ = 0.0, d = 1, ε = 0.05.

These parameters differ from those employed to compute the FSAI preconditioner in the assessment of the 10 leftmost eigenpairs and have been selected in order to produce a preconditioner that is relatively cheap to compute; otherwise the setup time would prevail over the iteration time. Similarly, to compute just one eigenpair with LOBPCG we need to set a different value of pcgitr, the number of inner iterations. As can be seen from Table 7, in the majority of the test cases LOBPCG takes less time to compute two eigenpairs than just one. FSAI-DACG proves more efficient than the best LOBPCG run on problems PO-878 and GEO-1438. On the remaining two problems the slow convergence exhibited by DACG is probably due to the small relative separation between λ1 and λ2.

6. Conclusions

We have presented the parallel DACG algorithm for the partial eigensolution of large sparse SPD matrices. The scalability of DACG, accelerated with FSAI-type preconditioners, has been studied on a set of test matrices of very large size arising from real engineering mechanical applications. Our FSAI-DACG code has shown performance comparable to that of the LOBPCG eigensolver within the well-known public domain package hypre. The numerical results reveal that not only is the scalability achieved by our code roughly identical to that of hypre, but in some instances FSAI-DACG also proves more efficient in terms of absolute CPU time. In particular, for the computation of the leftmost eigenpair, FSAI-DACG is more convenient in 2 problems out of 4.

Acknowledgment

The authors acknowledge the CINECA Iscra Award SCALPREC (2011) for the availability of HPC resources and support.
