Research Article

A Kernel Regression Approach to Gene-Gene Interaction Detection for Case-Control Studies

Nicholas B. Larson (Corresponding Author)

Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota

Correspondence to: Nicholas B. Larson, Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905. E-mail: [email protected]

Daniel J. Schaid

Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota

First published: 19 July 2013

https://doi.org/10.1002/gepi.21749

Citations: 19

ABSTRACT

Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attention has moved beyond single-locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.

Introduction

Genome-wide association studies (GWAS) are a popular approach toward investigating the genetic component of complex diseases. Through the use of high-throughput genotyping chips, GWAS can simultaneously characterize hundreds of thousands of single-nucleotide polymorphisms (SNPs) for a given subject. Analysis of GWAS data typically involves the isolated evaluation of individual SNPs for association with a given phenotype. Despite much success in identification of associated loci [Hindorff et al., 2009], such findings generally are of modest effect and often explain only a small proportion of heritability in complex phenotypes [Manolio et al., 2009]. This “missing heritability” has prompted investigators to consider alternative sources of genetic variation in association analysis.

It is well established that coding products of some genes interact with one another molecularly in complex networks, such as enzymatic reactions and signaling cascades [Bonetta, 2010]. Such interactions may contribute to the genetic variation of complex traits [Moore, 2003], with multiple examples documented [Howard et al., 2002; Li et al., 2012; Moore and Williams, 2002; Sima et al., 2012]. Statistically, gene-gene interactions are defined as deviations from additive marginal effects of individual genes [Kempthorne, 1954], and our references to gene-gene interactions hereafter follow this statistical definition. In regard to genotyping data, pairwise gene-gene interactions can be considered at the SNP level as statistical interactions between two SNPs in respective genes of interest. Similar to single-marker regression analysis, SNP-SNP interaction analysis can be framed as a traditional regression-based analysis by including pairwise interaction terms in a generalized linear model. It is important to note that this definition of interaction does not necessarily coincide with the biological interpretation of interaction, and that one does not necessarily imply the other [Greenland, 2009]. Although the utility of identifying such interactions with respect to explaining missing heritability is contentious [Aschard et al., 2012; Moore and Williams, 2009], such interactions can at the very least contribute to our understanding of complex disease etiology.

Advancements in both genotyping technology and imputation methodology have increased the density of genotyped markers in the coding regions of genes. Moreover, large-scale next-generation sequencing technologies, such as whole exome/genome sequencing, interrogate all genetic variation within regions of interest. Unlike traditional GWAS, these tools yield dense genotype data. Under such conditions, exhaustive genome-wide evaluation of SNP-level pairwise interaction is computationally burdensome [Moore and Ritchie, 2004]. Thus, the development of statistically powerful and computationally efficient algorithms for detecting these interactions is of great interest. A comprehensive review of gene-gene interaction analysis can be found in Cordell [2009].
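To give a sense of the scale of an exhaustive scan, the sketch below counts unordered SNP pairs and the matching Bonferroni threshold. The panel size of 500,000 SNPs and the family-wise level of 0.05 are illustrative assumptions, not values taken from this study.

```python
from math import comb


def pairwise_tests(n_snps: int) -> int:
    """Number of unordered SNP pairs in an exhaustive pairwise scan."""
    return comb(n_snps, 2)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_pairs = pairwise_tests(500_000)  # hypothetical genome-wide panel size
print(n_pairs)                     # 124999750000
print(bonferroni_threshold(n_pairs))
```

Because the pair count grows quadratically with the number of markers, grouping SNPs into genes before testing shrinks the number of hypotheses by orders of magnitude.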

Gene-level testing has recently grown in popularity due to its dimensional reduction and biological interpretability [Jorgenson and Witte, 2006; Neale and Sham, 2004]. In contrast to single-SNP analyses, such tests allow for all of the SNPs within the region of a gene to be modeled jointly as a set and can take into account the linkage disequilibrium (LD) structure within the gene. By grouping SNPs based upon prior biological information, SNP-set testing may improve power and increase the chance of reproducible significant findings [Wu et al., 2010], particularly when multiple causal SNPs are present in a given gene. Although SNP-set approaches are not necessarily restricted to gene-level definition, the gene as a functional unit is a natural choice and provides an intuitive decomposition of the genome.

Kernel machine methods in particular have provided a successful tool in SNP-set association testing [Kwee et al., 2008; Wu et al., 2010, 2011]. Such approaches determine genetic association through representations of genomic similarity between pairs of subjects [Schaid, 2010a, 2010b]. Recently, Li and Cui presented a gene-level interaction approach for continuously valued quantitative traits using a kernel machine smoothing-spline ANOVA model, which they refer to as SPA3G [Li and Cui, 2012]. An application of this method for a binary response, such as disease status, presents unique challenges that preclude a direct application of SPA3G, notably that the response can no longer be assumed to be Gaussian distributed. These challenges motivated our work to adapt the methods within SPA3G to be applicable to case-control studies.

In this paper, we outline a comprehensive approach toward hypothesis testing for marginal and interaction effects of genes in association analysis for dichotomous responses using regression-based score tests. In addition to detailing omnibus and marginal tests, we define a kernel regression approach toward gene-gene interaction detection for a dichotomous response under a generalized linear mixed model (GLMM) framework. We evaluate the performance of these testing approaches using coalescent simulation data under a variety of experimental conditions and investigate their relation to one another within the context of multiple epistatic models. We also compare our approach to exhaustive SNP-SNP logistic regression and two leading gene-level gene-gene interaction methods. Finally, we discuss the implications of our findings and suggest future directions for further development.

Methods

Consider a case-control association study involving N individuals, such that N is composed of N_Case cases and N_Cont controls. Let $y = (y_1, \ldots, y_N)^T$ be a binary representation of case-control status, such that $y_j = 1$ if the jth subject is designated a case and 0 otherwise. Let $X$ be an $N \times p$ set of any additional covariate data, and $G_1$ and $G_2$ be respective $N \times q_1$ and $N \times q_2$ matrices of genotypes for markers contained within the regions of genes 1 and 2, where q₁ and q₂ correspond to the number of respective markers within each gene. It is assumed that these regions are defined a priori based upon some relevant biological criteria. We define genotypes under an additive model, such that $g_{ijk}$ is the integer count of minor alleles observed at marker k in gene i for subject j.

Using a positive-definite kernel function, $k(\cdot, \cdot)$, we can map $G_i$ to some Hilbert space through the mapping $\phi: \mathbb{R}^{q_i} \rightarrow \mathcal{H}$ such that $\mathcal{H}$ is an inner product space. This is accomplished through the “kernel trick” [Schölkopf and Smola, 2002] that calculates inner products in $\mathcal{H}$ through the given kernel function, such that

$k(g_{ij}, g_{ij'}) = \langle \phi(g_{ij}), \phi(g_{ij'}) \rangle_{\mathcal{H}}$

where $g_{ij}$ represents all the marker genotypes for gene i for subject j. The kernel function circumvents the necessity to calculate the explicit mappings $\phi(g_{ij})$, yielding the kernel space mapping $K_i$ of the respective original genotype matrix $G_i$. This kernel matrix $K_i$ is an $N \times N$ full Gram matrix, such that the element-wise definition is given as $K_i[j, j'] = k(g_{ij}, g_{ij'})$ for $j, j' = 1, \ldots, N$. From Aronszajn [1950], we also define the interaction kernel matrix $K_3$ as $K_3 = K_1 \circ K_2$, where the operator ○ represents the Hadamard, or element-wise, product. Through $K_1$, $K_2$, and $K_3$, the genetic effect of the two genes of interest on the phenotypic variation is decomposed into main and interaction effects. These matrices in turn can be applied in a mixed-model context as underlying covariance structures for variance components. Let $\mu_i$ represent the probability that the ith observation is a case, and $\mu = (\mu_1, \ldots, \mu_N)^T$. We consider a mixed effects logistic model for $\mu$, such that

$\mathrm{logit}(\mu) = X\beta + h_1 + h_2 + h_3$

where $h_1$, $h_2$, and $h_3$ are independent $N \times 1$ random effect vectors, and $h_i \sim N(0, \tau_i K_i)$.
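As a concrete illustration of the kernel construction, the sketch below builds linear-kernel Gram matrices for two toy genes and forms their Hadamard product. The genotype values, the choice of an unweighted linear kernel, and all function names are assumptions made for illustration only.

```python
def linear_kernel(G):
    """Gram matrix whose (j, j') entry is the inner product of rows j and j' of G."""
    return [[sum(a * b for a, b in zip(gj, gk)) for gk in G] for gj in G]


def hadamard(K1, K2):
    """Element-wise product of two equally sized matrices."""
    return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(K1, K2)]


def product_features(G1, G2):
    """Per subject, all products of a gene 1 marker with a gene 2 marker."""
    return [[a * b for a in g1 for b in g2] for g1, g2 in zip(G1, G2)]


# Toy minor-allele counts (0, 1, or 2) for 4 subjects:
# 3 markers in gene 1 and 2 markers in gene 2.
G1 = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 0, 0]]
G2 = [[1, 0], [0, 2], [1, 1], [2, 2]]

K1 = linear_kernel(G1)
K2 = linear_kernel(G2)
K3 = hadamard(K1, K2)  # interaction kernel

# For linear kernels, the Hadamard product equals a linear kernel on the
# pairwise cross-gene marker products.
assert K3 == linear_kernel(product_features(G1, G2))
```

The final check shows why the element-wise product is a natural interaction kernel in the linear case: it measures subject similarity in the space of all cross-gene SNP-by-SNP products without ever forming those q₁ × q₂ features explicitly.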

Global Hypothesis Test

Define the omnibus, or global, hypothesis of no genetic effect such that all variance components are zero, $H_0: \tau_1 = \tau_2 = \tau_3 = 0$. The score statistic is defined as $Q_0 = (y - \hat{\mu})^T K_0 (y - \hat{\mu})$, where $K_0 = K_1 + K_2 + K_3$ and $\hat{\mu}$ are the fitted values of μ on $X$ under H₀. Under the null hypothesis, Q₀ is asymptotically distributed as a weighted mixture of chi-square distributions [Liu et al., 2008]. Although there are a number of methods to characterize this distribution for purposes of hypothesis testing, we employ Pearson's three-moment approach [Imhof, 1961] because the approximation error can be bounded.
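The three-moment approximation can be sketched as follows. Given the mixture weights (eigenvalues) of the null distribution, it matches the mean, variance, and skewness of the weighted chi-square mixture to a shifted and scaled chi-square variable. The function name and interface are hypothetical, and this is a minimal sketch rather than the implementation used in the study.

```python
from math import sqrt


def pearson_three_moment(eigenvalues, q):
    """Map an observed statistic q onto a reference chi-square scale.

    For Q distributed as sum(lambda_i * chi2_1), returns (h, y) such that
    P(Q > q) is approximated by P(chi2_h > y), with h degrees of freedom.
    """
    c1 = sum(eigenvalues)                       # mean of Q
    c2 = sum(lam ** 2 for lam in eigenvalues)   # variance of Q is 2 * c2
    c3 = sum(lam ** 3 for lam in eigenvalues)   # third cumulant of Q is 8 * c3
    h = c2 ** 3 / c3 ** 2
    y = (q - c1) * sqrt(h / c2) + h
    return h, y


# With equal unit weights the mixture is an exact chi-square,
# and the mapping leaves q unchanged.
print(pearson_three_moment([1.0, 1.0, 1.0], 5.0))  # (3.0, 5.0)
```

The tail probability then follows from any chi-square survival function that accepts non-integer degrees of freedom, for example `scipy.stats.chi2.sf(y, h)` if SciPy is available.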

Marginal and Interaction Hypothesis Tests

It is possible to test for the presence of marginal effects of each gene individually by using the respective kernel matrix in the framework of the score statistic, such that $Q_i = (y - \hat{\mu})^T K_i (y - \hat{\mu})$ for $i = 1, 2$. This is equivalent to the sequence kernel association test (SKAT) [Wu et al., 2011]. If there are no marginal effects present ($\tau_1 = 0$, $\tau_2 = 0$), we can also test specifically for a statistical interaction between genes 1 and 2 via the score statistic $Q_3 = (y - \hat{\mu})^T K_3 (y - \hat{\mu})$, which we refer to as the interaction test. For any of these tests, we again approximate the null distribution of $Q_i$ by Pearson's approximation.
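A SKAT-type score statistic is a quadratic form in the null-model residuals, so the same routine serves the marginal and interaction tests once the kernel matrix is swapped. The sketch below assumes an intercept-only null model (no covariates), for which the fitted case probability is simply the sample case fraction, and omits any overall scaling constant. The toy kernel and function names are hypothetical.

```python
def null_fitted_values(y):
    """Fitted case probabilities under an intercept-only logistic null model."""
    p = sum(y) / len(y)
    return [p] * len(y)


def score_statistic(y, mu_hat, K):
    """Quadratic form r' K r with residuals r = y - mu_hat."""
    r = [yj - mj for yj, mj in zip(y, mu_hat)]
    n = len(r)
    return sum(r[j] * K[j][k] * r[k] for j in range(n) for k in range(n))


y = [1, 0, 1, 0]                      # toy case-control labels
K = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 2, 1],
     [0, 0, 1, 2]]                    # toy symmetric kernel matrix
Q = score_statistic(y, null_fitted_values(y), K)
print(Q)  # 1.0
```

With covariates, the fitted values would instead come from a standard logistic regression of case status on those covariates.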

Composite Hypothesis Test

We also define a test specifically for an interaction effect adjusting for the presence of marginal gene effects ( $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0049$ ), such that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0050$ . This requires fitting the null GLMM that includes the main effects of the two genes, which may be conducted using penalized quasi-likelihood (PQL) [Breslow and Clayton, 1993]. Maximum likelihood approaches toward fitting GLMMs involve intractable integration of high dimension, and PQL utilizes Laplace approximation in order to accommodate this integration through iterative estimation of the fixed and random model components. For our purposes, we fit this model using the glmmPQL function from the MASS library in R [Venables and Ripley, 2002].

Definition of the corresponding score statistic is complicated by the fact that the covariance matrix is no longer diagonal, but includes off-diagonal binomial covariances that are difficult to obtain. One remedy is to adapt work by Lin [1997], which outlines score statistics for variance component testing in GLMMs as follows. Define $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0051$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0052$ to be diagonal $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0053$ matrices with corresponding diagonal elements

$urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0054$

where $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0055$ is the link function in the GLMM, $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0056$ denotes the first derivative of $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0057$ with respect to $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0058$ , $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0059$ is the corresponding variance function, and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0060$ is the mean for the jth subject under the null model. Because we apply the canonical logit link function, it follows that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0061$ . From Lin [1997], we define $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0062$ to be the PQL working vector under the null GLMM, such that

$urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0063$

Then, we define the restricted maximum likelihood (REML) version of our composite score statistic to be

$urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0064$

where $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0065$ is the null projection matrix and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0066$ is the estimated null covariance matrix with variance component parameter estimates $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0067$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0068$ . Although Lin goes on to define a normalized version of the score statistic, our early findings indicated strong biases for a dichotomous response under the null. Similar to the global and marginal score tests, we derive the null distribution for $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0069$ using the Pearson's approximation.
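For a logit link, the PQL working response of Breslow and Clayton [1993] has a simple closed form: the linear predictor plus the residual scaled by the derivative of the link. The sketch below uses assumed notation (`eta` for the null linear predictor, `expit` for the inverse link) and is meant only to illustrate the quantity, not to reproduce the fitting done by glmmPQL.

```python
from math import exp


def expit(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + exp(-eta))


def working_vector(y, eta):
    """PQL working response eta + (y - mu) * g'(mu), with g'(mu) = 1 / (mu * (1 - mu))."""
    out = []
    for yj, ej in zip(y, eta):
        mu = expit(ej)
        out.append(ej + (yj - mu) / (mu * (1.0 - mu)))
    return out


def working_weights(eta):
    """Iterative weights mu * (1 - mu), the binomial variance function at the fitted mean."""
    return [expit(ej) * (1.0 - expit(ej)) for ej in eta]


print(working_vector([1, 0], [0.0, 0.0]))  # [2.0, -2.0]
print(working_weights([0.0]))              # [0.25]
```

Under the canonical link the derivative of the link is the reciprocal of the variance function, which is why the same quantity, μ(1 − μ), appears in both the scaling and the weights.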

Computational Considerations

Fitting the composite null model using PQL requires that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0070$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0071$ be decomposed into corresponding square-root matrices $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0072$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0073$ , such that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0074$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0075$ . When a linear (or weighted linear) kernel is used, this is easily accommodated because $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0076$ , where $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0077$ is a diagonal weight matrix, such that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0078$ . If a nonlinear kernel function, such as the Gaussian kernel, is used, then this may be completed using the incomplete Cholesky decomposition [Kershaw, 1978] of $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0079$ , whereby $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0080$ is the lower triangle matrix. Then, the random effects $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0081$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0082$ are modeled as $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0083$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0084$ , such that $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0085$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0086$ . Because such decompositions can be computationally intensive, there is initial appeal to the use of some form of linear kernel for this application, particularly when the number of markers per gene is relatively small.
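For a weighted linear kernel, the square-root matrix is available in closed form by scaling each marker column by the square root of its weight. The sketch below checks this on toy data. The weights, genotypes, and function names are illustrative assumptions.

```python
from math import sqrt


def weighted_linear_kernel(G, w):
    """Kernel G diag(w) G' for genotype rows G and nonnegative marker weights w."""
    return [[sum(wk * a * b for wk, a, b in zip(w, gj, gl)) for gl in G] for gj in G]


def kernel_square_root(G, w):
    """Matrix A = G diag(sqrt(w)), so that A A' reproduces the weighted linear kernel."""
    return [[sqrt(wk) * a for wk, a in zip(w, gj)] for gj in G]


def gram(A):
    """Matrix product A A'."""
    return [[sum(a * b for a, b in zip(r, s)) for s in A] for r in A]


G = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]   # toy genotypes: 3 subjects, 3 markers
w = [1.0, 4.0, 0.25]                    # hypothetical marker weights

K = weighted_linear_kernel(G, w)
A = kernel_square_root(G, w)
max_gap = max(abs(K[i][j] - gram(A)[i][j]) for i in range(3) for j in range(3))
print(max_gap)
```

The square-root matrix has only as many columns as the gene has markers, so the random effect can be modeled through a short vector of independent marker-level effects.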

Algorithms for approximating the null distribution of the score statistics ( $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0087$ ) are dependent upon deriving the eigenvalues of $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0088$ for the respective kernel matrix K and projection matrix P of each test, which always will be $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0089$ . This can be computationally demanding, as such decompositions are in practice $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0090$ . However, equivalent eigenvalues can be derived from $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0091$ . This form is more appealing for two reasons: (1) it is guaranteed to be positive definite, which can be exploited by decomposition algorithms; and (2) if $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0092$ , the computational burden of this eigendecomposition is greatly reduced. This can motivate the use of low-rank approximations of $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0093$ , although we leave this topic to future research.
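The equivalence between the large and small eigenproblems can be checked numerically without an eigendecomposition, because two matrices with the same nonzero eigenvalues share the traces of all their powers. The sketch below assumes the large matrix has the form PK with K = AAᵀ and uses a simple centering matrix as a stand-in projection. Both choices, and every name in the code, are assumptions for illustration.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def power_traces(M, k_max=3):
    """Traces of M, M^2, ..., M^k_max, i.e. power sums of the eigenvalues of M."""
    traces, Mk = [], M
    for _ in range(k_max):
        traces.append(sum(Mk[i][i] for i in range(len(M))))
        Mk = matmul(Mk, M)
    return traces


N = 5
A = [[0, 1], [1, 1], [2, 0], [0, 2], [1, 0]]   # toy N x q square-root matrix, q = 2
# Centering matrix I - 11'/N, used here as a hypothetical projection matrix.
P = [[(1.0 if i == j else 0.0) - 1.0 / N for j in range(N)] for i in range(N)]

K = matmul(A, transpose(A))                    # N x N kernel matrix
big = matmul(P, K)                             # N x N form
small = matmul(matmul(transpose(A), P), A)     # q x q form

print(power_traces(big))
print(power_traces(small))
```

The first three power sums are exactly the quantities a three-moment approximation consumes, so for a gene with few markers the q × q form carries all the information needed at a fraction of the cost.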

Kernel Selection

There are multiple options for which kernel function to apply to the marker data [Schaid et al., 2005]. We used a polygenic kernel, which is a linear kernel applied to standardized genotype data. We define the polygenic kernel representation for gene i to be $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0094$ where

$urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0095$

Because this is a type of linear kernel, it affords some computational benefits mentioned previously. However, there may be gains in statistical power in utilizing nonlinear kernel functions, such as the Gaussian kernel, which may be capable of detecting nonlinear interactions.

Simulation Study

In order to assess the properties of type I error rate control and statistical power for our hypothesis tests, we devised a comprehensive simulation study. Our basic simulation strategy was to simulate haplotypes and randomly combine haplotypes to create a large population of genotypes. Then, under a given genetic disease model and prevalence, we simulated disease status and performed case-control sampling to obtain our test data. The details of our simulation are given below.

To simulate genotypic data, we used the calibrated coalescent model simulation software COSI [Schifano et al., 2012] to generate two independent sets of ten thousand 50 kb regions, each representative of a distinct gene. Recombination maps were based upon observed LD structure in samples of European ancestry. A derived minor allele frequency (dMAF) was calculated for each marker based upon its frequency in the haplotype population to represent a population-based value. From these pools of haplotypes, we generated a large population of N_pop genotype profiles for simulated individuals by combining two randomly selected haplotypes. The two gene-wise datasets had 1,017 and 1,040 polymorphic sites, respectively, with 116 and 164 being common SNPs (dMAF ⩾0.05). We then selected a subset of common SNPs for each gene to represent our simulation genotyped marker data, such that the maximum pairwise Pearson correlation between any two SNPs in a given gene was ⩽0.50. This resulted in 12 and 25 genotyped SNPs for genes 1 and 2, respectively, ranging in dMAF from 0.05 to 0.49. LD plots of both SNP sets are found in Figure 1.
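The correlation-based marker selection can be sketched as a greedy filter. The text specifies only the 0.50 cap on pairwise Pearson correlation, so the greedy visiting order and the use of the absolute correlation are assumptions here, as are the toy genotype columns.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prune_by_correlation(columns, max_abs_r=0.50):
    """Keep each SNP only if its |r| with every previously kept SNP is <= max_abs_r."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson_r(col, columns[k])) <= max_abs_r for k in kept):
            kept.append(idx)
    return kept


# Toy genotype columns (one list per SNP, one entry per subject).
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],   # strongly correlated with the first SNP
    [2, 0, 1, 1, 0, 2],   # uncorrelated with the first SNP
]
print(prune_by_correlation(snps))  # [0, 2]
```

A greedy pass depends on the order in which SNPs are visited, so a different ordering can retain a different, equally valid subset.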

**Figure 1**

Pairwise linkage disequilibrium plots of the simulation SNPs for (A) gene 1 and (B) gene 2.

To simulate disease status for given genotypes, we adopted a model parameterization applied by Aschard et al. [2012], which used a log-additive approach such that the marginal and interaction effects are independent in order to directly control the marginal and interaction effect sizes. This approach uses a recoding of the genotype values $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0096$ to corresponding genotype weights, $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0097$ , which are based upon the dMAF of the respective SNPs. Let Ω₁ and Ω₂ respectively define the subsets of gene 1 and gene 2 SNPs selected to be causal. Dichotomous phenotypes are then simulated via a log-linear model with probability of occurrence $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0098$ , such that for subject j

$urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0099$

where log indicates the natural logarithm, a₀ is the population average prevalence, $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0100$ and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0101$ the marginal effects for the respective SNPs, and $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0102$ the interaction effect between SNP l in gene 1 and SNP m in gene 2, with $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0103$ (0 or 1) an indicator for the presence of that specific interaction. The genotype weights $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0104$ are functions of the population-level MAF (dMAF) of the respective SNPs, and are defined such that the expected effect of each interaction term conditional on a specific genotype at one locus is always equal to 0 (see Aschard et al. [2012] for details). We let all marginal effects be randomly selected uniformly between log(1.1) and log(1.3) to reflect realistic relative risk (RR) values observed in GWAS. By setting various effect components to be null, we also control which genetic effects are present in our disease model. For each simulation, we generated a population of $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0105$ genotypes and performed case-control sampling, with disease prevalence fixed at 0.10. All causal SNPs were randomly selected for each simulation replication.
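The final sampling step can be sketched as follows: disease status is drawn for every member of the simulated population from that subject's risk, and equal numbers of cases and controls are then sampled without replacement. The constant risk of 0.10, the population size of 20,000, and the function name are placeholders. In the study, each subject's risk would come from the log-linear disease model.

```python
import random


def sample_case_control(risks, n_cases, n_controls, seed=0):
    """Draw disease status from per-subject risks, then sample cases and controls.

    Returns two lists of population indices: (cases, controls).
    """
    rng = random.Random(seed)
    status = [1 if rng.random() < p else 0 for p in risks]
    case_pool = [j for j, s in enumerate(status) if s == 1]
    control_pool = [j for j, s in enumerate(status) if s == 0]
    # random.sample raises ValueError if a pool is smaller than the request.
    return rng.sample(case_pool, n_cases), rng.sample(control_pool, n_controls)


population_risk = [0.10] * 20_000   # placeholder: every subject at the 0.10 prevalence
cases, controls = sample_case_control(population_risk, 500, 500)
print(len(cases), len(controls))    # 500 500
```

The simulated population must be large enough that the case pool comfortably exceeds the number of cases requested, which at a prevalence of 0.10 means roughly ten times as many subjects as cases.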

Finally, given that gene-gene interaction analysis is an active area of research, we compared the power of our testing procedures to gene-based Bonferroni-adjusted single SNP-SNP logistic regression, along with two leading gene-level approaches: kernel canonical correlation analysis (KCCA) [Larson et al., 2013; Yuan et al., 2012] and principal component (PC) analysis-based logistic regression modeling (PC-LR). KCCA is an LD-based procedure, which uses kernelized canonical correlation analysis to test for differences in association between genes across case-control status using a Gaussian kernel function. Variations of PC-LR [Bhattacharjee et al., 2010; He et al., 2011; Wang and Abbott, 2008] have been shown to be powerful approaches for gene-level interaction analysis by reducing the marker data for a given gene to a few leading PCs. For our PC-LR analysis, we derive the lead PC term from each gene and test the statistical significance of their interaction in the presence of their marginal effects within a basic logistic regression model.
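A gene-based Bonferroni adjustment of SNP-SNP results can be sketched as taking the smallest pairwise p-value and multiplying by the number of SNP pairs tested between the two genes. With 12 and 25 genotyped SNPs, that is 300 pairs. The p-values below are made up, and the function name is hypothetical.

```python
def gene_pair_bonferroni(p_values):
    """Gene-pair p-value: smallest SNP-SNP p-value times the number of pairs, capped at 1."""
    return min(1.0, min(p_values) * len(p_values))


n_pairs = 12 * 25                            # SNP pairs between the two simulated genes
p_values = [0.5] * (n_pairs - 1) + [1e-4]    # fabricated p-values for illustration only
print(gene_pair_bonferroni(p_values))        # approximately 0.03
```

This correction is conservative when SNPs within a gene are correlated, which is one reason set-based tests can gain power over it.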

Results

Type I Error

We examined type I error rate control for sample sizes of 1,000, 1,500, and 2,000, with balanced numbers of cases and controls. For the global, marginal, and interaction tests, a total of 100,000 simulation runs were performed for each sample size, with type I error rates evaluated at α levels of 0.001 and 0.0001. Table 1 presents the type I error simulation results for these tests, and Figure 2 presents QQ plots of the respective −log10 transformed p-values. These tests exhibit near-nominal type I error rates across all α levels, with the interaction test tending to be more conservative for smaller sample sizes.

Table 1. Complete null type I error rates for global, marginal, and interaction tests

	Global test		Marginal test		Interaction test	
N	α = 1 × 10⁻³	α = 1 × 10⁻⁴	α = 1 × 10⁻³	α = 1 × 10⁻⁴	α = 1 × 10⁻³	α = 1 × 10⁻⁴
1,000	8.3 × 10⁻⁴	5.0 × 10⁻⁵	9.3 × 10⁻⁴	6.0 × 10⁻⁵	3.7 × 10⁻⁴	1.0 × 10⁻⁵
1,500	8.0 × 10⁻⁴	6.0 × 10⁻⁵	1.1 × 10⁻³	1.1 × 10⁻⁴	5.4 × 10⁻⁴	3.0 × 10⁻⁵
2,000	8.7 × 10⁻⁴	6.0 × 10⁻⁵	1.1 × 10⁻³	1.2 × 10⁻⁴	7.0 × 10⁻⁴	4.0 × 10⁻⁵
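When reading empirical error rates such as those in Table 1, it helps to know the Monte Carlo uncertainty of an estimated rejection rate. The sketch below computes the binomial standard error for a given α level and number of runs. The two-standard-error band is a rule of thumb assumed here, not a criterion used by the authors.

```python
from math import sqrt


def monte_carlo_se(alpha, n_runs):
    """Binomial standard error of an empirical rejection rate when the true rate is alpha."""
    return sqrt(alpha * (1.0 - alpha) / n_runs)


def within_two_se(observed, alpha, n_runs):
    """True if the observed rate lies within two standard errors of the nominal level."""
    return abs(observed - alpha) <= 2.0 * monte_carlo_se(alpha, n_runs)


print(monte_carlo_se(1e-3, 100_000))          # about 1.0e-04
print(within_two_se(8.3e-4, 1e-3, 100_000))   # True: global test, N = 1,000
print(within_two_se(3.7e-4, 1e-3, 100_000))   # False: interaction test, N = 1,000
```

On this scale, the interaction test at N = 1,000 falls well below the nominal level, consistent with the conservativeness noted in the text.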

We also examined type I error rate control for the composite test when marginal effects are present in both genes but there is no interaction ( $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0106$ ), and contrast it with that of the interaction test where such marginal effects are not taken into account. We considered disease models where the number of causal markers per gene was 1 or 2, and ran 4,000 replications. Results for the error rates of the two tests can be found in Table 2 at α levels of 0.05 and 0.01. Interestingly, the findings indicate that both the interaction test and composite test control the type I error rate under both models despite the lack of marginal effect adjustment for the interaction test.

Table 2. Type I error rates for interaction and composite tests with marginal effects present

	1 Causal SNP per gene				2 Causal SNPs per gene			
	Interaction (Q₃)		Composite $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0107$		Interaction (Q₃)		Composite $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0108$	
N	α = 0.05	α = 0.01	α = 0.05	α = 0.01	α = 0.05	α = 0.01	α = 0.05	α = 0.01
1,000	0.0390	0.0090	0.0398	0.0088	0.0355	0.0058	0.0378	0.0050
1,500	0.0385	0.0065	0.0375	0.0063	0.0408	0.0070	0.0398	0.0070
2,000	0.0420	0.0063	0.0438	0.0068	0.0440	0.0108	0.0445	0.0108

Power

We first considered a set of simulations in which there were single causal interacting SNPs in each gene for sample sizes of N = 1,000, 1,500, and 2,000. Because there is specific interest in being able to detect interacting loci in the absence of marginal effects, we considered simulation conditions with and without marginal effects present. We examined four specific values of γ₁₂ [log(1.5), log(2.0), log(2.5), log(3.0)] in our simulations, and ran 500 replications for each unique set of conditions, reporting empirical power at an α level of 0.05. Figure 3 presents our findings for all of our score-based tests along with the SNP-SNP, PC-LR, and KCCA approaches under these simulation conditions. The results show that when marginal effects are present, the various score tests generally perform best, especially at lower values of γ₁₂. When marginal effects were absent, KCCA and the global test had the highest power at lower effect sizes as well. Interestingly, the marginal tests indicate power levels above the type I error rate despite no marginal effects being explicitly modeled.

**Figure 3**

Empirical power curves (α = 0.05) as a function of interaction effect size exp(γ₁₂), for the global, marginal, interaction, and composite tests, along with SNP-SNP logistic regression, PC-LR, and KCCA methods. Results are shown with marginal effects present for sample sizes (A) N = 1,000, (B) N = 1,500, and (C) N = 2,000, and with marginal effects absent for sample sizes (D) N = 1,000, (E) N = 1,500, and (F) N = 2,000.

In all simulations, the SNP-SNP approach tended to be best (or at least competitive) when the interaction effect size was most extreme, regardless of whether or not marginal effects were present. This corroborates previous findings that have found SNP-SNP methods to be competitively powerful when the gene-level interaction is isolated to a single pair of SNPs [He et al., 2011; Li and Cui, 2010].

We also considered an additional set of simulations where two pairs of interacting SNPs were present across genes, and values of $\gamma_{lm}$ were randomly sampled uniformly from the interval [log(1.5), log(2.0)]. All other simulation conditions were the same as previously defined, and 1,000 replications were run per unique set of conditions. A barplot of these results can be found in Figure 4. These findings indicate that even in the absence of marginal effects, the global test is the most powerful approach for identifying the presence of interaction. The interaction and composite tests were relatively close in their empirical power, and performed similarly to the SNP-SNP testing. The KCCA approach performed comparably to the previously mentioned tests when no marginal effects were present, but was less powerful when marginal effects were included.

**Figure 4**

Barplot of empirical power results (α = 0.05) for hypothesis testing when the number of causal SNPs per gene is two, where interaction effects $\gamma_{lm}$ are uniformly drawn from [log(1.5), log(2.0)]. Results are presented for sample sizes of N = 1,000, 1,500, and 2,000, with marginal effects either present (Marg = T) or absent (Marg = F).

It is important to note that under all simulations, the interaction test was more powerful than the composite test regardless of the inclusion of marginal effects.

Discussion

Gene-gene interactions are becoming an increasingly common component of genomic association analysis. Increasing GWAS chip sizes, imputation, and next-generation sequencing platforms will continue to increase the number of genotyped intragenic SNPs, and the need for computationally efficient strategies for exploratory interaction analysis among loci has grown in response. In this paper, we have detailed a comprehensive approach toward detecting the presence of genetic effects, specifically gene-gene interactions, for case-control genetic association studies. We have devised a global test for detecting the presence of gene-level associations via kernel matrix representations of marker data. Using a simulation study based upon realistic genotype data, we have demonstrated that it is a powerful approach toward detecting the presence of both main and interaction effects of gene-level risk association. By adapting the work of Li and Cui [2012] for quantitative traits to binary traits using GLMMs, we have also defined a score test, the composite test, for detecting gene-gene interactions after adjusting for main effects.

As Figures 3 and 4 indicate, the global test is a powerful approach toward detecting gene-gene interactions even in the absence of marginal effects. Given that the global test only requires fitting a single null regression model, it is a computationally attractive screening procedure for possible interactions and can rapidly be implemented in a genome-wide analysis. Subsequent testing performed on significant findings can then be applied to identify the particular architecture of the genetic association. We also found that marginal tests result in significant findings despite the exclusion of marginal effects from our simulations. Although lower powered than the global test, conducting solely marginal tests (SKAT) could be an effective alternative strategy in contrast to the testing burden of exhaustive pairwise exploratory analysis.
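The screening idea in the preceding paragraph can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an intercept-only logistic null model (whose fitted probability is simply the case fraction), a centered linear kernel in place of the kernel used in the paper, and a generic SKAT-type quadratic form as the score statistic. The names `null_residuals`, `linear_kernel`, `pair_kernel`, and `score_statistic` are hypothetical, and p-value computation is omitted. The point shown is that the null residuals are computed once and reused for every gene pair.

```python
from itertools import combinations

import numpy as np


def null_residuals(y: np.ndarray) -> np.ndarray:
    """Residuals under an intercept-only logistic null model.

    With no covariates, the fitted probability is the case fraction,
    so no iterative model fitting is needed (an assumption of this sketch).
    """
    return y - y.mean()


def linear_kernel(G: np.ndarray) -> np.ndarray:
    """Centered linear kernel; a stand-in for the paper's kernel choice."""
    Gc = G - G.mean(axis=0)
    return Gc @ Gc.T


def pair_kernel(K1: np.ndarray, K2: np.ndarray) -> np.ndarray:
    """Main-effect kernels plus their elementwise (Hadamard) product."""
    return K1 + K2 + K1 * K2


def score_statistic(resid: np.ndarray, K: np.ndarray) -> float:
    """Generic quadratic-form score statistic r' K r."""
    return float(resid @ K @ resid)


rng = np.random.default_rng(1)
N = 200
y = rng.binomial(1, 0.5, size=N).astype(float)

# Hypothetical genes, each with 10 independently sampled SNP genotypes.
kernels = {
    name: linear_kernel(rng.binomial(2, 0.3, size=(N, 10)).astype(float))
    for name in ("gene_a", "gene_b", "gene_c")
}

resid = null_residuals(y)  # fitted once, shared by every pairwise test
stats = {
    (a, b): score_statistic(resid, pair_kernel(kernels[a], kernels[b]))
    for a, b in combinations(sorted(kernels), 2)
}
for pair, q in stats.items():
    print(pair, round(q, 2))
```

In a genome-wide setting with covariates, the null model would need to be fitted by logistic regression, but that fit would still be shared across all gene pairs.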

As per Table 2, the interaction test (Q₃) does not incur any quantifiable bias when multiple SNPs with true marginal effects are present in the simulation model. Although the included simulations are restricted to a relatively small number of total SNPs per gene as well as marginal effects of modest size, this is a surprising result that raises the question of whether the interaction test can be used as a proxy for the composite test. More surprising is that the interaction test is more powerful than the composite test in all of our simulations. Although we refrain from recommending that the composite test be abandoned for the interaction test, it is a computationally appealing prospect that warrants further investigation.

With increasing numbers of polymorphic sites being either genotyped or imputed in association studies, computational burden is of particular importance, especially relative to SNP-level testing. For example, on a modern workstation with an Intel® Core™ i5 3.10 GHz processor and 4 GB of RAM, running all possible pairwise SNP-SNP tests for our simulation required 7.914 sec per simulation replication when $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0121$ , whereas running the global score test required only 2.595 sec. This discrepancy grows as the SNP-level testing burden increases, because exhaustive SNP-SNP analyses scale poorly with the number of included SNPs. If we consider a simple data simulation where genotypes are independently sampled from a binomial distribution, and set the number of genotyped SNPs per gene to 100, the compute times for exhaustive SNP-SNP testing and the global test are 236.54 and 22.00 sec, respectively. It is important to note, however, that the computational burden of the kernel-based tests scales largely with sample size N, as larger samples require decomposition of larger kernel Gram matrices. Compute times for the SNP-SNP tests and the global test when $urn:x-wiley:07410395:media:gepi21749:gepi21749-math-0122$ on our COSI simulation data are 12.123 and 34.044 sec, respectively. This burden can be mitigated with various strategies, including low-rank decompositions [Bach and Jordan, 2005], which could significantly reduce computational times. More work is necessary to explore the utility of these approaches.
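To illustrate why low-rank structure can help with large Gram matrices, the sketch below uses a simple exact identity rather than the predictive low-rank decomposition of Bach and Jordan: for a linear kernel K = GGᵀ built from an N × p genotype matrix G, the nonzero eigenvalues of the N × N matrix K equal the eigenvalues of the much smaller p × p matrix GᵀG. The linear kernel, the simulated binomial genotypes, and the function name `linear_kernel_eigenvalues` are assumptions made for illustration; kernels that are not of this product form would need a genuine approximation.

```python
import numpy as np


def linear_kernel_eigenvalues(G: np.ndarray) -> np.ndarray:
    """Nonzero eigenvalues of K = G G^T, computed from the p x p matrix G^T G.

    Cost grows with the number of SNPs p rather than the sample size N.
    """
    small = G.T @ G
    return np.sort(np.linalg.eigvalsh(small))[::-1]


rng = np.random.default_rng(0)
N, p = 500, 20

# Genotypes sampled independently from a binomial distribution, then centered.
G = rng.binomial(2, 0.3, size=(N, p)).astype(float)
G -= G.mean(axis=0)

# Direct route: decompose the full N x N Gram matrix and keep the top p values.
K = G @ G.T
full = np.sort(np.linalg.eigvalsh(K))[::-1][:p]

# Low-rank route: decompose the p x p matrix instead.
fast = linear_kernel_eigenvalues(G)

print(np.allclose(full, fast))
```

Because the null distribution of a quadratic-form statistic is typically driven by such eigenvalues, working with the smaller matrix is one way the dependence on N might be reduced when p is much smaller than N.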

Even with computationally efficient implementations of our gene-level interaction tests, exhaustive pairwise analysis of a genome with 25,000 genes would require $\binom{25{,}000}{2} \approx 3.1 \times 10^{8}$ hypothesis tests, which is generally infeasible with respect to both computational and multiple testing burdens. Efficient strategies for implementing agnostic genome-wide analysis thus should be dependent in part on prior functional information. One strategy would be to utilize protein-protein interaction (PPI) databases to define a body of potential gene-gene interaction pairs, greatly reducing the testing space. For example, we downloaded the protein interaction network analysis [Wu et al., 2009] PPI dataset for binary interactions in Homo sapiens (accessed February 2013). This information was reduced to the gene level (HUGO designation) and redundant pairs were removed. This resulted in 106,004 unique gene pairs between 14,784 individual genes, a substantially reduced testing multiplicity. Stricter inclusion criteria, such as experimental validation, can further reduce this testing set.
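The reduction in testing multiplicity described above can be made concrete with a short calculation. The sketch below counts exhaustive gene pairs, compares Bonferroni-style per-test thresholds at a family-wise level of 0.05 (the choice of Bonferroni correction and of 0.05 is an assumption for illustration, not a recommendation from the text), and shows one way redundant pairs might be collapsed. The gene labels and the decision to drop self-pairs are hypothetical.

```python
from math import comb


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level controlling family-wise error at alpha."""
    return alpha / n_tests


def unique_gene_pairs(edges):
    """Collapse (gene_a, gene_b) records to unique unordered pairs.

    Self-pairs are dropped here, which is an assumption of this sketch.
    """
    pairs = set()
    for a, b in edges:
        if a != b:
            pairs.add(tuple(sorted((a, b))))
    return pairs


n_genes = 25_000
exhaustive = comb(n_genes, 2)  # all unordered gene pairs
ppi_pairs = 106_004            # unique PPI-derived pairs reported in the text

print(exhaustive)                        # 312487500
print(bonferroni_threshold(exhaustive))  # per-test level, exhaustive search
print(bonferroni_threshold(ppi_pairs))   # per-test level, PPI-restricted search
print(round(exhaustive / ppi_pairs))     # approximate fold reduction in tests

# Hypothetical interaction records with a reversed duplicate and a self-pair.
toy_edges = [("G1", "G2"), ("G2", "G1"), ("G3", "G4"), ("G3", "G3")]
print(sorted(unique_gene_pairs(toy_edges)))  # [('G1', 'G2'), ('G3', 'G4')]
```

Note that a Bonferroni threshold ignores the dependence among tests that share a gene, so it serves only as a conservative reference point.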

Although there are a number of benefits to gene-level testing, questions remain as to how to interpret replicability of specific findings, because it is possible different sets of interacting SNPs may yield the same significant gene pair. This requires a paradigm shift in how gene-level association is considered relative to individual SNPs, being more akin to gene-set types of analyses. Moreover, special considerations will be necessary for multiple testing, because there is a clear issue of dependence among test statistics where a given gene is a member of multiple gene pairs being evaluated. Additional work is necessary to evaluate the effects of such dependence on multiple testing correction.

Power analysis for multilocus approaches, such as gene-level testing, is complicated by a number of factors, including the quantity of total and interacting SNPs, their respective MAFs, overall LD structure of the genotyped SNPs themselves, and underlying models of epistasis [Marchini et al., 2005]. Although our random selection of causal SNPs in our simulations averages over a number of these factors, our simulations are by no means exhaustive and systematic influences on power will remain. The kernel function itself may also impact statistical power, as the polygenic kernel is just one of many possible options and alternative selections may behave differently from our findings. Although it is not within the scope of this paper to investigate the impact of the kernel function itself, we acknowledge that strategic kernel selection may impact hypothesis-testing performance. Influence of kernel selection under differing epistatic models is a focus of future work, particularly with respect to its comparative performance with KCCA, which is specifically capable of nonlinear interaction detection.

Although we have presented this work strictly within the context of a dichotomous trait, we note that the theoretical adaptation of our approach from SPA3G could be modified to account for any non-Gaussian response with a presumed exponential family distribution with little difficulty. We also foresee this testing framework being expanded to address pathway analysis applications and higher order interactions through linear combinations of gene-level kernel matrices and their Hadamard products.
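One possible reading of the extension to higher order interactions is sketched below: gene-level kernels are combined by summing all Hadamard (elementwise) products of up to a chosen order. By the Schur product theorem, the Hadamard product of positive semidefinite matrices is positive semidefinite, so a nonnegative combination of such terms remains a valid kernel. The unit weights, the centered linear kernel, and the function names `linear_kernel` and `interaction_kernel` are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations

import numpy as np


def linear_kernel(G: np.ndarray) -> np.ndarray:
    """Centered linear kernel scaled by the number of SNPs."""
    Gc = G - G.mean(axis=0)
    return Gc @ Gc.T / G.shape[1]


def interaction_kernel(kernels, max_order: int = 2) -> np.ndarray:
    """Sum of Hadamard products over all gene subsets up to max_order.

    Order 1 terms are main effects, order 2 terms are pairwise
    interactions, and so on. All terms receive unit weight here.
    """
    n = kernels[0].shape[0]
    total = np.zeros((n, n))
    for order in range(1, max_order + 1):
        for subset in combinations(kernels, order):
            term = np.ones((n, n))
            for K in subset:
                term = term * K  # elementwise (Hadamard) product
            total += term
    return total


rng = np.random.default_rng(2)
N = 50
snps_per_gene = (8, 12, 6)  # hypothetical gene sizes
kernels = [
    linear_kernel(rng.binomial(2, 0.3, size=(N, p)).astype(float))
    for p in snps_per_gene
]

K3 = interaction_kernel(kernels, max_order=3)

# Numerical check that the combined kernel is symmetric and PSD.
print(np.allclose(K3, K3.T))
print(np.linalg.eigvalsh(K3).min() > -1e-8)
```

In practice the relative weighting of main-effect and interaction terms would correspond to separate variance components, which this sketch does not model.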

Acknowledgments

This research was supported by the U.S. Public Health Service, National Institutes of Health, contract number GM065450. We also thank the anonymous reviewers for their constructive comments. The authors declare no conflict of interest.

References

Aronszajn N. 1950. Theory of reproducing kernels. Trans Am Math Soc 68(4): 337–404.
Aschard H, Chen J, Cornelis MC, Chibnik LB, Karlson EW, Kraft P. 2012. Inclusion of gene-gene and gene-environment interactions unlikely to dramatically improve risk prediction for complex diseases. Am J Hum Genet 90(6): 962–972.
10.1016/j.ajhg.2012.04.017
Bach FR, Jordan MI. 2005. Predictive low-rank decomposition for kernel methods. Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany: ACM. p 33–40.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N. 2010. Using principal components of genetic variation for robust and powerful detection of gene-gene interactions in case-control and case-only studies. Am J Hum Genet 86(3): 331–342.
10.1016/j.ajhg.2010.01.026
Bonetta L. 2010. Protein-protein interactions: interactome under construction. Nature 468(7325): 851–854.
10.1038/468851a
Breslow NE, Clayton DG. 1993. Approximate Inference in Generalized Linear Mixed Models. J Am Stat Assoc 88(421): 9–25.
10.1080/01621459.1993.10594284
Cordell HJ. 2009. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10(6): 392–404.
10.1038/nrg2579
Greenland S. 2009. Interactions in epidemiology: relevance, identification, and estimation. Epidemiology 20(1): 14–17.
10.1097/EDE.0b013e318193e7b5
He J, Wang K, Edmondson AC, Rader DJ, Li C, Li MY. 2011. Gene-based interaction analysis by incorporating external linkage disequilibrium information. Eur J Hum Genet 19(2): 164–172.
10.1038/ejhg.2010.164
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106(23): 9362–9367.
10.1073/pnas.0903103106
Howard TD, Koppelman GH, Xu JF, Zheng SQL, Postma DS, Meyers DA, Bleecker ER. 2002. Gene-gene interaction in asthma: IL4RA and IL13 in a Dutch population with asthma. Am J Hum Genet 70(1): 230–236.
10.1086/338242
Imhof JP. 1961. Computing the distribution of quadratic forms in normal variables. Biometrika 48(3/4): 419–426.
10.2307/2332763
Jorgenson E, Witte JS. 2006. A gene-centric approach to genome-wide association studies. Nat Rev Genet 7(11): 885–891.
10.1038/nrg1962
Kempthorne O. 1954. The correlation between relatives in a random mating population. Proc R Soc Lond B 143(910): 103–113.
10.1098/rspb.1954.0056
Kershaw DS. 1978. Incomplete Cholesky-conjugate gradient method for iterative solution of systems of linear equations. J Comput Phys 26(1): 43–65.
10.1016/0021-9991(78)90098-0
Kwee LC, Liu D, Lin X, Ghosh D, Epstein MP. 2008. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 82(2): 386–397.
10.1016/j.ajhg.2007.10.010
Larson NB, Jenkins GD, Larson MC, Vierkant RA, Sellers TA, Phelan CM, Schildkraut JM, Sutphen R, Pharoah PP, Gayther SA and others. 2013. Kernel canonical correlation analysis for assessing gene-gene interactions and application to ovarian cancer. Eur J Hum Genet, doi: 10.1038/ejhg.2013.69
10.1038/ejhg.2013.69
Li S, Cui Y. 2012. Gene-centric gene-gene interaction: a model-based kernel machine method. Ann Appl Stat 6(3): 1134–1161.
10.1214/12-AOAS545
Li Z, Zhang Y, Wang Z, Chen J, Fan J, Guan Y, Zhang C, Yuan C, Hong W, Wang Y and others. 2012. The role of BDNF, NTRK2 gene and their interaction in development of treatment-resistant depression: data from multicenter, prospective, longitudinal clinic practice. J Psychiatr Res 47(1): 8–14.
10.1016/j.jpsychires.2012.10.003
Lin XH. 1997. Variance component testing in generalised linear models with random effects. Biometrika 84(2): 309–326.
10.1093/biomet/84.2.309
Liu DW, Ghosh D, Lin XH. 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform 9: 292.
10.1186/1471-2105-9-292
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A and others. 2009. Finding the missing heritability of complex diseases. Nature 461(7265): 747–753.
10.1038/nature08494
Marchini J, Donnelly P, Cardon LR. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37(4): 413–417.
10.1038/ng1537
Moore JH. 2003. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered 56(1–3): 73–82.
10.1159/000073735
Moore JH, Ritchie MD. 2004. STUDENTJAMA. The challenges of whole-genome approaches to common diseases. J Am Med Assoc 291(13): 1642–1643.
10.1001/jama.291.13.1642
Moore JH, Williams SM. 2002. New strategies for identifying gene-gene interactions in hypertension. Ann Med 34(2): 88–95.
10.1080/07853890252953473
Moore JH, Williams SM. 2009. Epistasis and its implications for personal genetics. Am J Hum Genet 85(3): 309–320.
10.1016/j.ajhg.2009.08.006
Neale BM, Sham PC. 2004. The future of association studies: gene-based analysis and replication. Am J Hum Genet 75(3): 353–362.
10.1086/423901
Schaid DJ. 2010a. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 70(2): 109–131.
10.1159/000312641
Schaid DJ. 2010b. Genomic similarity and kernel methods II: methods for genomic information. Hum Hered 70(2): 132–140.
10.1159/000312643
Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. 2005. Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet 76(5): 780–793.
10.1086/429838
Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SL, Peyser PA, Lin X. 2012. SNP set association analysis for familial data. Genet Epidemiol 36(8): 797–810.
10.1002/gepi.21676
Schölkopf B, Smola A. 2002. Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge: The MIT Press.
Sima X, Xu J, Li Q, Luo L, Liu J, You C. 2012. Gene-gene interactions between interleukin-12A and interleukin-12B with the risk of brain tumor. DNA Cell Biol 31(2): 219–223.
10.1089/dna.2011.1331
Venables WN, Ripley BD. 2002. Modern Applied Statistics With S. New York: Springer.
10.1007/978-0-387-21706-2
Wang K, Abbott D. 2008. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol 32(2): 108–118.
10.1002/gepi.20266
Wu J, Vallenius T, Ovaska K, Westermarck J, Makela TP, Hautaniemi S. 2009. Integrated network analysis platform for protein-protein interactions. Nat Methods 6(1): 75–77.
10.1038/nmeth.1282
Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X. 2010. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 86(6): 929–942.
10.1016/j.ajhg.2010.05.002
Wu MC, Lee S, Cai TX, Li Y, Boehnke M, Lin XH. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89(1): 82–93.
10.1016/j.ajhg.2011.05.029
Yuan ZS, Gao QS, He YG, Zhang XS, Li FY, Zhao JH, Xue FZ. 2012. Detection for gene-gene co-association via kernel canonical correlation analysis. BMC Genet 13: 83.
10.1186/1471-2156-13-83

Volume 37, Issue 7

November 2013

Pages 695-703