Research Article

Strategies for Developing Prediction Models From Genome-Wide Association Studies

Corresponding Author

Jincao Wu

Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America

Correspondence to: Jincao Wu, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD 20850, USA. E-mail: [email protected]

Ruth M. Pfeiffer

Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America

Mitchell H. Gail

Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America

First published: 25 October 2013

https://doi.org/10.1002/gepi.21762

Citations: 19

Abstract

Genome-wide association studies (GWASs) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with complex human diseases. However, risk prediction models based on them have limited discriminatory accuracy. It has been suggested that including many such SNPs can improve predictive performance. Here, we studied various aspects of model building to improve discriminatory accuracy, as measured by the area under the receiver operating characteristic curve (AUC), including: (1) How well does a one-phase procedure that selects SNPs and estimates odds ratios on the same data perform? (2) How should training data be allocated between SNP selection (Phase 1) and estimation (Phase 2) in a two-phase procedure? (3) Should SNP selection be based on P-value thresholding or ranking P-values? (4) How many SNPs should be selected? and (5) Is multivariate estimation preferred to univariate estimation in the presence of linkage disequilibrium (LD)? We used realistic estimates of the distributions of genetic effect sizes, allele frequencies, and LD patterns based on GWAS data for Crohn's disease and prostate cancer. Theory and simulations were used to estimate AUC. Empirical risk models based on 10,000 cases and controls had considerably lower AUC than theoretically achievable. The most critical aspect of prediction model building was initial SNP selection. The single-phase procedure achieved higher AUC than the two-phase procedure. Multivariate estimation did not perform as well as univariate (marginal) estimation. For complex diseases and samples of 10,000 or fewer cases and controls, one should limit the number of SNPs to tens or hundreds.

Introduction

Case-control genome-wide association studies (GWASs) for complex diseases have identified many single nucleotide polymorphisms (SNPs) that are associated with disease. However, risk models based on such SNPs have had only modest discriminatory accuracy as measured by the area under the receiver operating characteristic curve (AUC) (e.g. Jostins and Barrett [2011]). Two lines of evidence suggest that SNPs could provide much more predictive information, if only one could tap into the “missing heritability” suggested by phenotypic familial correlations. Purcell et al. [2009] and Stahl et al. [2012] showed that risk scores based on large numbers of SNPs could explain more heritability than those based on a small number of rigorously confirmed SNPs. Studies of correlations in GWAS data showed that all SNPs together account for much more heritability than the rigorously validated SNPs [Lee et al., 2011; Yang et al., 2010].

However, it is not possible to take advantage of this potential information to build discriminating risk models unless the sample size of the GWAS data for model building is large enough. Wray et al. [2007] used information on genetic architecture to conclude that half the heritability could be captured with 10,000 cases and controls, whereas Chatterjee et al. [2013] found that hundreds of thousands were needed for complex diseases. Even if the largest GWAS to date were tripled in size, the foreseeable AUC values based on the additional SNPs for many cancers would remain modest [Park et al., 2012], because most of the SNPs with the largest odds ratios have already been detected.

Feasible sample sizes are more modest for many complex diseases. Therefore we study model-building for sample sizes $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0001$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0002$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0003$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0004$, based on distributions of log odds ratios per allele developed for Crohn's disease and prostate cancer. We show that these distributions are consistent with heritability estimates on the liability scale [Chatterjee et al., 2013; Lee et al., 2011] and use them to investigate a number of facets of model-building. These include: (1) How well does a one-phase procedure that uses all the data to select SNPs and to estimate their effects perform, compared to a two-phase procedure that eliminates bias from the “winner's curse” [Zöllner and Pritchard, 2007] by selecting SNPs in Phase 1 and estimating effects from independent Phase 2 data? (2) What proportion of the data should be allocated to Phase 1 in the two-phase procedure? (3) Should one select SNPs by using a P-value threshold or by ranking P-values? (4) How many SNPs should be selected? (5) Does univariate or multivariate estimation perform better? (6) Should one filter out highly correlated SNPs? and (7) Do findings for AUC also hold for the probability of correct classification [Liu et al., 2012]? Unlike most previous work, we evaluate these strategies both in the presence and absence of linkage disequilibrium (LD). We also compare our results for Crohn's disease with those of Kooperberg et al. [2010], who used more complex modeling strategies on the same data.

Risk Model and Two-Phase Procedure

Risk Model

We assume that a large GWAS case-control study yields data on N SNP genotypes. Let $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0005$ denote the disease status of subject j ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0006$ or 1), and let $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0007$ be the number of minor alleles for SNP i, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0008$ for subject j, with minor allele frequency (MAF) $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0009$ . The vector of all SNP genotypes for subject j is $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0010$ . We let the first M of the N SNPs be truly disease associated, and assumed that the disease risk in the source population is given by

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0011$ (1)

where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0012$, μ is the intercept in the source population and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0013$ is the log odds ratio for SNP i. The SNPs in equation 1 can also be in LD with one another or with other SNPs.

Development (Phases 1 and 2) and Validation of Risk Model

In Phase k, there are $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0014$ cases and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0015$ controls, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0016$. The training data for Phases 1 and 2 consist of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0017$ cases and n controls. The proportions $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0018$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0019$ are allocated to Phases 1 and 2, respectively. We use the term “training data” to distinguish the n cases and n controls used to build risk models from the independent validation data. We varied λ and other parameters to maximize the AUC, which is estimated with independent nontraining data.

Phase 1: SNP Selection

For each SNP i we used a marginal logistic regression model

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0020$ (2)

to obtain an estimate $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0021$ and the corresponding Wald statistic, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0022$ for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0023$. For all non-disease-associated SNPs, the asymptotic distribution of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0024$ is a central $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0025$ distribution, and for disease-associated SNPs, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0026$ has a noncentral $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0027$ distribution, where the noncentrality parameter depends on $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0028$, on the MAF $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0029$, and on the sample size n₁ [Gail et al., 2008]. Required variances are in Appendix A.1.1. We also computed $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0030$, where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0031$ is the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0032$ quantile of a central $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0033$ distribution.

We selected disease-associated SNPs based either on P-value thresholding or on ranking of P-values. For P-value thresholding, we selected the ith SNP if its P-value was less than a prespecified cut-off. With ranking, all SNPs corresponding to the smallest T P-values were selected [Gail et al., 2008]. With P-value thresholding, a variable number, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0034$ , of SNPs were selected, while ranking led to exactly $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0035$ selected SNPs.
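The two selection rules can be illustrated with a minimal Python sketch. It assumes that each SNP already has a marginal estimate and standard error, and that the P-value is the two-sided tail probability of a Wald test with one degree of freedom; the function names and toy numbers are hypothetical and are not taken from the study.

```python
import math

def wald_p_value(beta_hat, se):
    """Two-sided P-value for a Wald test of beta = 0 (1-df chi-square)."""
    z = abs(beta_hat / se)
    # P(|Z| > z) for a standard normal Z equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

def select_by_threshold(p_values, alpha):
    """Indices of SNPs with P-value below a prespecified cut-off (variable count)."""
    return [i for i, p in enumerate(p_values) if p < alpha]

def select_by_rank(p_values, top_t):
    """Indices of the top_t SNPs with the smallest P-values (fixed count)."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(order[:top_t])

# Hypothetical (estimate, standard error) pairs for four SNPs.
estimates = [(0.30, 0.05), (0.02, 0.05), (-0.20, 0.05), (0.08, 0.05)]
p_values = [wald_p_value(b, s) for b, s in estimates]

print(select_by_threshold(p_values, 1e-3))  # SNPs passing the cut-off
print(select_by_rank(p_values, 3))          # always exactly three SNPs
```

Thresholding returns a data-dependent number of SNPs, whereas ranking fixes the number in advance, which is the distinction drawn in the text.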

Phase 2: Estimating Log Odds Ratios

Let $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0036$ index the SNPs selected in Phase 1. We used independent Phase 2 data to estimate $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0037$ for the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0038$ SNPs and fitted both multivariate logistic regression,

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0039$ (3)

and marginal univariate logistic regressions,

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0040$ (4)

We denote the estimates from either model 3 or 4 by $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0041$ . In the presence of LD, we excluded some highly correlated SNPs (Appendix B).

Model Validation: Score Computation and AUC Estimation

To estimate the AUC, we computed the following score for each subject j in the independent validation dataset,

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0042$ (5)

The AUC is the probability that the score S₁ for a randomly selected case exceeds the score S₀ for a randomly selected control, or more precisely $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0043$ . We estimated AUC nonparametrically from the Mann–Whitney statistic. In independent validation data, we always set n₃ = 400 cases and n₃ = 400 controls, which results in $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0044$ = 0.0004 (Appendix A.1.2).
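A minimal Python sketch of the nonparametric estimate follows. It assumes the usual convention that ties between a case score and a control score count one half; the function name and toy scores are hypothetical.

```python
def auc_mann_whitney(case_scores, control_scores):
    """Mann-Whitney estimate of P(S1 > S0) + 0.5 * P(S1 = S0)."""
    wins = 0.0
    for s1 in case_scores:
        for s0 in control_scores:
            if s1 > s0:
                wins += 1.0
            elif s1 == s0:
                wins += 0.5  # assumed tie convention
    return wins / (len(case_scores) * len(control_scores))

# Toy validation scores: two cases and two controls, with one tie.
print(auc_mann_whitney([3.0, 2.0], [1.0, 2.0]))
```

With 400 cases and 400 controls the double loop makes 160,000 comparisons, so the direct pairwise form is adequate; a rank-based implementation would give the same value.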

Simulating Case-Control Data to Estimate the AUC

Simulating a Source Population With Joint SNP Genotypes in the Presence of LD

To generate case-control data, we applied model 1 to a source population of individuals with joint SNP genotypes. To simulate that source population, we used the distribution of genotypes among controls from the Wellcome Trust Case Control Consortium (WTCCC) study of Crohn's disease after imputation of missing SNP genotypes [Kooperberg et al., 2010].

Dr. Kooperberg kindly provided us with the data on 2,938 controls, each with complete genotypes on 333,187 SNPs. The corresponding MAF distribution had mean 0.260 and standard deviation 0.13. The MAF for each SNP was regarded as fixed and equal to that in the 2,938 controls. To preserve LD, we generated a subject from the source population by independently sampling the joint SNP genotypes for each of the 22 pairs of autosomes. The joint SNP genotypes for each pair of autosomes were sampled with replacement from its set of 2,938 joint SNP genotypes in controls.
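The chromosome-wise resampling can be sketched in Python as follows. The data layout (a list of autosome pairs, each holding one genotype vector per control) and the function name are assumptions made for illustration; the toy input uses two chromosomes and three controls rather than 22 and 2,938.

```python
import random

def sample_subject(control_genotypes, rng):
    """Build one source-population subject.

    control_genotypes[c][k] is the genotype vector (minor-allele counts)
    of control k on autosome pair c. A donor is drawn independently, with
    replacement, for each autosome pair, so LD within a chromosome is kept
    while different chromosomes are combined independently.
    """
    subject = []
    for chrom_blocks in control_genotypes:
        donor = rng.randrange(len(chrom_blocks))
        subject.extend(chrom_blocks[donor])
    return subject

# Toy data: 2 autosome pairs, 3 controls.
toy_controls = [
    [[0, 1, 2], [1, 1, 0], [2, 0, 0]],  # autosome pair 1 (3 SNPs)
    [[0, 0], [1, 2], [2, 2]],           # autosome pair 2 (2 SNPs)
]
rng = random.Random(2013)
subject = sample_subject(toy_controls, rng)
print(subject)
```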

Distributions of Log Odds Ratios for Truly Disease-Associated SNPs

To use model 1, we needed to assign $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0045$ values to the disease-associated SNPs; other SNPs had $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0046$. Under LD, we assigned disease-associated SNPs at random to the locations $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0047$. We summed only over those disease-associated SNPs in equation 1. We used unpublished parameter estimates provided by Dr. Ju-Hyun Park (personal communication) for realistic distributions of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0048$ derived from data used in Figure 2 of Park et al. [2011]. For Crohn's disease, Dr. Park provided a range (−0.058, 0.058) within which nonnull $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0049$ cannot be detected (power less than 1%) by the largest available GWAS datasets; SNPs are said to be “unobservable” in this interval. The distribution of the observed $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0050$ was well fitted by a two-component normal mixture $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0051$, where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0052$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0053$, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0054$. Dr. Park provided these variance formulas, conditional on minor allele frequency, from data in Park et al. [2011]. To allow for the possibility that there are more SNPs in the unobservable range than implied by the two-component mixture, we also considered a model that includes a third normal component in the unobservable range with mixture probability $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0055$. The resulting mixed density is $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0056$ where now $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0057$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0058$, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0059$ was chosen to assure that most (99.7%) of the mass of the third component was in the unobservable range. Supplementary Fig. S1 of the Appendix shows the marginal densities of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0060$ after averaging over $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0061$.

Park et al. [2010] estimated the number of disease-associated SNPs, M₀, that will be found in observable ranges in future large GWAS. The probability $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0062$ that a disease-associated SNP is in the observable range corresponds to the integral of the density outside the vertical lines in Supplementary Fig. S1. Thus the total number of disease-associated SNPs, M, can be estimated from $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0063$ . To compute $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0064$ and hence M, we averaged the conditional probabilities given $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0065$ over 10⁶ random draws of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0066$ .

Analogous methods and results are given for prostate cancer in Appendix C and Supplementary Fig. S1 in the Appendix.

The distributions of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0067$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0068$, together with M₀ and π₀, determine the true number of disease-associated SNPs, M, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0069$, as well as the maximal achievable AUC (Section on Estimating the AUC under Linkage Equilibrium). The quantity σ² determines the heritability on the liability scale, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0070$, and other properties (Appendices A and E and Supplementary Table S1 in the Appendix).

Estimating AUC Under LD From the Two-Phase Procedure

Having computed M, we assigned the disease-producing SNPs with their corresponding $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0071$ to random loci to create a genotype structure ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0072$ ), where only M of the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0073$ are nonzero. We preserved this genotype structure in simulating the cases and controls for Phases 1, 2 and independent validation data. An individual with joint genotypes was sampled from the source population (Section on Simulating a Source Population with Joint SNP Genotypes), and a Bernoulli outcome with disease probability given by equation 1 was generated. The intercept in equation 1 was chosen such that the probability of disease was 0.01 in the source population. This process was repeated until $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0074$ cases and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0075$ controls were identified.
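The sampling loop described above can be sketched in Python. To stay self-contained, this sketch makes two simplifying assumptions that differ from the procedure in the text: genotypes are drawn independently from Binomial(2, MAF), that is, under linkage equilibrium rather than from the LD-preserving source population, and the intercept is calibrated by bisection on a Monte Carlo sample of linear predictors. All function names and parameter values are hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_genotypes(mafs, rng):
    """Minor-allele counts drawn independently (linkage equilibrium assumed)."""
    return [(rng.random() < f) + (rng.random() < f) for f in mafs]

def calibrate_intercept(linear_predictors, prevalence):
    """Bisection for the intercept giving the target mean disease risk."""
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_risk = sum(expit(mid + lp) for lp in linear_predictors) / len(linear_predictors)
        if mean_risk < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def simulate_case_control(mafs, betas, n_per_group, prevalence, rng, n_calibration=5000):
    """Draw subjects until n_per_group cases and n_per_group controls are found."""
    calib = [sum(b * g for b, g in zip(betas, draw_genotypes(mafs, rng)))
             for _ in range(n_calibration)]
    mu = calibrate_intercept(calib, prevalence)
    cases, controls = [], []
    while len(cases) < n_per_group or len(controls) < n_per_group:
        g = draw_genotypes(mafs, rng)
        risk = expit(mu + sum(b * x for b, x in zip(betas, g)))
        diseased = rng.random() < risk  # Bernoulli disease outcome
        group = cases if diseased else controls
        if len(group) < n_per_group:
            group.append(g)
    return mu, cases, controls

rng = random.Random(1)
mu, cases, controls = simulate_case_control(
    mafs=[0.2, 0.3], betas=[0.4, -0.2], n_per_group=10, prevalence=0.05, rng=rng)
print(round(mu, 3), len(cases), len(controls))
```

Because disease is rare, most sampled subjects are controls and are discarded once the control quota is filled; the number of draws needed is driven by the case quota.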

In Phase 1, we used $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0076$ cases and n₁ controls to compute P-values from the marginal logistic regression model 2, and we selected all SNPs with P-values smaller than a set threshold. Our algorithm to remove SNPs in very high LD is in Appendix B.

Estimation in Phase 2 was based on n₂ cases and n₂ controls. The AUC was estimated based on the scores from equation 5 computed for each of the 400 cases and 400 controls in independent validation data (Section on Development (Phases 1 and 2) and Validation of Risk Model). The simulation flowchart is shown in Supplementary Figs. S2 and S3. To maximize the AUC for a given set of case-control data, we estimated the AUC over a grid of values of (P-value threshold, λ) or (T, λ) and chose the largest AUC estimate. The distribution of the maximum AUC was estimated from 20 such estimates from 20 independently sampled genotype structures ($urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0077$). To estimate the maximum theoretically achievable AUC, we replaced $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0078$ by the true values $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0079$ and used only the disease-associated loci in computing the score.

This simulation method can also be used to estimate AUC under linkage equilibrium, but we used a faster analytic method under linkage equilibrium (next section).

Estimating the AUC Under Linkage Equilibrium (Independence) for a Rare Disease

For rare diseases and linkage equilibrium, Gail et al. [2008] showed that conditional on ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0080$ ), SNP genotypes are independent in cases and controls if they are independent in the source population, and they gave the conditional genotype distribution for a SNP with given $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0081$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0082$ , both for cases and controls. Thus, the joint SNP genotype for each case or control could be obtained by drawing each SNP genotype independently, conditional on ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0083$ ). With linkage equilibrium, there is no need to remove correlated SNPs and one can use simulations as in the previous section to estimate AUC. However, we computed AUC by faster methods, based on asymptotic theory in Gail et al. [2008] instead. In unreported numerical studies, we showed that these faster methods yielded results in agreement with the simulation methods in the previous section.

Under linkage equilibrium, we can reorder SNPs so that the first M SNPs have $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0084$ while the remainder have $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0085$ , as in equation 1. Let $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0086$ if SNP i is selected in Phase 1 and 0 otherwise. We obtained Phase 2 estimates for the selected SNPs by drawing $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0087$ from a normal $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0088$ distribution for disease-associated SNPs and from $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0089$ for other SNPs, where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0090$ depends on n₂, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0091$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0092$ , and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0093$ depends on n₂ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0094$ (Appendix A.1.1). Let $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0095$ correspond to the effects for selected disease-associated SNPs and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0096$ the effects for all the other selected SNPs. In this notation the score for a person with genotypes $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0097$ is

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0098$ (6)

Conditional on $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0099$ , for large N and independent, bounded $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0100$ , the score is approximately normally distributed in cases and controls (Liapunov central limit theorem). The conditional moments of S in controls (S₀) and cases (S₁) are:

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0101$ (7)

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0102$ (8)

We calculated $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0103$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0104$ from the retrospective distribution of genotypes [Gail et al., 2008].

Conditional on $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0105$ , the AUC can be expressed in terms of the cumulative normal distribution function, and the unconditional AUC can be computed by averaging the conditional AUC over $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0106$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0107$ as

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0108$ (9)

where Φ denotes the cdf and ϕ the pdf of the standard normal distribution, and where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0109$ is the joint probability mass function of the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0110$ .
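For intuition, when the scores in cases and controls are treated as independent normal variables, the conditional AUC takes the standard binormal form Φ((μ₁ − μ₀)/√(σ₁² + σ₀²)). The Python sketch below evaluates only that standard expression for given moments; it does not reproduce equation 9 or its averaging over selection indicators and estimates, and the function name and numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist

def binormal_auc(mean_cases, var_cases, mean_controls, var_controls):
    """P(S1 > S0) for independent normal scores S1 (cases) and S0 (controls)."""
    shift = mean_cases - mean_controls
    return NormalDist().cdf(shift / math.sqrt(var_cases + var_controls))

# Equal distributions give AUC 0.5; a one-SD mean shift gives about 0.76.
print(binormal_auc(0.0, 1.0, 0.0, 1.0))
print(round(binormal_auc(1.0, 1.0, 0.0, 1.0), 4))
```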

The distribution of z can be obtained analytically when SNPs are selected based on P-value thresholding, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0111$, because the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0112$ are independent Bernoulli variates with parameters

$urn:x-wiley:07410395:gepi21762:equation:gepi21762-math-0113$ (10)

where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0114$ has a noncentral $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0115$ distribution with noncentrality parameter $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0116$. Here $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0117$ is computed like $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0118$ in Appendix A.1.1, but with n₁ replacing n₂. For non-disease-associated SNPs, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0119$. When SNPs are selected by ranking, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0120$ cannot be computed explicitly. Both for thresholding and ranking, we thus used a Monte Carlo integration algorithm to evaluate equation 9. The flowchart of the simulation procedure is in Supplementary Fig. S3, and the algorithm is in Appendix A.1.4.

To find the combination of (P-value threshold, λ) or (T, λ) that approximately maximized the AUC, we evaluated equation 9 over a grid of values of (P-value threshold, λ) or (T, λ). Independent Monte Carlo integrations were performed for each point on the grid and the point yielding the largest AUC value was determined. This process was repeated for 20 independently sampled genotype structures ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0121$ ) to learn about the distribution of the maximum AUC, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0122$ , and corresponding grid point.
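The grid search itself is simple and can be sketched in Python. The sketch assumes that a fraction λ of the n cases and n controls goes to Phase 1 and the rest to Phase 2, and it takes the AUC evaluator as a caller-supplied function; the names `split_training`, `best_grid_point`, and the toy objective `toy_auc` are hypothetical stand-ins, not the Monte Carlo integration used in the study.

```python
import math

def split_training(n, lam):
    """Number of cases (and of controls) allocated to Phases 1 and 2."""
    n1 = round(lam * n)
    return n1, n - n1

def best_grid_point(thresholds, lambdas, estimate_auc):
    """Return (auc, threshold, lambda) for the grid point with the largest AUC."""
    best = None
    for alpha in thresholds:
        for lam in lambdas:
            auc = estimate_auc(alpha, lam)
            if best is None or auc > best[0]:
                best = (auc, alpha, lam)
    return best

def toy_auc(alpha, lam):
    # Hypothetical smooth surface peaking at threshold 10^-3.5 and lambda 0.8.
    return 0.70 - 0.01 * abs(math.log10(alpha) + 3.5) - 0.10 * abs(lam - 0.8)

thresholds = [1e-2, 10 ** -3.5, 1e-5]
lambdas = [0.5, 0.8, 0.95]
auc, alpha, lam = best_grid_point(thresholds, lambdas, toy_auc)
print(round(auc, 3), alpha, lam, split_training(10000, lam))
```

When each grid value is itself a noisy estimate, the maximum over the grid tends to be slightly optimistic, which is one reason to summarize the maximum over several independent genotype structures as the text describes.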

To estimate the maximum theoretically achievable AUC, we used the true values ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0123$ ) to compute the mean and variance of the theoretically optimal score among cases, ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0124$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0125$ ), and among controls, ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0126$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0127$ ). The corresponding theoretically maximum AUC, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0128$ $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0129$ . $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0130$ was estimated as the average of 20 such estimates from 20 independent genotype structures ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0131$ ).

Single-Phase Model

In the single-phase model, we used all the training data both for the selection of SNPs and for estimation of effect sizes. Under LD we simulated genotype data as in the Section on Simulating a Source Population with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0132$ . We did not estimate $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0133$ in Phase 2, but instead used $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0134$ for SNP selection and to estimate AUC in independent validation data. Correlated SNPs were eliminated as in Appendix B. Under linkage equilibrium, instead of sampling the selection indicator $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0135$ directly, we drew $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0136$ from $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0137$ for all training data, where $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0138$ is a function of n₁ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0139$ . We calculated Wald statistics $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0140$ for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0141$ , drew $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0142$ from $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0143$ , and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0144$ for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0145$ . We set $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0146$ if $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0147$ and 0 otherwise. In the validation data, we used $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0148$ from Phase 1 to compute the AUC using formula 9 with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0149$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0150$ .

Simulation Results

Rare Disease and Linkage Equilibrium

We begin with the case of linkage equilibrium to facilitate comparisons under LD in the next section. Park et al. [2010] estimated M₀ = 142 observable SNPs for Crohn's disease and somewhat more than 66 for prostate cancer. Both estimates have wide ranges of uncertainty. For each disease, we investigated $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0151$ 1,000, 3,000, 10,000, and 100,000, the choice of SNP selection criterion (P-value thresholding vs. ranking), $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0152$ or 0.6, and various choices of M₀ to determine which of these factors affected $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0153$. To find $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0154$, we searched over the set of P-value thresholds: $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0155$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0156$. In figures, we index these thresholds on the log base 10 scale as $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0157$ P-values $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0158$. For example, the P-value threshold $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0159$ has index approximately 5.5. For ranking, we chose T from the set (1, 10, 30, 100, 200, 300, 1,000, 3,000). Note, for example, that if $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0160$ and all $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0161$, the expected number of SNPs selected is 30 for P-value threshold $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0162$. In either case, the allocation ratio λ was chosen from the set (0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95). Results in corresponding figures and tables are the means (with standard deviations) over 20 independent realizations of genotype structure ($urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0163$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0164$).

Figure 1 shows how $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0165$ varies with λ for various values of P-value threshold or T for Crohn's disease with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0166$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0167$ and for a single genotype structure ($urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0168$). For P-value thresholding, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0169$ is found for P-index = 3.5 (or P-threshold $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0170$) at $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0171$. It is apparent that the majority of the training data should be allocated to SNP selection (Phase 1). Less stringent thresholds or more stringent thresholds yielded lower AUC. SNP selection based on ranking yielded similar results, with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0172$ at T = 30 and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0173$.

Figure 1. Plots of AUC from equation 9 as a function of λ for different values of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0174$ or for different numbers of top-ranked SNPs, T, for one realization of genetic structure ($urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0175$) under linkage equilibrium. For Crohn's disease, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0176$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0177$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0178$, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0179$. For prostate cancer, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0180$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0181$, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0182$, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0183$.

Table 1 presents averages over 20 realizations of ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0184$ ) for Crohn's disease with P-value thresholding. The case $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0185$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0186$ corresponds to a heritability $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0187$ ranging from 0.094 for disease probability 0.001 to 0.279 for disease probability 0.05 (Supplementary Table S1). For this case, there were on average $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0188$ truly disease-associated SNPs. Of these, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0189$ were selected in Phase 1 with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0190$ , and of those selected, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0191$ were truly disease-associated SNPs on average. In 12 of the 20 genotype structures studied, the maximum occurred at (P-threshold = $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0192$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0193$ ), and in eight instances at (P-threshold = $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0194$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0195$ ). The $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0196$ increased from 0.591 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0197$ to 0.664 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0198$ to 0.740 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0199$ (Supplementary Table S2). Thus $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0200$ increased strongly with n. Larger samples led to the selection of more truly disease-associated SNPs on average ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0201$ for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0202$ and 774.1 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0203$ (Supplementary Table S2)). 
The $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0204$ = 0.740 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0205$ is very near the theoretical maximum, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0206$ = 0.756 (Supplementary Table S1). Similar results were found for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0207$ , which more than doubled the number of disease-associated SNPs, M, and increased the heritability by about 30% (Supplementary Table S1) but had little impact on the number of disease-associated SNPs selected or on AUC (Table 1). Most of the additional disease-associated SNPs had effects too small to detect with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0208$ , although the $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0209$ increased to 0.782 (Supplementary Table S1). With $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0210$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0211$ = 0.753 (Supplementary Table S2). Using ranking instead of P-value thresholding yielded very similar results (Supplementary Tables S3 and S4).

Table 1. Average maximal AUC from P-value thresholding for Crohn's disease over 20 independent replications of the entire experiment, each with its sampled vectors ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0212$ ), under linkage equilibrium. Maximization is over $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0213$ and P-value threshold. Parameters that were varied include $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0214$ , M₀ and π₀. Total number of SNPs was $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0215$ . M₀ denotes the number of disease SNPs in the observable range; M denotes the total number of disease SNPs; $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0216$ denotes the average number of SNPs selected and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0217$ denotes the average number of truly disease-associated SNPs selected. $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0218$ denotes the number of times one combination of (λ, P-value) maximizes the AUC in 20 independent replications of the entire experiment

		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0219$				$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0220$
		Two-phase model		Single-phase model		Two-Phase Model		Single-Phase Model
M₀		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0221$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0222$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0223$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0224$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0225$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0226$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0227$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0228$
100	AUC (sd)	0.560(0.016)	0.555(0.014)	0.570(0.017)	0.564(0.014)	0.616(0.014)	0.612(0.013)	0.622(0.013)	0.618(0.013)
	M (sd)	770.1 (0.605)	1,866.2 (1.20)	770.1 (0.605)	1,866.2 (1.20)	770.1(0.604)	1,866.2(1.20)	770.1 (0.605)	1,866.2 (1.20)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0229$ (sd)	10.9 (2.78)	11.2 (4.42)	7.46 (1.89)	6.75 (2.58)	38.0(9.74)	40.1(10.8)	27.7(3.23)	27.6(3.64)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0230$ (sd)	4.97 (1.58)	4.61 (1.41)	5.65 (1.71)	4.99 (1.64)	24.7(3.31)	25.5(5.12)	25.4(2.96)	25.6(3.64)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0231$ (λ, p)	17 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0232$	16 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0233$	17 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0234$	12 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0235$	10 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0236$	8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0237$	19 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0238$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0239$
		3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0240$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0241$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0242$	7 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0243$	6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0244$	8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0245$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0246$	-
		-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0247$	-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0248$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0249$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0250$	-	-
200	AUC (sd)	0.591(0.013)	0.591(0.017)	0.604(0.017)	0.591(0.017)	0.664(0.012)	0.662(0.016)	0.672(0.012)	0.670(0.016)
	M (sd)	1,539.6(1.10)	3,732.0(2.43)	1,539.6(1.10)	3,732.0(2.43)	1,539.6(1.10)	3,732.0(2.43)	1,539.6(1.10)	3,732.0(2.43)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0251$ (sd)	26.6(9.59)	21.5(9.17)	16.5(5.35)	15.0(4.88)	82.7(23.0)	88.3(28.8)	66.4(5.58)	66.5(6.99)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0252$ (sd)	12.0(3.31)	10.9(3.41)	12.7(3.40)	11.8(3.26)	55.8(7.64)	57.0(9.21)	59.8(5.57)	59.9(6.99)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0253$ (λ, p)	12 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0254$	14 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0255$	12 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0256$	15 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0257$	13 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0258$	10 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0259$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0260$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0261$
		8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0262$	6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0263$	8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0264$	5 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0265$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0266$	5 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0267$	-	-
		-	-	-	-	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0268$	5 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0269$	-	-
500	AUC (sd)	0.658(0.018)	0.660(0.016)	0.678(0.018)	0.680(0.017)	0.755(0.014)	0.755(0.013)	0.766(0.013)	0.766(0.013)
	M (sd)	3,848.3(2.53)	9,329.1(6.04)	3,848.3(2.53)	9,329.1(6.04)	3,848.3(2.53)	9,329.1(6.04)	3,848.3(2.53)	9,329.1(6.04)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0270$ (sd)	61.9(22.2)	53.5(4.09)	45.2(6.84)	42.9(4.54)	217.2(9.24)	254.1(73.1)	194.4(10.1)	202.8(11.4)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0271$ (sd)	35.2(6.70)	34.0(4.10)	37.3(4.69)	36.4(4.54)	151.3(9.24)	169.8(26.8)	174.6(10.1)	183.3(11.4)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0272$ (λ, p)	17 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0273$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0274$	18 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0275$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0276$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0277$	17 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0278$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0279$	20 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0280$
		3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0281$	-	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0282$	-	-	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0283$	-	-

$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0284$ was consistently slightly larger for the single-phase strategy than for the two-phase strategy (Table 1 and Supplementary Table S2). The differences increased as M₀ increased, but decreased as n increased and were small for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0285$ . The optimal thresholds for the single-phase procedure were more stringent, resulting in smaller $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0286$ , and a smaller proportion of false positives, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0287$ . Thus, the single-phase approach yielded higher $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0288$ than the two-phase approach because it selected disease-associated SNPs better, even though estimates of $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0289$ from the single-phase method are biased by “winner's curse.”
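The “winner's curse” bias of single-phase estimates can be illustrated with a small simulation. This sketch is not the simulation design of the study: the number of SNPs, the effect-size distribution, the common standard error, and the selection cutoff are all hypothetical, and estimates are drawn directly as the true effect plus normal noise rather than fitted from genotype data.

```python
import numpy as np

rng = np.random.default_rng(1)

n_snps = 20_000
true_beta = np.zeros(n_snps)
true_beta[:200] = rng.normal(0.0, 0.05, size=200)   # hypothetical small effects
se = 0.03                                           # hypothetical common standard error

# Estimates from the data used for selection, and from independent data.
est_select = true_beta + rng.normal(0.0, se, n_snps)
est_indep = true_beta + rng.normal(0.0, se, n_snps)

# Select SNPs whose z-statistic in the selection data passes a cutoff.
selected = np.flatnonzero(np.abs(est_select / se) > 3.5)

# Orient each effect in the direction of the selection-data estimate.
sign = np.sign(est_select[selected])
mean_true = np.mean(sign * true_beta[selected])
mean_same = np.mean(sign * est_select[selected])    # inflated by selection
mean_indep = np.mean(sign * est_indep[selected])    # unbiased for the true effect

print(selected.size, round(mean_true, 3), round(mean_same, 3), round(mean_indep, 3))
```

Estimates taken from the selection data are inflated in magnitude, whereas estimates from independent data are not. The results above indicate that this bias matters less for AUC than the quality of SNP selection.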

To summarize, regardless of the procedure employed, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0290$ was substantially smaller than the theoretically achievable AUC (Supplementary Table S1) for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0291$ and 10,000. Smaller samples ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0292$ ) yielded very low AUC (Supplementary Tables S2 and S4). Much larger samples, such as $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0293$ , are required to approach the theoretical maximum (Supplementary Tables S2 and S4). Moreover, a simple one-phase procedure performed slightly better than the two-phase procedure, which was designed to eliminate bias.

Similar results were obtained for prostate cancer (Appendix C and Supplementary Tables S6–S9 in the Appendix).

Under linkage equilibrium, we relied on the “rare disease” assumption for theoretical calculations and used disease incidence $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0294$ in simulating case-control data. Unreported simulations of case-control data from cohorts with increased incidence indicate that AUC begins to decrease below the theoretical rare disease value for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0295$ but not for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0296$ , 0.01, or 0.015. We conclude that if disease incidence is less than 2%, the “rare disease” theory works well for a disease with genetic architecture similar to that of Crohn's disease. Even a common disease like breast cancer has an incidence of less than 2% over most 5-year age intervals, which are often used for age-specific analyses.

We studied the effect of excluding SNPs that had been selected by P-value thresholding if $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0297$ , and in a separate study, if $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0298$ . We also studied the effects of adding SNPs with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0299$ , even if their P-values exceeded selection thresholds. Unreported simulations showed that using $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0300$ to select additional SNPs or to remove SNPs yielded lower AUC for Crohn's disease than P-value thresholding alone.

Linkage Disequilibrium in Crohn's Disease

Data were simulated as in the Section on Simulating a Source Population with Joint SNP Genotypes, and highly correlated SNPs were removed (Appendix B). We studied 20 independent realizations of ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0301$ ). For each realization we determined the choices of P-value threshold or T and λ that maximized AUC, and we present the averaged results from the 20 realizations (Tables 2, 3, and Supplementary Table S4) for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0302$ .

Table 2. Two-phase model with linkage disequilibrium (LD): average maximal AUC from P-value thresholding for Crohn's disease over 20 independent replications of the entire experiment, each with its sampled vectors ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0303$ ). To preserve LD structure, joint genotypes in the source population were obtained by resampling autosomal genotypes independently from WTCCC controls. Maximization is over $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0304$ and P-value threshold. There were $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0305$ SNPs, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0306$ cases and controls. Parameters that were varied include M₀ and π₀. M₀ denotes the number of disease SNPs in the observable range; M denotes the total number of truly disease-associated SNPs; $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0307$ (not shown) is the average number of SNPs selected initially in Phase 1; $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0308$ is the average number of SNPs selected after removing highly correlated ones ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0309$ ); and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0310$ denotes the average number of truly disease-associated SNPs selected. $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0311$ denotes the number of times one combination of (λ, P-value) maximizes the AUC in 20 independent replications of the entire experiment

		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0312$		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0313$
M₀		Multivariate	Univariate	Multivariate	Univariate
100	AUC (sd)	0.559(0.019)	0.561(0.018)	0.563(0.016)	0.564(0.015)
	M (sd)	770.1(0.447)	770.1(0.447)	1,866.4(0.813)	1,866.4(0.813)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0314$ (sd)	19.9(11.8)	47.5(48.5)	18.8(7.49)	43.6(44.7)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0315$ (sd)	3.33(0.81)	4.54(1.46)	3.32(1.19)	4.53(2.69)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0316$ (λ, p)	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0317$ , 4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0318$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0319$ , 9 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0320$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0321$ ,1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0322$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0323$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0324$
		9 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0325$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0326$	5 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0327$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0328$	10 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0329$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0330$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0331$ , 4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0332$
		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0333$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0334$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0335$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0336$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0337$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0338$ ,1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0339$
		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0340$	-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0341$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0342$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0343$
		-	-	-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0344$
200	AUC (sd)	0.586(0.016)	0.587(0.015)	0.585(0.017)	0.585(0.015)
	M (sd)	1,539.7(0.801)	1,539.7(0.801)	3,732.3(1.69)	3,732.3(1.69)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0345$ (sd)	37.9(17.9)	88.3(47.4)	38.5(15.2)	150.2(140.1)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0346$ (sd)	6.53(1.92)	10.4(3.15)	6.82(2.08)	12.9(6.93)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0347$ (λ, p)	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0348$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0349$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0350$ , 7 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0351$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0352$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0353$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0354$ , 4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0355$
		2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0356$ , 9 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0357$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0358$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0359$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0360$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0361$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0362$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0363$
		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0364$ ,1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0365$	7 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0366$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0367$	11 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0368$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0369$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0370$ ,4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0371$
		3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0372$	-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0373$	4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0374$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0375$
		-	-		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0376$
500	AUC (sd)	0.617(0.012)	0.624(0.010)	0.602(0.011)	0.608(0.009)
	M (sd)	3,848.6(1.79)	3,848.6(1.79)	9,329.9(4.35)	9,329.9(4.35)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0377$ (sd)	98.6(37.6)	815.1(601.2)	101.7(36.7)	1,162.3(774.3)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0378$ (sd)	16.3(4.09)	47.3(19.1)	16.8(3.73)	76.8(37.9)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0379$ (λ, p)	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0380$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0381$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0382$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0383$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0384$ , 9 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0385$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0386$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0387$
		8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0388$ , 8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0389$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0390$ , 8 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0391$	6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0392$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0393$	6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0394$ , 4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0395$
		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0396$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0397$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0398$	-	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0399$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0400$
		-	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0401$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0402$	-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0403$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0404$

The AUCs from univariate estimation exceeded those from multivariate estimation very slightly (Table 2), but statistically significantly (Supplementary Table S11). $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0405$ values were hardly changed by additional disease-associated SNPs in the nonobservable range ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0406$ ), indicating that $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0407$ is too small to extract this information. The optimal λ tended to be smaller under LD for multivariate estimation than for univariate estimation and smaller than under independence, possibly because more information is required for multivariate estimates in Phase 2. The single-phase approach (Table 3) yielded larger maximal AUCs than the two-phase strategy (Table 2). The single-phase approach selected more disease-associated SNPs and proportionately fewer nondisease-associated SNPs.

Table 3. Single-phase model with linkage disequilibrium (LD): average maximal AUC from P-value thresholding for Crohn's disease over 20 independent replications of the entire experiment, each with its sampled vectors ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0408$ ). To preserve LD structure, joint genotypes in the source population were obtained by resampling autosomal genotypes independently from WTCCC controls. Maximization is over P-value threshold. There were $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0409$ SNPs, and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0410$ cases and controls. Parameters that were varied include M₀ and π₀. M₀ denotes the number of disease SNPs in the observable range; M denotes the total number of truly disease-associated SNPs; $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0411$ (not shown) is the average number of SNPs selected; $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0412$ is the average number of SNPs selected after removing highly correlated ones ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0413$ ); and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0414$ denotes the average number of truly disease-associated SNPs selected. $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0415$ denotes the number of times one combination of (P-value) maximizes the AUC in 20 independent replications of the entire experiment

M₀		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0416$	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0417$
100	AUC (sd)	0.567(0.018)	0.569(0.016)
	M (sd)	770.1(0.447)	1,866.4(0.813)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0418$ (sd)	27.6(20.2)	39.4(50.0)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0419$ (sd)	4.89(1.50)	5.69(3.81)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0420$ (p)	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0421$ , 6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0422$	3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0423$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0424$
		6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0425$ , 6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0426$	6 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0427$ , 4 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0428$
		1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0429$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0430$ , 2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0431$
200	AUC (sd)	0.595(0.015)	0.593(0.014)
	M (sd)	1,539.7(0.801)	3,732.3(1.69)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0432$ (sd)	86.6(82.1)	229.6(392.7)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0433$ (sd)	12.6(5.4)	18.9(12.8)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0434$ (p)	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0435$ , 5 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0436$	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0437$ , 3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0438$
		10 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0439$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0440$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0441$ , 10 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0442$
		3 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0443$	2 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0444$ , 1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0445$
		-	1 $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0446$

The $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0447$ values obtained under LD with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0448$ are slightly smaller than those under linkage equilibrium, especially for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0449$ (Tables 1, 2, and 3), possibly because the nondisease-associated SNPs in high LD with disease-associated SNPs are sometimes selected instead. Optimal univariate procedures have larger $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0450$ under LD (Tables 1 and 2). Ranking SNPs led to virtually the same average $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0451$ as P-value thresholding (Table 2 and Supplementary Table S4), although a few of the tiny differences were statistically significant (Supplementary Table S12).

The $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0452$ (with standard error) was estimated under LD for Crohn's disease. For $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0453$ , the estimates were 0.691 (0.0020), 0.756 (0.0019), and 0.856 (0.0011), respectively, for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0454$ and 500, and for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0455$ they were 0.713 (0.0021), 0.782 (0.0017), and 0.876 (0.0011). The $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0456$ values are slightly larger under independence (Supplementary Table S1) than with LD for M₀ = 200 and 500, because under LD the correlations among genotypes reduce σ².
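The link between σ² and the achievable AUC can be sketched numerically. The sketch below rests on assumptions that are not restated in this section: that the log relative risk score is normally distributed with variance σ² in controls and shifted by σ² in cases (the rare disease setting), which gives AUC = Φ(σ/√2), and that σ² equals βᵀΣβ for a genotype covariance matrix Σ. The effect sizes, allele frequency, and covariance are hypothetical.

```python
import math

import numpy as np


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def risk_score_variance(beta: np.ndarray, genotype_cov: np.ndarray) -> float:
    """Variance of the log relative risk score, beta' * Sigma * beta."""
    return float(beta @ genotype_cov @ beta)


def auc_from_variance(sigma2: float) -> float:
    """AUC under the normal, rare disease approximation: Phi(sigma / sqrt(2))."""
    return normal_cdf(math.sqrt(sigma2 / 2.0))


# Two hypothetical SNPs with allele frequency 0.3 and log odds ratio 0.2 each.
beta = np.array([0.2, 0.2])
var_g = 2 * 0.3 * 0.7                               # genotype variance 2p(1 - p)
cov_le = np.diag([var_g, var_g])                    # linkage equilibrium
cov_ld = np.array([[var_g, -0.1], [-0.1, var_g]])   # hypothetical negative covariance

sigma2_le = risk_score_variance(beta, cov_le)
sigma2_ld = risk_score_variance(beta, cov_ld)
auc_le = auc_from_variance(sigma2_le)
auc_ld = auc_from_variance(sigma2_ld)
print(sigma2_le, sigma2_ld, auc_le, auc_ld)
```

In this toy case the negative covariance between risk alleles lowers σ² and hence the AUC. A positive covariance between alleles with effects of the same sign would raise it, so the direction depends on the LD pattern and the signs of the effects.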

In the presence of LD, large numbers of SNPs, including both disease-associated SNPs and their correlated neighbors, satisfy P-value threshold criteria for selection. Removing highly correlated SNPs as in Appendix B led to larger AUC values both for the univariate and multivariate procedures (Supplementary Table S13).
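One simple way to remove highly correlated SNPs is a greedy filter that visits SNPs in order of increasing P-value and keeps a SNP only if its absolute correlation with every SNP already kept is at most a cutoff. This is a hypothetical sketch and may differ from the procedure in Appendix B. The function name, the toy genotypes, and the P-values are assumptions, and the default cutoff of 0.95 is borrowed from the data example.

```python
import numpy as np


def prune_correlated(genotypes: np.ndarray, pvalues: np.ndarray,
                     max_abs_r: float = 0.95) -> list:
    """Greedy filter: keep the most significant SNP of each highly correlated group.

    genotypes has one row per subject and one column per selected SNP.
    """
    corr = np.corrcoef(genotypes, rowvar=False)
    kept = []
    for j in np.argsort(pvalues, kind="stable"):
        if all(abs(corr[j, k]) <= max_abs_r for k in kept):
            kept.append(int(j))
    return sorted(kept)


rng = np.random.default_rng(2)
g = rng.binomial(2, 0.3, size=(500, 3))    # three independent toy SNPs
g = np.column_stack([g, g[:, 0]])          # SNP 3 duplicates SNP 0
pvalues = np.array([0.01, 0.2, 0.03, 0.001])

kept = prune_correlated(g, pvalues)
print(kept)
```

SNP 0 is dropped because it is perfectly correlated with SNP 3, which has the smaller P-value. A monomorphic SNP would produce an undefined correlation and would also be dropped by this rule.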

Data Example

Kooperberg et al. [2010] examined stepwise regression, lasso, elastic net, and other procedures to estimate model (1). They first used the Crohn's disease training data to select about 2,000 marginally most significant SNPs from among 333,187 SNPs. Then they applied the previous procedures to these 2,000 SNPs or subsets of them to build risk models with the training data, and tested them in independent data.

We compared our simple model building strategies with those in Kooperberg et al. [2010], using the same data. Following their approach, we randomly selected a training set (1,045 cases and 1,763 controls) and test set (703 cases and 1,175 controls) for estimating AUC. Based on findings in the Section on Simulation Results and the fact that the training sample was small, we chose $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0457$ and randomly divided the training set into two parts: 1,045λ cases and 1,763λ controls in Phase 1 and the remainder in Phase 2. We used both P-value thresholding (P-value = $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0458$ ) and ranking methods (T = (10, 30, 50, 100, 200, 300)) to select SNPs. All highly correlated SNPs with absolute correlation exceeding 0.95 were removed and both multivariate and univariate logistic regression models were applied. In the test data, we calculated AUC values. By repeating all the procedures including selection of the training and test sets 10 times, we obtained the mean AUCs by averaging over the 10 results for each scenario (Table 4). The highest AUC was achieved for P-value $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0459$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0460$ and ranged from 0.591 to 0.640 with mean 0.614 for multivariate fits. For univariate fits, the highest AUC was achieved for P-value $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0461$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0462$ and ranged from 0.619 to 0.647 with mean 0.632. The ranking method gave similar results. The highest AUC was achieved for T = 50, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0463$ and ranged from 0.592 to 0.633 with mean 0.612 for multivariate fits, and for T = 200, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0464$ and ranged from 0.612 to 0.648 with mean 0.632 for univariate fits.
These results are very similar to the highest AUC estimate among those for 15 procedures, 0.637, in Table 1 of Kooperberg et al. [2010], even though we used different random splits into training and testing data. The maximum AUCs from the single-phase strategy were similar or slightly larger (0.624–0.642 for P-value thresholding and 0.621–0.624 for ranking) than for the two-phase strategy or the methods in Kooperberg et al. [2010] (Table 4). These findings suggest that the most critical aspect of model building for prediction is initial SNP selection, and that many procedures will perform comparably well once promising SNPs are detected.
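The test-set evaluation described above can be sketched as a risk score followed by an empirical AUC. This is a minimal illustration with hypothetical function names and toy scores: the risk score is the sum of allele counts weighted by estimated log odds ratios, and the AUC is the proportion of case-control pairs in which the case has the higher score, with ties counted as one half.

```python
import numpy as np


def risk_score(genotypes: np.ndarray, log_odds_ratios: np.ndarray) -> np.ndarray:
    """Risk score per subject: allele counts weighted by estimated log odds ratios."""
    return genotypes @ log_odds_ratios


def empirical_auc(case_scores: np.ndarray, control_scores: np.ndarray) -> float:
    """Proportion of case-control pairs ranked correctly, with ties counted as 1/2."""
    diff = case_scores[:, None] - control_scores[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


# Toy genotypes (two subjects, two SNPs) and hypothetical log odds ratios.
toy_scores = risk_score(np.array([[0, 1], [2, 1]]), np.array([0.1, 0.2]))

# Toy test-set scores for three cases and four controls.
case_scores = np.array([0.9, 0.4, 0.7])
control_scores = np.array([0.1, 0.4, 0.3, 0.8])
auc = empirical_auc(case_scores, control_scores)
print(toy_scores, auc)
```

The pairwise comparison is quadratic in sample size, which is unproblematic for test sets of the size used here (703 cases and 1,175 controls).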

Table 4. WTCCC Crohn's disease data example: average estimates of AUC for various choices of P-value threshold (or number of top ranks, T) and λ. AUC results (with standard deviation in parentheses) are averaged over 10 different random allocations of the WTCCC data to training sets ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0465$ cases, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0466$ controls) and test sets ( $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0467$ cases, $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0468$ controls)

		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0469$		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0470$		$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0471$
Method		multivariate	univariate	multivariate	univariate	multivariate	univariate	Single-Phase Model
P-value	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0472$	-	0.618(0.009)	-	0.625(0.008)	-	0.617(0.008)	0.627(0.010)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0473$	0.588(0.012)	0.625(0.010)	0.584(0.019)	0.632(0.010)	0.563(0.019)	0.623(0.016)	0.633(0.010)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0474$	0.598(0.017)	0.621(0.011)	0.603(0.017)	0.627(0.010)	0.593(0.018)	0.630(0.017)	0.635(0.006)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0475$	0.596(0.023)	0.610(0.015)	0.614(0.015)	0.620(0.011)	0.607(0.015)	0.628(0.015)	0.632(0.008)
	$urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0476$	0.596(0.025)	0.603(0.018)	0.609(0.013)	0.614(0.011)	0.611(0.013)	0.622(0.014)	0.626(0.010)
Ranking	300	0.583(0.013)	0.624(0.011)	0.577(0.019)	0.630(0.008)	-	0.623(0.014)	0.631(0.008)
	200	0.590(0.013)	0.624(0.011)	0.590(0.019)	0.632(0.011)	0.574(0.017)	0.629(0.014)	0.634(0.010)
	100	0.596(0.016)	0.621(0.013)	0.600(0.016)	0.627(0.010)	0.597(0.0158)	0.629(0.016)	0.635(0.006)
	50	0.600(0.016)	0.614(0.013)	0.612(0.012)	0.621(0.011)	0.609(0.009)	0.628(0.014)	0.628(0.010)
	30	0.601(0.026)	0.610(0.017)	0.610(0.008)	0.617(0.010)	0.610(0.016)	0.623(0.017)	0.622(0.009)
	10	0.586(0.012)	0.588(0.012)	0.592(0.014)	0.594(0.013)	0.590(0.021)	0.596(0.020)	0.599(0.011)
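The two-phase and single-phase procedures compared in this example can be sketched in a few lines of Python. This is only an illustration on simulated data, not the analysis code behind Table 4: the sample sizes, effect sizes, λ = 0.75, the P-value threshold of 1e-3, and all function names are assumptions chosen for the sketch. SNPs are simulated independently, so the correlation filter is omitted, and the allelic log odds ratio stands in for univariate logistic regression.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_cases, n_controls, maf, beta):
    """Genotypes (0/1/2) under a rare-disease, log-additive model with
    independent SNPs (linkage equilibrium)."""
    odds = maf / (1 - maf) * np.exp(beta)      # allele odds among cases
    maf_cases = odds / (1 + odds)
    cases = rng.binomial(2, maf_cases, size=(n_cases, maf.size))
    controls = rng.binomial(2, maf, size=(n_controls, maf.size))
    return cases, controls

def p_values(cases, controls):
    """Two-sided P-values from a per-SNP z-test on mean allele counts."""
    diff = cases.mean(0) - controls.mean(0)
    se = np.sqrt(cases.var(0, ddof=1) / len(cases)
                 + controls.var(0, ddof=1) / len(controls))
    return np.array([math.erfc(abs(z) / math.sqrt(2)) for z in diff / se])

def allelic_log_or(cases, controls):
    """Marginal per-allele log odds ratios from allele frequencies."""
    f1, f0 = cases.mean(0) / 2, controls.mean(0) / 2
    return np.log(f1 / (1 - f1)) - np.log(f0 / (1 - f0))

def auc(case_scores, control_scores):
    """Mann-Whitney estimate of P(case score > control score); ties count 1/2."""
    gt = (case_scores[:, None] > control_scores[None, :]).mean()
    eq = (case_scores[:, None] == control_scores[None, :]).mean()
    return gt + 0.5 * eq

# Hypothetical setting: 1,000 SNPs, the first 20 truly associated.
n_snps, n_causal = 1000, 20
maf = rng.uniform(0.1, 0.5, n_snps)
beta = np.zeros(n_snps)
beta[:n_causal] = 0.3
train_cases, train_controls = simulate(2000, 2000, maf, beta)
test_cases, test_controls = simulate(500, 500, maf, beta)

# Two-phase: a fraction lam of the training data selects SNPs (Phase 1),
# the remainder estimates their log odds ratios (Phase 2).
lam = 0.75
k1, k0 = int(lam * len(train_cases)), int(lam * len(train_controls))
pvals = p_values(train_cases[:k1], train_controls[:k0])
by_threshold = np.flatnonzero(pvals < 1e-3)    # P-value thresholding
by_rank = np.argsort(pvals)[:30]               # ranking, T = 30

def two_phase_auc(selected):
    w = allelic_log_or(train_cases[k1:][:, selected],
                       train_controls[k0:][:, selected])
    return auc(test_cases[:, selected] @ w, test_controls[:, selected] @ w)

auc_threshold = two_phase_auc(by_threshold)
auc_rank = two_phase_auc(by_rank)

# Single-phase: the same training data both select SNPs and estimate effects.
sel_all = np.flatnonzero(p_values(train_cases, train_controls) < 1e-3)
w_all = allelic_log_or(train_cases[:, sel_all], train_controls[:, sel_all])
auc_single = auc(test_cases[:, sel_all] @ w_all, test_controls[:, sel_all] @ w_all)

print(auc_threshold, auc_rank, auc_single)
```

Varying `lam`, the threshold, or the number of top-ranked SNPs in this sketch mirrors the rows and columns of Table 4, although the simulated numbers are not comparable to the WTCCC results.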

Discussion

Chatterjee et al. [2013] showed that hundreds of thousands of GWAS samples are needed to extract most of the predictive information from SNPs. For many diseases, it is not feasible to assemble more than $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0477$ cases and controls. We used realistic distributions of log odds ratios per allele derived from GWAS data, to study model-building strategies for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0478$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0479$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0480$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0481$ . Our model parameters were consistent with estimates of heritability on the liability scale [Chatterjee et al., 2013; Lee et al., 2011]. Our studies led to the following conclusions: (1) A one-phase procedure that uses all the training data to both select SNPs and estimate SNP effects yields larger AUC values than a two-phase procedure that yields unbiased estimates of log-odds ratios. (2) If one desires unbiased estimates of log-odds ratios, one can use a two-phase procedure, but one should allocate most of the training data to SNP selection (Phase 1). (3) Similar $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0482$ values are obtained whether SNPs are selected by P-value thresholding or by ranking P-values. (4) Despite the fact that there are thousands of SNPs with small log-odds ratios that could potentially improve discriminatory accuracy, for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0483$ or fewer cases and controls, one should only select tens or hundreds of SNPs to achieve the highest AUC. (5) Univariate estimation yields higher AUC values than multivariate modeling, even under LD. (6) Under LD, it is useful to filter out highly correlated SNPs, both for univariate and multivariate estimation. 
(7) Similar conclusions apply to maximizing the probability of correct classification [Liu et al., 2012], PCC, as to maximizing AUC, because AUC is nearly linearly related to PCC (Appendix D and Supplementary Table S14 in the Appendix).
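The near-linear relation between AUC and PCC in conclusion (7) can be illustrated numerically. The sketch below does not reproduce Appendix D; it assumes a binormal score model with equal variances, equal weight on cases and controls, and a cutoff midway between the two score means, in which case AUC = Φ(Δ/√2) and PCC = Φ(Δ/2) for standardized separation Δ. The function name `pcc_from_auc` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()

def pcc_from_auc(auc):
    """PCC implied by a given AUC under the assumed binormal model."""
    delta = 2 ** 0.5 * std_normal.inv_cdf(auc)   # invert AUC = Phi(delta / sqrt(2))
    return std_normal.cdf(delta / 2)             # PCC at the midpoint cutoff

# AUC values from 0.50 to 0.80, the range relevant to SNP-based models.
auc_grid = [0.50 + 0.01 * i for i in range(31)]
pcc_grid = [pcc_from_auc(a) for a in auc_grid]

# Largest departure of PCC from the straight line joining the endpoints.
slope = (pcc_grid[-1] - pcc_grid[0]) / (auc_grid[-1] - auc_grid[0])
max_gap = max(abs(p - (pcc_grid[0] + slope * (a - auc_grid[0])))
              for a, p in zip(auc_grid, pcc_grid))
print(f"largest departure from a straight line: {max_gap:.4f}")
```

Under these assumptions the curve stays within about one percentage point of a straight line over this range, so a strategy that ranks highest on AUC would be expected to rank highest on PCC as well.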

Our work addresses specific modeling choices, as indicated in the previous paragraph, and also has some implications for the potential utility of SNPs for risk modeling. A number of papers have focused on the potential utility of SNPs for risk prediction. Some of this literature describes the potential heritability explicable by SNPs, which is substantially greater than that attributable to previously discovered disease-associated SNPs [Lee et al., 2011; Purcell et al., 2009; Stahl et al., 2012; Yang et al., 2010]. Other work describes the discriminatory accuracy (AUC) that SNPs could potentially provide, based on the following assumptions: (1) a joint risk model for SNPs at various loci, such as the logistic main effects model; (2) distribution of allele frequencies; (3) Hardy–Weinberg equilibrium; (4) linkage equilibrium; and (5) distributions of SNP effects, either explicit or implied by heritability assumptions. Under these assumptions, AUC can be estimated either analytically, as in Gail [2008], Moonesinghe et al. [2009], or Wray et al. [2010], or by simulations, as in Janssens et al. [2006], Wray et al. [2007], and Pepe et al. [2010]. Although this work demonstrates the difficulty of achieving high AUC with SNPs, it assumes that the disease-associated SNPs and their effect sizes are known and does not investigate the model-building process. In particular, this work does not account for the degradation in performance that arises from uncertainties in model-fitting, including selection of informative SNPs and estimation of their associated effects. Chatterjee et al. [2013] addressed these concerns and used asymptotic theory to show that very large samples are required to overcome these uncertainties. The present paper complements that of Chatterjee et al. [2013] by evaluating the performance of one- and two-phase model-fitting procedures theoretically and by simulation in GWAS sample sizes ranging from $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0484$ (1,000 cases and 1,000 controls) to $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0485$ . Moreover, the present paper studies performance not only under linkage equilibrium, as in Chatterjee et al. [2013], but also under linkage disequilibrium. Our work shows that for “small” samples such as $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0486$ , including more than 100 SNPs degraded performance, both for Crohn's disease and prostate cancer. These findings are driven by empirical data on the distribution of log odds ratios per allele for Crohn's disease and prostate cancer, as in Park et al. [2011]. Kang et al. [2010] found similar results under linkage equilibrium when they used an odds ratio distribution with median odds ratio 1.13.

Other researchers have obtained insight into model-building procedures by analyzing real data examples in various ways, rather than by using simulations to study operating characteristics. For example, Kooperberg et al. [2010] compared the performance of lasso and other model-building procedures on the WTCCC Crohn's disease data and recommended filtering highly correlated SNPs. Wei et al. [2009] obtained promising AUC values for type 1 diabetes by using a support vector machine to analyze SNPs that were preselected with a liberal fixed threshold (e.g. $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0487$ ).

We studied the two-phase procedure because it yields unbiased estimates of log odds ratios and because we thought that it might outperform a one-phase procedure that yields estimates biased away from zero by virtue of the “winner's curse” [Zöllner and Pritchard, 2007]. If the disease-associated SNPs were known, then using unbiased estimates would yield a higher AUC than using biased estimates with similar precision. Thus, we hypothesized that for large enough sample sizes where SNP selection is adequate in Phase 1 and log odds ratios are precisely estimated in Phase 2, unbiased estimation might lead to improved prediction. In fact, however, the one-phase procedure outperformed the two-phase procedure for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0488$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0489$ and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0490$ . For $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0491$ , the one-phase procedure was better, but differences were very slight. This finding emphasizes the importance of correct SNP selection, rather than unbiased estimation of log odds ratios, as also indicated by our finding that most of the cases and controls should be allocated to Phase 1 in a two-phase procedure. Kang et al. [2010] found that simply counting the number of adverse alleles yielded AUCs as high or higher than using $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0492$ for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0493$ and median odds ratio 1.13. Other methods for reducing bias [e.g. Bowden and Dudbridge, 2009; Zhong and Prentice, 2010; Zöllner and Pritchard, 2007] might use the data more efficiently and outperform the two-phase approach.
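The winner's curse described above is easy to reproduce in a toy simulation. The sketch below is an illustration under assumed values, not part of our simulations: a single SNP with true log odds ratio 0.10, a standard error of 0.05 in both data sets, and selection at two-sided P < 1e-3. Estimates taken from the same data that passed selection are inflated, whereas estimates from independent data, as in Phase 2 of a two-phase procedure, are not.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)

# Hypothetical values for illustration only.
true_beta, se, n_rep = 0.10, 0.05, 200_000
z_crit = NormalDist().inv_cdf(1 - 1e-3 / 2)    # two-sided P < 1e-3

# Estimates of the same log odds ratio from two independent data sets.
discovery = rng.normal(true_beta, se, n_rep)    # data used for selection
replication = rng.normal(true_beta, se, n_rep)  # independent data

selected = np.abs(discovery / se) > z_crit

# One-phase: reuse the discovery estimate of each selected replicate.
mean_one_phase = discovery[selected].mean()
# Two-phase: estimate selected replicates from the independent data.
mean_two_phase = replication[selected].mean()

print(f"true log odds ratio:          {true_beta:.3f}")
print(f"mean estimate, same data:     {mean_one_phase:.3f}")
print(f"mean estimate, separate data: {mean_two_phase:.3f}")
```

The sketch gives both data sets the same precision. In an actual two-phase design the Phase 2 estimates come from a smaller share of the training data and are therefore noisier, which is one reason unbiasedness alone need not translate into a higher AUC.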

Well-calibrated risk models with modest discriminatory accuracy such as AUC = 0.6 can be used for assessing risks in populations, designing prevention trials, and weighing the risks and benefits of interventions [Gail, 2011] and can improve the efficiency of screening programs, compared to programs based on age alone [Pashayan et al., 2011]. Higher levels of discriminatory accuracy are required for many screening applications [Gail and Pfeiffer, 2005]. Our data for Crohn's disease indicate that SNPs alone will yield an AUC of only about 0.6 for a GWAS with $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0494$ cases and about 0.7 for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0495$ . With $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0496$ , the AUC might be 0.8. Because Crohn's disease is rare, the positive predictive value of tests in the general population would be low, even with AUC = 0.8. Such a genetic risk tool might have utility in a clinic for gastrointestinal disease where the prevalence of Crohn's disease is higher, but diagnostic tests usually require much higher sensitivity and specificity. For prostate cancer, we found AUCs of about 0.6, 0.65, and 0.68, respectively, for $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0497$ , $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0498$ , and $urn:x-wiley:07410395:gepi21762:media:gepi21762-math-0499$ . Park et al. [2012] estimated that a GWAS three times as large as the largest one to date would yield an AUC of 0.676 and that combining SNP information with family history would yield AUC = 0.694. However, whether one should screen for a disease like prostate cancer also depends on the risks and benefits of available interventions, as is clear from the US Preventive Services Task Force recommendation against screening with prostate-specific antigen [Chou et al., 2011].
For diseases such as breast cancer for which other strong risk factors like mammographic density, family history, and history of biopsies are available, adding SNP information to models containing such factors may achieve AUC levels of 0.7 or more. It is likely that such models can be quite useful, for example in deciding whether certain young women have high enough risks to warrant screening mammography. Jostins and Barrett [2011] discuss other aspects of the potential clinical utility of genetic risk prediction.
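The remark above that positive predictive value stays low for a rare disease even at AUC = 0.8 follows from Bayes' rule, and a short calculation makes the point concrete. The prevalences (0.2% in the general population, 10% in a specialty clinic) and the 90% specificity are hypothetical values chosen for illustration, not estimates from this paper, and the binormal equal-variance score model is an added assumption.

```python
from statistics import NormalDist

std_normal = NormalDist()

def sensitivity_at(auc, specificity):
    """Sensitivity at a given specificity under a binormal,
    equal-variance score model with the stated AUC."""
    delta = 2 ** 0.5 * std_normal.inv_cdf(auc)   # separation of score means
    cutoff = std_normal.inv_cdf(specificity)     # cutoff on the control scale
    return 1 - std_normal.cdf(cutoff - delta)

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

spec = 0.90
sens = sensitivity_at(0.80, spec)

# Hypothetical prevalences: general population versus a specialty clinic.
ppv_general = ppv(sens, spec, 0.002)
ppv_clinic = ppv(sens, spec, 0.10)

print(f"sensitivity at 90% specificity: {sens:.3f}")
print(f"PPV, prevalence 0.2%: {ppv_general:.4f}")
print(f"PPV, prevalence 10%:  {ppv_clinic:.3f}")
```

Under these assumed inputs the positive predictive value in the general population is on the order of 1%, while the same test applied where prevalence is 10% yields a value roughly thirty-fold higher.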

Acknowledgments

This work was supported by the intramural research program of the National Cancer Institute, Division of Cancer Epidemiology and Genetics. We thank Drs. Ju-Hyun Park and Nilanjan Chatterjee for providing the formulas for the conditional densities of log odds ratios per allele given minor allele frequency, leading to the distributions in Supplementary Fig. S1. We thank Dr. Charles Kooperberg for providing the Crohn's disease data with imputed genotypes from Kooperberg et al. [2010] and the Wellcome Trust Case Control Consortium for access to the data. We thank the reviewers for helpful comments that improved the paper.

References

Bowden J, Dudbridge F. 2009. Unbiased estimation of odds ratios: combining genome-wide association scans with replication studies. Genet Epidemiol 33: 406–418.
10.1002/gepi.20394
Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park J. 2013. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet 45: 1–6.
10.1038/ng.2579
Chou R, Croswell JM, Dana T, Bougatsos C, Blazina I, Fu R, Gleitsmann K, Koenig HC, Lam C, Maltz A and others. 2011. Screening for prostate cancer: a review of the evidence for the US Preventive Services Task Force. Ann Intern Med 155: 762–771.
10.7326/0003-4819-155-11-201112060-00375
Gail MH. 2008. Discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. J Natl Cancer Inst 100: 1037–1041.
10.1093/jnci/djn180
Gail MH. 2011. Personalized estimates of breast cancer risk in clinical practice and public health. Stat Med 30: 1090–1104.
10.1002/sim.4187
Gail MH, Pfeiffer RM. 2005. On criteria for evaluating models of absolute risk. Biostatistics 6: 227–239.
10.1093/biostatistics/kxi005
Gail MH, Pfeiffer RM, Wheeler W, Pee D. 2008. Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics 9: 201–215.
10.1093/biostatistics/kxm032
Janssens ACJ, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, van Duijn CM. 2006. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med 8: 395–400.
10.1097/01.gim.0000229689.18263.f4
Jostins L, Barrett JC. 2011. Genetic risk prediction in complex disease. Hum Mol Genet 20: R182–R188.
10.1093/hmg/ddr378
Kang J, Cho J, Zhao H. 2010. Practical issues in building risk-predicting models for complex diseases. J Biopharm Stat 20: 415–440.
10.1080/10543400903572829
Kooperberg C, LeBlanc M, Obenchain V. 2010. Risk prediction using genome-wide association studies. Genet Epidemiol 34: 643–652.
10.1002/gepi.20509
Lee SH, Wray NR, Goddard ME, Visscher PM. 2011. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet 88: 294–305.
10.1016/j.ajhg.2011.02.002
Liu X, Wang Y, Rekaya R, Sriram TN. 2012. Sample size determination for classifiers based on single-nucleotide polymorphisms. Biostatistics 13: 217–227.
10.1093/biostatistics/kxr053
Moonesinghe R, Liu T, Khoury MJ. 2009. Evaluation of the discriminative accuracy of genomic profiling in the prediction of common complex diseases. Eur J Hum Genet 18: 485–489.
10.1038/ejhg.2009.209
Park JH, Gail MH, Greene MH, Chatterjee N. 2012. Potential usefulness of single nucleotide polymorphisms to identify persons at high cancer risk: an evaluation of seven common cancers. J Clin Oncol 30: 2157–2162.
10.1200/JCO.2011.40.1943
Park JH, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, Chanock SJ, Fraumeni JF Jr, Chatterjee N. 2011. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci 108: 18026–18031.
10.1073/pnas.1114759108
Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N. 2010. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42: 570–575.
10.1038/ng.610
Pashayan N, Duffy SW, Chowdhury S, Dent T, Burton H, Neal DE, Easton DF, Eeles R, Pharoah P. 2011. Polygenic susceptibility to prostate and breast cancer: implications for personalised screening. Br J Cancer 104: 1656–1663.
10.1038/bjc.2011.118
Pepe MS, Gu JW, Morris DE. 2010. The potential of genes and other markers to inform about risk. Cancer Epidemiol Biomarkers Prev 19: 655–665.
10.1158/1055-9965.EPI-09-0510
Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, Sklar P, Ruderfer DM, McQuillin A, Morris DW and others. 2009. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752.
10.1038/nature08185
Stahl EA, Wegmann D, Trynka G, Gutierrez-Achury J, Do R, Voight BF, Kraft P, Chen R, Kallberg HJ, Kurreeman FA and others. 2012. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet 44: 483–489.
10.1038/ng.2232
Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, Frackleton E, Hou C, Glessner JT, Chiavacci R and others. 2009. From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet 5: e1000678.
10.1371/journal.pgen.1000678
Wray NR, Goddard ME, Visscher PM. 2007. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res 17: 1520–1528.
10.1101/gr.6665407
Wray NR, Yang J, Goddard ME, Visscher PM. 2010. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet 6: e1000864.
10.1371/journal.pgen.1000864
Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW and others. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42: 565–569.
10.1038/ng.608
Zhong H, Prentice RL. 2010. Correcting “winner's curse” in odds ratios from genome-wide association findings for major complex human diseases. Genet Epidemiol 34: 78–91.
10.1002/gepi.20437
Zöllner S, Pritchard JK. 2007. Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80: 605–615.
10.1086/512821


Volume 37, Issue 8

December 2013

Pages 768-777

