Volume 35, Issue 8 pp. 781-789
Original Article

Defining the power limits of genome-wide association scan meta-analyses

Kay Chapman
Wellcome Trust Centre for Human Genetics, Roosevelt Drive, University of Oxford, Oxford, United Kingdom
Botnar Research Centre, Institute of Musculoskeletal Sciences, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Windmill Road, Oxford, United Kingdom

Teresa Ferreira
Wellcome Trust Centre for Human Genetics, Roosevelt Drive, University of Oxford, Oxford, United Kingdom

Andrew Morris
Wellcome Trust Centre for Human Genetics, Roosevelt Drive, University of Oxford, Oxford, United Kingdom

Jennifer Asimit
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

Eleftheria Zeggini (Corresponding Author)
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
Correspondence to: Wellcome Trust Sanger Institute, The Morgan Building, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1HH, UK

First published: 15 September 2011

Abstract

Large-scale meta-analyses of genome-wide association scans (GWAS) have been successful in discovering common risk variants with modest and small effects. The detection of lower frequency signals will undoubtedly require concerted efforts of at least similar scale. We investigate the sample size-dictated power limits of GWAS meta-analyses, in the presence and absence of modest levels of heterogeneity and across a range of different allelic architectures. We find that data combination through large-scale collaboration is vital in the quest for complex trait susceptibility loci, but that effect size heterogeneity across meta-analyzed studies drawn from similar populations does not appear to have a profound effect on sample size requirements. Genet. Epidemiol. 35:781-789, 2011. © 2011 Wiley Periodicals, Inc.

INTRODUCTION

The advent of genome-wide association scans (GWAS) has undoubtedly changed the field of complex trait genetics dramatically over the last few years. GWAS became feasible following an alignment of possibilities, including large-scale sample size availability, an improved understanding of human genome sequence variation, advances in high-accuracy, high-throughput genotyping technologies, and a deeper appreciation of analytical considerations for the interpretation of data. The first wave of GWAS led to the successful identification of multiple common variants and, in general, reaped the low-hanging fruit, i.e. variants closely tagging causal alleles with modest to large effect sizes [WTCCC, 2007]. However, individual studies are limited in power by finite sample sizes. Wide collaborative networks culminated in consortium formation with the purpose of meta-analyzing at the genome-wide scale, increasing sample size without the need for de novo genotyping. By synthesizing summary association statistics across multiple GWAS, power is increased and novel discoveries have been made, for example in type 2 diabetes, colorectal cancer, coronary artery disease, and fat distribution [Voight et al., 2010; Houlston et al., 2010; Preuss et al., 2010; Heid et al., 2010].

The vast majority of GWAS meta-analyses to date have focused on populations of similar ancestry (primarily European), circumventing the complications of locus heterogeneity that can have profound effects on power when combining data across different genetic architectures. Statistical heterogeneity is a challenge not only for meta-analyses across diverse populations but also across similar-ancestry data sets when effect sizes are dissimilar. Allelic heterogeneity, which is conceivably the case for loci containing multiple low frequency/rare variants, represents an additional challenge.

The field is poised to continue with ever-increasing large-scale GWAS meta-analyses in parallel to entering the era of next generation association studies involving whole-exome or whole-genome sequencing. In this study, we investigate when GWAS meta-analyses may reach a point of diminishing returns in terms of power to detect disease-associated variants, in the presence and absence of modest levels of heterogeneity and across a range of different allelic architectures.

Replication of findings before declaring success, i.e. before claiming the detection of established disease associations, has become the sine qua non in GWAS. The larger the discovery sample, the smaller the effect sizes that a GWAS meta-analysis can detect. This factor, in combination with winner's curse (i.e. the fact that the originally identified effect size is likely to be overestimated in comparison to its true value), can translate into a requirement for replication data sets that exceed the discovery set in sample size. For current meta-analysis efforts that involve several tens of thousands of samples (even surpassing 100,000, for example for quantitative anthropometric traits such as height), identifying appropriately sized replication data sets can be extremely challenging. In this study, we also explore the combined power of the discovery (Stage 1) and replication (Stage 2) sets to detect associations at genome-wide significance, with a focus on sample size requirements.
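
The winner's curse described above can be illustrated with a short calculation. The sketch below is an assumption-laden simplification, not part of the study: it treats the association test statistic as a unit-variance normal variable centred on the true standardized effect, considers selection in the risk-increasing direction only, and uses the mean of a truncated normal distribution. The function name is hypothetical.

```python
from statistics import NormalDist


def expected_z_given_significant(true_z, alpha=5e-8):
    """Mean of the test statistic among results that pass the threshold.

    Assumes Z ~ N(true_z, 1) and one-sided selection Z > c, where c is the
    two-sided critical value for alpha (the opposite tail is ignored).
    """
    std = NormalDist()
    c = std.inv_cdf(1 - alpha / 2)
    tail = 1 - std.cdf(c - true_z)          # probability of passing the threshold
    return true_z + std.pdf(c - true_z) / tail


# Inflation of the statistic (reported minus true) for a weak and a strong signal.
inflation_weak = expected_z_given_significant(4.0) - 4.0
inflation_strong = expected_z_given_significant(8.0) - 8.0
```

Under these assumptions the inflation is largest for signals whose true effect sits below the significance threshold, which is why replication sets sized on the discovery estimate can be underpowered.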

METHODS

We obtained estimates for the sample sizes required for a range of genetic effect sizes and different risk allele frequencies, keeping power constant at 80%. We investigated different scenarios involving a combination of Stage 1 (discovery) and Stage 2 (replication) samples, to mimic realistic GWAS meta-analysis studies and their follow-up. We considered a sample of cases and an equal number of controls. This 1:1 ratio was kept constant throughout all simulations for Stages 1 and 2 and when considering the effects of heterogeneity. The disease trait was set to have a population risk of 0.05. We assumed that we were measuring the effect of the causal variant itself and examined power under a model that is additive on the log-odds scale for the risk allele.

STAGE 1 (DISCOVERY)

Using the software package Quanto version 1.2.4 [Gauderman, 2002], we estimated the sample sizes required for 12 different ORs ranging from 1.05 to 3.0 (with initial increments of 0.05, followed by 0.1 then 0.5) and 15 different risk allele frequencies ranging from 0.001 to 0.5 (with initial increments of 0.001, followed by 0.01 then 0.1). We applied three different thresholds of significance: P<1 × 10−4, P<1 × 10−5, and P<5 × 10−8 (genome-wide significant; Fig. 1).
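
To make the design parameters concrete, the following is a minimal sketch of a sample size calculation for a 1:1 case-control design. It is not Quanto's algorithm: it assumes a log-additive penetrance model calibrated to the stated population risk, Hardy-Weinberg genotype frequencies in the population, and a normal approximation for a two-sided allelic test. The function names `penetrances` and `cases_required` are hypothetical.

```python
from statistics import NormalDist


def penetrances(raf, odds_ratio, prevalence):
    """Genotype frequencies and disease risks under a log-additive model."""
    # Hardy-Weinberg frequencies for 0, 1 and 2 copies of the risk allele.
    geno = [(1 - raf) ** 2, 2 * raf * (1 - raf), raf ** 2]

    def mean_risk(base_odds):
        risks = [base_odds * odds_ratio ** g / (1 + base_odds * odds_ratio ** g)
                 for g in range(3)]
        return sum(p * f for p, f in zip(geno, risks)), risks

    # Bisection on the baseline odds so that the average risk equals prevalence.
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_risk(mid)[0] < prevalence:
            lo = mid
        else:
            hi = mid
    return geno, mean_risk((lo + hi) / 2)[1]


def cases_required(raf, odds_ratio, alpha, power=0.80, prevalence=0.05):
    """Approximate number of cases (with an equal number of controls)."""
    geno, risks = penetrances(raf, odds_ratio, prevalence)
    case_w = [p * f for p, f in zip(geno, risks)]
    ctrl_w = [p * (1 - f) for p, f in zip(geno, risks)]
    # Risk allele frequency among cases and among controls.
    p_case = (case_w[1] + 2 * case_w[2]) / (2 * sum(case_w))
    p_ctrl = (ctrl_w[1] + 2 * ctrl_w[2]) / (2 * sum(ctrl_w))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_case + p_ctrl) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_case * (1 - p_case)
                             + p_ctrl * (1 - p_ctrl)) ** 0.5) ** 2
    alleles_per_group = numerator / (p_case - p_ctrl) ** 2
    return alleles_per_group / 2  # each individual contributes two alleles


# Example: RAF 0.5, OR 1.2 at genome-wide significance.
n_cases = cases_required(0.5, 1.2, 5e-8)
```

The result for a given RAF, OR and threshold can be compared against the corresponding table entry (for instance, Table III lists 4,800 cases for RAF 0.5 and OR 1.2); exact agreement is not expected because the approximation differs from the software used in the study.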

Fig. 1. Overview of study design. Stage 1 sample sizes were estimated using the following design parameters: population risk of 0.05, additive inheritance model, case-control ratio of 1:1, 80% power. Replication sample sizes were established using the following design parameters: 80% power to detect 100 signals from a GWAS of 2 million markers, α = 5 × 10−8.

STAGE 2 (REPLICATION)

Using the software package CaTS [Skol et al., 2006], we assumed that 100 signals (0.005%) from 2 million markers of a GWAS meta-analysis (for example, using HapMap data to impute genotypes at untyped variants) were selected for replication in Stage 2. This number of markers is representative of the first follow-up step for most GWAS meta-analysis efforts. The sample size necessary to replicate Stage 1 signals at a final significance threshold of 5 × 10−8 in a combined meta-analysis was calculated for each of the scenarios and for each of the three Stage 1 significance thresholds. A combined P-value surpassing genome-wide significance is typically taken as evidence for robust association at disease-associated loci. Sample sizes could not be estimated beyond a threshold of 1,000,000. A total of 540 scenarios were examined.
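
The scenario count and the follow-up fraction quoted above can be checked by enumerating the design grid. The OR and RAF values below are taken from the column and row labels of Tables I to III; the variable names are illustrative only.

```python
from itertools import product

# Grid values as they appear in the row and column labels of Tables I-III.
odds_ratios = [1.05, 1.10, 1.15, 1.20, 1.25, 1.30,
               1.35, 1.40, 1.50, 2.0, 2.5, 3.0]
risk_allele_freqs = [0.001, 0.002, 0.003, 0.004, 0.005,
                     0.01, 0.02, 0.03, 0.04, 0.05,
                     0.1, 0.2, 0.3, 0.4, 0.5]
stage1_thresholds = [1e-4, 1e-5, 5e-8]

scenarios = list(product(odds_ratios, risk_allele_freqs, stage1_thresholds))
n_scenarios = len(scenarios)              # 12 * 15 * 3 = 540

# Fraction of markers taken forward to Stage 2, as a percentage.
followup_percent = 100 * 100 / 2_000_000  # 0.005
```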

HETEROGENEITY

We selected three of the above scenarios for further investigation through simulations incorporating heterogeneity. These were:
  • (1) Allele freq = 0.30, OR = 1.25 (common SNP, moderate effect size)
  • (2) Allele freq = 0.05, OR = 1.25 (low-frequency SNP, moderate effect size)
  • (3) Allele freq = 0.01, OR = 3.0 (low-frequency SNP, large effect size)

Evidence for heterogeneity of genetic effects is commonly investigated using two statistics: Cochran's Q statistic of homogeneity and the I2 measure [Higgins et al., 2003]. We used I2 to quantify the effect of heterogeneity. The I2 measure has a very intuitive interpretation and allows assessing statistical significance and the extent of heterogeneity simultaneously. Different extents of heterogeneity were considered for each of the three scenarios. We assigned adjectives of none, low, moderate, and high heterogeneity to I2 values less than 15%, between 15 and 35%, between 35 and 70%, and greater than 70%, respectively (calculated as the mean across all simulations carried out for the scenario considered) [Higgins et al., 2003].
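
For readers unfamiliar with these statistics, the sketch below computes an inverse-variance fixed-effects estimate, Cochran's Q and I2 from per-study log-odds ratios and standard errors, and then applies the adjectives defined above. The function names are hypothetical, and the handling of values that fall exactly on a boundary (15%, 35%, 70%) is an assumption, since the text does not specify it.

```python
def fixed_effects_meta(betas, ses):
    """Inverse-variance pooled estimate, its SE, Cochran's Q and I2."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 is the share of Q in excess of its degrees of freedom, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i2


def heterogeneity_label(i2):
    """Map I2 (as a proportion) to the adjectives used in this study."""
    if i2 < 0.15:
        return "none"
    if i2 < 0.35:
        return "low"
    if i2 <= 0.70:
        return "moderate"
    return "high"


# Two studies with log-ORs 0.1 and 0.3 and equal standard errors of 0.1.
pooled, pooled_se, q, i2 = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
label = heterogeneity_label(i2)
```

In this toy example the pooled log-OR is 0.2, Q is 2 on one degree of freedom and I2 is 0.5, which falls in the moderate band.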

For our simulations, we assumed a meta-analysis of 10 studies with equal sample size and with a 1:1 case-control ratio. Disease prevalence was fixed at 0.05 (as above). For a given scenario, the allele frequency was fixed, while the ORs varied across studies to produce the selected, combined meta-analysis OR. The amount of variability depended on the level of heterogeneity considered in the scenario under analysis.

In each scenario, 10,000 replicates of data were simulated. In each replicate, and for each of the 10 studies, the genotypes of the SNP were simulated independently based on the allele frequency considered and assuming Hardy-Weinberg equilibrium. Conditional upon the genotype data, the case-control status was simulated according to the additive model. The association analysis was then performed in a logistic regression modeling framework. Based on this, fixed-effects meta-analysis was carried out, including a test of heterogeneity.

Based on these simulations, power to detect association at a significance level of 5 × 10−8 was estimated, with mean I2 determined to evaluate heterogeneity. These simulations were repeated for different sample sizes in order to verify which would reach a detection power of 80%.
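
The simulation procedure can be approximated in a few lines. The sketch below is a deliberately simplified stand-in for the approach described above, not a reimplementation: instead of simulating genotypes and fitting logistic regression, it draws each study's estimated log-OR from a normal distribution whose variance comes from the allelic 2 × 2 table, treats the control allele frequency as equal to the population RAF, and draws study-specific true log-ORs from a normal distribution with standard deviation `tau`. The name `simulate_power` and all of its parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_power(raf, odds_ratio, cases_per_study, tau, n_studies=10,
                   alpha=5e-8, n_reps=2000, seed=1):
    """Return (estimated power, mean I2) for a fixed-effects meta-analysis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    alleles = 2 * cases_per_study            # alleles per group (1:1 design)
    hits, i2_total = 0, 0.0
    for _ in range(n_reps):
        betas, weights = [], []
        for _ in range(n_studies):
            # Study-specific true effect; tau = 0 means no heterogeneity.
            true_beta = rng.gauss(math.log(odds_ratio), tau)
            odds_case = math.exp(true_beta) * raf / (1 - raf)
            p_case = odds_case / (1 + odds_case)
            # Large-sample variance of the allelic log-OR estimate.
            var = (1 / (alleles * p_case) + 1 / (alleles * (1 - p_case))
                   + 1 / (alleles * raf) + 1 / (alleles * (1 - raf)))
            betas.append(rng.gauss(true_beta, math.sqrt(var)))
            weights.append(1 / var)
        pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
        z = pooled * math.sqrt(sum(weights))
        q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
        df = n_studies - 1
        i2_total += max(0.0, (q - df) / q) if q > 0 else 0.0
        hits += abs(z) > z_crit
    return hits / n_reps, i2_total / n_reps
```

Calling the function over a range of `cases_per_study` values and reading off where the estimated power crosses 0.80 mirrors the search described above; because of the approximations, the resulting sample sizes should not be expected to match Table IV.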

RESULTS

Estimated sample sizes for Stages 1 and 2 are given for each of the Stage 1 significance thresholds of 1 × 10−4 (Table I), 1 × 10−5 (Table II), and 5 × 10−8 (Table III) in the absence of heterogeneity. As expected, the results demonstrate a relationship between sample size, risk allele frequency (RAF), and the size of the genetic effect (Figs. 2, 3, and 4). As the genetic variant becomes rarer and the odds ratio becomes smaller, the sample size required to detect the signal becomes increasingly large.

Fig. 2. Sample sizes required to reach 80% power in Stage 1 (A) and (C) for α = 1 × 10−4, and Stage 2 (B) and (D) for an overall α = 5 × 10−8 across a range of ORs and risk allele frequencies. A and B present results for lower risk allele frequencies (up to 0.05); C and D present results for common risk allele frequencies (0.05–0.5).

Fig. 3. Sample sizes required to reach 80% power in Stage 1 (A) and (C) for α = 1 × 10−5, and Stage 2 (B) and (D) for an overall α = 5 × 10−8 across a range of ORs and risk allele frequencies. A and B present results for lower risk allele frequencies (up to 0.05); C and D present results for common risk allele frequencies (0.05–0.5).

Fig. 4. Sample sizes required to reach 80% power in Stage 1 for α = 5 × 10−8 across a range of ORs and risk allele frequencies. A presents results for lower risk allele frequencies (up to 0.05); B presents results for common risk allele frequencies (0.05–0.5).

Table I. Sample size for Stages 1 and 2 for a range of RAFs and ORs
RAF
OR 0.001 0.002 0.003 0.004 0.005 0.01 0.02 0.03 0.04 0.05 0.1 0.2 0.3 0.4 0.5
1.05 Stage 1 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 930,100 470,000 316,700 240,100 194,200 102,700 58,000 44,400 39,000 37,600
Stage 2 760,991 384,545 259,118 196,445 158,891 84,027 47,455 36,327 31,909 30,764
1.1 Stage 1 >1,000,000 >1,000,000 790,000 593,100 475,000 238,800 120,700 81,400 61,700 50,000 26,500 15,000 11,600 10,200 9,900
Stage 2 646,364 485,264 388,636 195,382 98,755 66,600 50,482 40,909 21,682 12,273 9,491 8,345 8,100
1.15 Stage 1 >1,000,000 539,800 360,300 270,500 216,600 108,900 55,100 37,200 28,200 22,800 12,100 6,900 5,300 4,700 4,600
Stage 2 441,655 294,791 221,318 177,218 89,100 45,082 30,436 23,073 18,655 9,900 5,645 4,336 3,845 3,764
1.2 Stage 1 621,900 311,300 207,800 156,000 125,000 62,900 31,800 21,500 16,300 13,200 7,000 4,000 3,100 2,800 2,700
Stage 2 508,827 254,700 170,018 127,636 102,273 51,464 26,018 17,591 13,336 10,800 5,727 3,273 2,536 2,291 2,209
1.25 Stage 1 407,800 204,100 136,300 102,300 82,000 41,200 20,900 14,100 10,700 8,700 4,600 2,700 2,100 1,850 1,800
Stage 2 333,655 166,991 111,518 83,700 67,091 33,709 17,100 11,536 8,755 7,118 3,764 1,800 1,400 1,514 1,473
1.3 Stage 1 290,000 145,200 96,900 72,800 58,300 29,300 14,900 10,000 7,600 6,200 3,300 1,900 1,500 1,350 1,300
Stage 2 237,273 118,800 79,282 59,564 47,700 23,973 9,933 8,182 6,218 4,133 2,700 1,555 1,227 900 1,064
1.35 Stage 1 218,000 109,200 72,900 54,700 43,800 22,100 11,200 7,600 5,700 4,700 2,500 1,400 1,100 1,000 1,000
Stage 2 178,364 89,345 48,600 36,467 29,200 14,733 7,467 5,067 4,664 3,133 1,667 1,400 1,100 1,000 818
1.4 Stage 1 170,700 85,500 57,100 42,900 34,300 17,300 8,800 5,900 4,500 3,700 2,000 1,100 900 800 800
Stage 2 139,664 69,955 38,067 28,600 22,867 11,533 5,867 3,933 3,000 2,467 1,333 1,100 600 655 655
1.5 Stage 1 114,200 57,200 38,200 28,700 23,000 11,600 5,900 4,000 3,000 2,700 1,300 800 600 550 550
Stage 2 93,436 38,133 25,467 19,133 15,333 7,733 3,933 2,667 2,000 1,667 1,064 431 491 450 450
2 Stage 1 34,700 17,400 11,600 8,700 7,000 3,500 1,800 1,200 900 800 400 250 200 200 200
Stage 2 23,133 11,600 7,733 5,800 4,667 2,333 1,200 800 736 431 327 167 133 108 133
2.5 Stage 1 18,200 9,100 6,100 4,600 3,700 1,900 950 650 500 400 250 150 100 100 100
Stage 2 12,133 6,067 3,285 2,477 1,992 1,023 512 350 269 267 83 64 300 122 300
3 Stage 1 11,800 5,900 4,000 3,000 2,400 1,200 600 400 350 300 150 100 100 100 100
Stage 2 6,354 3,177 2,154 1,615 1,292 646 400 267 117 100 81 43 18 18 25
  • Sample sizes (number of cases) are calculated for 80% power to detect association at a significance level of 1 × 10−4 for Stage 1 and a combined Stages 1 and 2 significance level of 5 × 10−8. Estimates are based on assumptions of 1:1 case:control ratios and a population risk of 0.05. RAF, risk allele frequency.
Table II. Sample size for Stages 1 and 2 for a range of RAFs and ORs
RAF
OR 0.001 0.002 0.003 0.004 0.005 0.01 0.02 0.03 0.04 0.05 0.1 0.2 0.3 0.4 0.5
1.05 Stage 1 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 580,400 391,100 296,500 239,800 126,900 71,700 54,900 48,200 46,500
Stage 2 >1,000,000 193,467 130,367 98,833 79,933 42,300 23,900 18,300 16,067 15,500
1.1 Stage 1 >1,000,000 >1,000,000 975,600 732,500 586,600 294,900 149,100 100,500 76,200 61,700 32,700 18,600 14,300 12,600 12,200
Stage 2 325,200 244,167 195,533 98,300 49,700 33,500 25,400 20,567 10,900 6,200 4,767 4,200 4,067
1.15 Stage 1 >1,000,000 666,600 444,900 334,000 267,500 134,500 68,000 45,900 34,800 28,200 15,000 8,500 6,600 5,800 5,700
Stage 2 222,200 148,300 111,333 89,167 44,833 22,667 15,300 11,600 9,400 5,000 2,833 2,200 1,933 1,800
1.2 Stage 1 768,000 384,400 256,600 192,700 154,300 77,600 39,300 26,500 20,100 16,300 8,700 5,000 3,900 3,400 3,350
Stage 2 192,000 96,100 64,150 48,175 38,575 25,867 9,825 8,833 6,700 4,075 2,175 1,250 975 1,133 1,117
1.25 Stage 1 503,600 252,100 168,300 126,400 101,200 50,900 25,800 17,400 13,200 10,700 5,700 3,300 2,600 2,300 2,250
Stage 2 125,900 63,025 42,075 31,600 25,300 12,725 6,450 4,350 3,300 2,675 1,900 825 650 575 750
1.3 Stage 1 358,100 179,300 119,700 89,900 72,000 36,200 18,300 12,400 9,400 7,600 4,100 2,400 1,800 1,650 1,600
Stage 2 89,525 44,825 29,925 22,475 18,000 9,050 4,575 3,100 2,350 1,900 1,025 600 600 550 533
1.35 Stage 1 269,300 134,800 90,000 67,600 54,100 27,200 13,800 9,300 7,100 5,800 3,100 1,800 1,400 1,250 1,250
Stage 2 67,325 33,700 22,500 16,900 13,525 6,800 3,450 2,325 1,775 1,450 775 450 350 417 313
1.4 Stage 1 210,900 105,600 70,500 52,900 42,400 21,300 10,800 7,300 5,600 4,500 2,400 1,400 1,100 1,000 1,000
Stage 2 52,725 26,400 17,625 13,225 10,600 5,325 2,700 1,825 1,400 1,125 600 350 367 250 250
1.5 Stage 1 141,000 70,600 47,100 35,400 28,400 14,300 7,200 4,900 3,700 3,000 1,600 1,000 750 700 700
Stage 2 35,250 17,650 11,775 8,850 7,100 3,575 1,800 1,225 925 750 533 250 188 175 175
2 Stage 1 42,800 21,400 14,300 10,800 8,600 4,400 2,200 1,500 1,150 950 500 300 250 250 250
Stage 2 10,700 5,350 3,575 2,700 2,150 1,100 550 375 288 238 125 75 63 44 63
2.5 Stage 1 22,400 11,200 7,500 5,700 4,500 2,300 1,200 800 600 500 300 200 150 150 150
Stage 2 5,600 2,800 1,875 1,006 1,125 575 212 141 150 88 33 2 26 17 26
3 Stage 1 14,600 7,300 4,900 3,700 3,000 1,500 800 550 400 300 200 100 100 100 100
Stage 2 2,576 1,288 865 653 529 265 89 61 71 100 11 43 18 18 25
  • Sample sizes (number of cases) are calculated for 80% power to detect association at a significance level of 1 × 10−5 for Stage 1 and a combined Stages 1 and 2 significance level of 5 × 10−8. Estimates are based on assumptions of 1:1 case:control ratios and a population risk of 0.05. RAF, risk allele frequency.
Table III. Sample size requirements for a range of RAFs and ORs
RAF
OR 0.001 0.002 0.003 0.004 0.005 0.01 0.02 0.03 0.04 0.05 0.1 0.2 0.3 0.4 0.5
1.05 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 831,100 560,000 424,600 343,400 181,700 102,600 78,500 69,000 66,600
1.1 >1,000,000 >1,000,000 >1,000,000 >1,000,000 >1,000,000 422,300 213,500 143,900 109,200 88,300 46,800 26,600 20,400 18,000 17,450
1.15 >1,000,000 954,500 637,100 478,300 383,100 192,700 97,400 65,700 49,900 40,400 21,500 12,200 9,400 8,400 8,150
1.2 >1,000,000 550,500 367,400 275,900 221,000 111,200 56,200 38,000 28,800 23,300 12,400 7,100 5,500 4,900 4,800
1.25 721,100 361,000 241,000 180,900 144,900 72,900 36,900 24,900 18,900 15,300 8,200 4,700 3,700 3,300 3,200
1.3 512,800 256,700 171,400 128,700 103,100 51,900 26,300 17,700 13,500 10,900 5,800 3,400 2,600 2,400 2,300
1.35 385,600 193,000 128,900 96,800 77,500 39,000 19,800 13,400 10,200 8,200 4,400 2,600 2,000 1,800 1,800
1.4 301,900 151,200 100,900 75,800 60,700 30,600 15,500 10,500 8,000 6,500 3,500 2,000 1,600 1,400 1,400
1.5 201,900 101,100 67,500 50,700 40,600 20,500 10,400 7,000 5,300 4,300 2,300 1,400 1,100 1,000 1,000
2 61,300 30,700 20,500 15,400 12,400 6,200 3,200 2,200 1,650 1,350 750 450 400 350 350
2.5 32,100 16,100 10,800 8,100 6,500 3,300 1,700 1,150 900 700 400 250 200 200 200
3 20,900 10,500 7,000 5,300 4,200 2,100 1,100 800 600 500 300 200 150 150 150
  • Sample sizes (number of cases) are calculated for 80% power to detect association at a significance level of 5 × 10−8 for Stage 1. Estimates are based on assumptions of 1:1 case:control ratios and a population risk of 0.05. RAF, risk allele frequency.

STAGE 1

Altering significance levels for declaring success can have profound effects on sample size requirements. As a representative example, at a significance level of 1 × 10−4, approximately 120,000 cases are required to detect a risk variant with RAF 0.02 and allelic OR 1.10. This sample size increases to 150,000 when the significance threshold is lowered to 1 × 10−5, and to 215,000 when the significance threshold is 5 × 10−8. Upholding the genome-wide threshold of significance of 5 × 10−8 requires at least 8,000 cases and an equal number of controls to detect ORs of 1.15 and below even for variants with common allele frequencies (>0.05) (Table III).

Figure 2A shows that for low RAFs ranging from 0.001 to 0.005, the number of cases required to detect a signal at P<1 × 10−4 is pragmatic for large genetic effects (OR 3.0; approximately 2,400 to 11,800 cases, Table I). The identification of more commonly encountered ORs in complex traits (1.20–1.30), however, requires between 58,000 and 125,000 cases even at an RAF of 0.005.

Large-scale consortia are now starting to surpass combined sample sizes of 100,000. Data sets of this size provide adequate power to detect association with low frequency alleles (RAF down to 0.01) and modest effect sizes of at least 1.20 at genome-wide significance thresholds. However, required sample sizes rapidly approach 200,000 cases for RAFs below 0.01. The detection of association with alleles of MAF<0.05 and a small effect size (allelic OR of 1.05) requires sample sizes of over 340,000 cases and an equal number of controls.

Appropriate sample sizes for the detection of association signals at common variants (MAF>0.05) are more readily achievable (Fig. 2C), as demonstrated by the success of GWAS meta-analyses to date. The reported median RAF of 531 signals from published GWAS studies of common diseases [Hindorff et al., 2009] was 0.36, interquartile range (IQR) 0.21–0.53, and the median OR was 1.33, IQR 1.12–1.61. The sample sizes required to obtain 80% power at genome-wide significance in order to detect these effects would be 4,118 (RAF 0.36, OR 1.33), 1,460 (RAF 0.53, OR 1.61) and 36,132 (RAF 0.21, OR 1.12), respectively (cases and controls combined).

STAGE 2

We estimated replication set sample sizes required to obtain 80% power to detect associations at α = 5 × 10−8 across the combined Stages 1 and 2 analysis, i.e. the discovery set and the replication set. When the Stage 1 threshold is set to 1 × 10−4, a similar number of samples is required in Stage 2 to replicate Stage 1 signals for a final genome-wide threshold of 5 × 10−8 (Fig. 2B and D) across all ranges of allele frequencies and ORs.

If the significance threshold for Stage 1 is decreased to 1 × 10−5, the second stage sample sizes required to obtain a final P-value of 5 × 10−8 are lower. However, the first stage requires a larger sample size to surpass the initial significance threshold (Fig. 3B and D) and the combined sample sizes remain similar for both Stage 1 thresholds. For example, detection of a signal with OR 1.25 and RAF 0.05 at α = 1 × 10−5 requires 10,700 cases (and an equal number of controls) in Stage 1, and 2,675 cases in Stage 2 for a combined α = 5 × 10−8. If the Stage 1 significance threshold were set to 1 × 10−4, the respective sample sizes would be 8,700 in Stage 1 and 7,118 in Stage 2.
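
The trade-off in this example can be made explicit by summing the two stages for each Stage 1 threshold. The figures are the ones quoted in the paragraph above (OR 1.25, RAF 0.05); the dictionary layout is illustrative only.

```python
# Number of cases per stage for OR 1.25 and RAF 0.05, keyed by Stage 1 threshold.
designs = {
    1e-5: {"stage1": 10_700, "stage2": 2_675},
    1e-4: {"stage1": 8_700, "stage2": 7_118},
}

# Combined number of cases across both stages for each design.
totals = {thr: d["stage1"] + d["stage2"] for thr, d in designs.items()}

# Share of the combined sample that must be genotyped in the replication stage.
stage2_share = {thr: d["stage2"] / totals[thr] for thr, d in designs.items()}
```

The combined totals are 13,375 and 15,818 cases respectively, so the stricter Stage 1 threshold shifts most of the burden to the discovery stage while changing the overall requirement only moderately.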

Importantly, these calculations were based on the assumption of no heterogeneity between the discovery and replication sets, and we assumed that the replication data set was drawn from the same population as Stage 1 samples.

HETEROGENEITY

The relationship between sample size, RAF, and OR in the presence of heterogeneity is consistent with that derived from power calculations carried out in the absence of heterogeneity (Table IV). We observe that, as expected, a larger sample size is required as heterogeneity increases. Interestingly, however, the incremental increase needed does not appear to be substantial. For example, in the frequently encountered scenario of a common variant with a modest effect size (RAF 0.30 and OR 1.25), it is necessary to increase the case sample size by ∼600 individuals when heterogeneity levels are high in comparison with that needed when all studies have similar effects. This number decreases to 400 in the presence of more modest levels of heterogeneity.

Table IV. Total number of samples (across 10 studies with equal sample size, and 1:1 case:control ratio) required at different levels of heterogeneity (measured using I2)
Allele frequency   Meta-analysis OR   Mean I2    Number of cases (a)
0.30               1.25               0.11 (b)   3,121
0.30               1.25               0.21       3,515
0.30               1.25               0.45       3,639
0.30               1.25               0.76       3,766
0.05               1.25               0.11 (b)   13,768
0.05               1.25               0.26       15,253
0.05               1.25               0.50       15,674
0.05               1.25               0.82       15,971
0.01               3.00               0.05 (b)   2,314
0.01               3.00               0.17       2,455
0.01               3.00               0.50       2,629
0.01               3.00               0.80       2,852
  • Results are based on 10,000 simulations.
  • (a) Number of cases required for 80% power to detect association at a significance level of 5 × 10−8.
  • (b) All studies with ORs equal to the meta-analysis OR.

The increase in sample numbers required for lower RAFs (0.05) and the same genetic effect size of 1.25 in the presence of high heterogeneity is more pronounced (additional 2,000 cases needed to achieve the same power). However, this represents a smaller proportional increase in sample size (14%). Overall, the number of additional samples required to overcome loss of power due to heterogeneity is modest.
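
The absolute and proportional increases discussed here follow directly from Table IV. The short calculation below uses the tabulated case counts, ordered from the lowest to the highest mean I2 within each scenario; the variable names are illustrative.

```python
# Number of cases from Table IV, ordered by increasing mean I2 per scenario.
# Keys are (allele frequency, meta-analysis OR).
table_iv = {
    (0.30, 1.25): [3121, 3515, 3639, 3766],
    (0.05, 1.25): [13768, 15253, 15674, 15971],
    (0.01, 3.00): [2314, 2455, 2629, 2852],
}

increases = {}
for scenario, cases in table_iv.items():
    baseline, highest = cases[0], cases[-1]
    extra = highest - baseline               # additional cases at high I2
    increases[scenario] = (extra, 100 * extra / baseline)
```

The exact differences are 645, 2,203 and 538 cases, which the text rounds to approximately 600 and 2,000 for the first two scenarios. Percentages computed from the unrounded table entries may therefore differ by a point or two from those quoted in the text.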

DISCUSSION

The issue of statistical power and sample size in genetic association studies has previously been addressed in the literature. For example, Fisher and Lewis [2008] found that required sample sizes increase dramatically with decreasing levels of linkage disequilibrium (LD) between the marker and causal variant. Yang et al. [2010a, b] demonstrated that when the disease prevalence is low (<0.10), a case-control study design with a 1:1 ratio of cases and controls is more powerful than a quantitative trait association study. Moonesinghe et al. [2008] showed that meta-analysis sample size requirements increase steeply when small genetic effects were considered and also in the presence of high levels of between-study heterogeneity. Pereira et al. [2009] demonstrated the importance of choosing the correct model for meta-analysis calculations in simulated combinations of several GWAS data sets.

Our findings indicate that, within our simulation framework parameters, the impact of heterogeneity on sample size requirements is not substantial. Although, as expected, increased sample sizes are required to counteract the effect of increasing heterogeneity, even with modest ORs of 1.25 and low allele frequencies of 0.05, only approximately 20% more samples are required to maintain the same statistical power for genome-wide significance. This observation may be underpinned by the fact that high levels of heterogeneity in our study were attained by considering a wide range of ORs among the 10 studies contributing to the meta-analysis, including easily detectable large effect sizes. There are many parameters one can vary in simulation experiments and each can have implications for the downstream results. In our simulations, we assume a meta-analysis of studies with equal sample size. For scenarios in which studies have different sample sizes, results may vary from those presented here. In the presence of high heterogeneity, the number of samples needed is expected to depend on the sample size of the studies with the largest effect sizes. The required sample sizes may increase relative to those observed in our simulation study if the sample sizes of the studies with the largest effects decrease.

In the presence of substantial allelic heterogeneity, an alternative approach would be to undertake random-effects analysis which does not make an assumption of homogeneity across studies, in which case the sample sizes required may deviate from the numbers presented here. The level of heterogeneity observed in a meta-analysis should in any case be taken into account, and results interpreted with caution when the observed heterogeneity is high.

Sample size and power have been at the heart of genetic association study design for several decades. Recent advances in the field have made genome-wide scans possible and have ushered in a new era of successful complex disease locus identification. The power constraints of studies conducted thus far have led to the discovery of associations with common-frequency variants of large or modest effect size. Indeed, meta-analyses of GWAS to date have, most probably, identified the vast majority if not all risk variants with moderate effects and common allele frequencies. The formation of large-scale international consortia is now enabling the accrual of sample sizes surpassing 100,000. For example, a recent combined analysis across ∼184,000 individuals identified ∼180 loci affecting human height variation. As with the vast majority of other complex traits, the proportion of phenotypic variation explained by the combined set of established signals is less than 10% [Lango Allen et al., 2010]. This recurrent observation indicates that further genetic determinants of complex traits remain unidentified. Their allele frequency and effect size spectra could be diverse, with multiple common-frequency variants of small effect size, and/or low-frequency and rare variants of modest/larger effect size contributing towards this missing heritability. Using comprehensive power calculations, we find that the detection and replication of signals within these RAF and OR constraints require large sample sizes (currently achievable through large-scale collaborative efforts) in order to reach genome-wide levels of significance, and that effect size heterogeneity across meta-analyzed studies drawn from similar populations does not appear to have a profound effect on sample size requirements.

ACKNOWLEDGMENTS

The authors thank Will Rayner for help with data management and John Ioannidis for commenting on the manuscript. J.A. and E.Z. are supported by the Wellcome Trust (WT088885/Z/09/Z), K.C. is supported by a Botnar Fellowship and by the Wellcome Trust (WT079557MA), and A.M. is supported by the Wellcome Trust (WT081682/Z/06/Z).
