The positives, protocols, and perils of genome-wide association†
Please cite this article as follows: Neale BM, Purcell S. 2008. The Positives, Protocols, and Perils of Genome-Wide Association. Am J Med Genet Part B 147B:1288–1294.
Abstract
Genome-wide association aims to comprehensively survey genetic variation for the purposes of disease and trait mapping. We provide a brief history of the development of genetic technology necessary to realize genome-wide association. From there we identify and review the publicly available resources for conducting such work including the molecular technologies, genomic databases, and analytic tools. Following on from the analytic tools, we highlight common analytic considerations, ranging from study design, quality control, and data cleaning to association analysis and replication. We conclude with a look toward future developments such as the analysis of copy number variation and integration of expression and epigenetic phenomena into genome-wide association. © 2008 Wiley-Liss, Inc.
INTRODUCTION
Genome-wide association studies, now applied to a large range of human diseases and traits, are designed to comprehensively survey common genetic variation. The goal is to detect phenotypic associations of modest effect that would have eluded previous linkage and candidate gene approaches. Utilizing new genotyping technologies and genomic resources such as the HapMap [International HapMap Consortium, 2005], a number of whole genome association studies have identified convincing and replicable disease loci for common diseases [Rioux et al., 2007; Saxena et al., 2007]. The approach looks set to accelerate gene discovery across a range of fields, including neuropsychiatric genetics.
A modern whole genome study typically involves genotyping hundreds of thousands of single nucleotide polymorphisms (SNPs) in thousands of individuals. Although genotyping at this density (on the order of a SNP per 5–10 kb) represents only a small proportion of the total number of known SNPs, it captures the majority of all common genetic variation, as we describe below, due to the extensive correlation between SNPs (linkage disequilibrium, LD). In a sufficiently large sample, this whole genome association study design promises the most extensive look at the genome for uncovering common variation predisposing to disease.
In this article, we briefly describe the history of genome-wide association studies (GWAS, also termed whole genome association studies, WGAS), followed by a review of some currently available resources, including molecular technologies, genomic databases, and analytic tools. We outline some key analytic considerations, such as study design, quality control and data cleaning, analysis and replication. Finally, we look to future developments such as copy number variation (CNV), total coverage and sequencing.
DEVELOPMENTS LEADING TO WHOLE GENOME STUDIES
Large-scale genomic projects paved the way for the shift from candidate gene association to GWAS by cataloguing and understanding genetic variation. Three main projects were critical: the Human Genome Project (HGP), the SNP Consortium and the International HapMap Project (HapMap) [Lander et al., 2001; Sachidanandam et al., 2001; International HapMap Consortium, 2005]. The HGP provided a consensus sequence, which dramatically enhanced the efforts of the SNP Consortium in SNP discovery. With the vast database of identified SNPs, the HapMap project embarked on characterizing LD, enabling further development of cost-effective genotyping platforms.
The proportion of human variation that needs to be captured for a study to be classified as a GWAS is open for debate [Barrett and Cardon, 2006]. For the purposes of this article, a GWAS is required to have genotyped at least 80,000 SNPs or the majority of known non-synonymous variation. The earliest attempts at GWAS were not SNP chip based, but rather high-throughput genotyping of approximately 80,000 gene-centric variants from Yusuke Nakamura's lab [Ohnishi et al., 2001]. This group has published GWAS on myocardial infarction, nephropathy and Crohn's disease [Ozaki et al., 2002; Ohtsubo et al., 2005; Yamazaki et al., 2005]. However, the setup required to execute such a system is extensive and expensive. The subsequent development of comparatively cheap genotyping technologies with little to no overhead required made GWAS readily available, particularly if the investigator is willing to outsource genotyping.
The first major success story of 100K SNP chip GWAS is age-related macular degeneration (AMD), with the identification of variation in the complement factor H gene [Klein et al., 2005]. Aside from AMD, the use of 100K SNP chips identified variation in NOS1AP (a.k.a. CAPON) influencing QT interval on an electrocardiogram [Arking et al., 2006]. Both of these findings showed significant replication in a number of additional studies, and are almost certainly true associations [Edwards et al., 2005; Hageman et al., 2005; Haines et al., 2005; Arking et al., 2006; Maller et al., 2006; Post et al., 2007]. The rapid success of mapping a significant percentage (∼25%) of the risk factors for AMD has not, however, been borne out by other diseases: a much smaller fraction of the risk factors has been identified for many other diseases (e.g., type 1 and type 2 diabetes, Crohn's disease, prostate and breast cancer).
Since these initial studies, a number of other groups have proceeded with GWAS. Efforts on obesity [Herbert et al., 2006], Parkinson's disease [Maraganore et al., 2005], type 2 diabetes [Saxena et al., 2007; Scott et al., 2007; Sladek et al., 2007; Steinthorsdottir et al., 2007], prostate cancer [Gudmundsson et al., 2007; Yeager et al., 2007], Crohn's disease [Rioux et al., 2007], and breast cancer [Easton et al., 2007] have been published.
Two major initiatives are generating genome-wide association data: the Wellcome Trust Case Control Consortium (WTCCC) and the Genetic Association Information Network (GAIN). The WTCCC is a UK study comprising cohorts of 2,000 cases for each of the following diseases: tuberculosis, coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension, along with a shared control sample of 3,000 individuals. The control genotypes are already available at www.wtccc.org.uk and the case genotypes will be made publicly available. Initial results for these scans have recently been published, showing promising results for many, though not all, of the disease phenotypes examined [Easton et al., 2007; Frayling et al., 2007; Parkes et al., 2007; Samani et al., 2007; Todd et al., 2007; Wellcome Trust Case Control Consortium, 2007; Zeggini et al., 2007]. In particular, the bipolar scan has shown little in the way of true association, indicating that psychiatric disease may prove more difficult than metabolic disorders. Similarly, a recent genome-wide association scan of bipolar disorder by Sklar and colleagues [2008] did not show results consistent with the WTCCC study, indicating that the effect size for risk variation for bipolar disorder is likely to be modest. GAIN is a United States National Institutes of Health initiative, generating genotypes on approximately 600K markers for schizophrenia, bipolar disorder, diabetic nephropathy, ADHD, major depression and psoriasis. More information about GAIN can be found at http://www.fnih.org/gain2/home_new.shtml.
RESOURCES
Numerous resources are available to aid whole genome studies, many of which were initially developed for linkage mapping, or have arisen from the HGP and HapMap. Here we present a brief list of some of these resources: further information is available at the websites noted.
SNP Chips
The commercial, technological development of SNP chips has been critical in the development of GWAS. These technologies allow for hundreds of thousands of genotypes per individual to be rapidly and affordably measured. Currently, Affymetrix and Illumina produce genome-wide arrays; Perlegen also provides genotyping, notably for GAIN. Both Affymetrix and Illumina have developed chips to genotype approximately one million SNPs; these products also provide CNV information (see Future Directions Section). More information about these products can be found at www.affymetrix.com and www.illumina.com. The true genomic coverage of these products is considerably greater than merely the number of SNPs because of the LD patterns. Briefly, LD is the non-random assortment of alleles within the population. One consequence of LD is that typing all variation in the genome is unnecessary as SNPs provide information for other loci. Already, the patterns from the HapMap are being used to test SNPs in a multi-marker framework [de Bakker et al., 2005; Pe'er et al., 2006] or to impute unknown SNPs [Marchini et al., 2007]. Generally, Illumina coverage tends to be slightly deeper because of the utilization of HapMap LD information.
Other studies have employed DNA pooling methodologies to reduce costs, estimating allele frequencies in cases and controls rather than individual genotyping. Examples of this approach have been published for nicotine dependence [Bierut et al., 2007; Uhl et al., 2007], bipolar disorder [Baum et al., 2008], osteoarthritis [Abel et al., 2006], supranuclear palsy [Melquist et al., 2007], and lung cancer [Spinola et al., 2007]. Other studies have focused only on non-synonymous variation at a genome-wide level: for example, Crohn's disease [Hampe et al., 2007], type 1 diabetes [Smyth et al., 2006], and Alzheimer's [Grupe et al., 2007].
Online Resources
A number of internet resources provide information for accessing and understanding the results of GWAS. The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) hosts a number of relevant resources, such as dbGAP, which hosts GWAS genotypes and results including GAIN. Such online databases, particularly when linked to existing resources such as PubMed (a searchable index of publications), GenBank (a genetic sequence database) and Entrez (a search engine for nucleotide, protein, structure, taxonomy, genome, expression, and chemical databases) will provide a powerful means to store, share, and mine GWAS data. The University of California at Santa Cruz hosts a genome browser at http://genome.ucsc.edu/cgi-bin/hggateway tailored more for comparative genomics. HapMap also provides a genome browser, annotated with LD information, useful for the identification of tagging SNPs and correlation across associated regions (www.hapmap.org).
A number of shared controls sets are available on the internet. The WTCCC provides their control dataset (pending an application process). All of the GAIN controls will be made available (but with use in publication delayed until after a nine month proprietary period as well as an application process). The Coriell Institute (https://queue.coriell.org/q) provides a case/control study of amyotrophic lateral sclerosis [Schymick et al., 2007].
Collaboration and Consortia
Essential for the success of GWAS is increasing sample size to detect variants of small effect. The WTCCC is an excellent example of collaboration with this aim. Benefits of such collaboration include the pooling of case samples from across the UK as well as drawing from the experience of analysts, geneticists, and clinicians on major collections [Wellcome Trust Case Control Consortium, 2007]. Additionally, ongoing collaboration between the major type 2 diabetes projects (DGI, FUSION, and Novartis samples) [Saxena et al., 2007] has dramatically improved the detection of risk alleles. Similarly, the GAIN initiative comprises collaborative efforts such as the International Multi-centre ADHD Genetics (IMAGE) project [Brookes et al., 2006; Kuntsi et al., 2006], the Major Depression Disorder project, and the Bipolar and Schizophrenia work.
Software
One difficult and important aspect of GWAS is managing the data. Given a sample size of five thousand individuals, with a million SNPs, such datasets contain five billion genotype datapoints. This presents a computational as well as a statistical burden of multiple testing. Adding multiple phenotypes, covariates and modifiers to the basic analysis adds further burden.
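The arithmetic above can be checked with a short sketch. The function name and the packing of 2 bits per genotype are illustrative assumptions (two bits are enough to encode three genotype classes plus a missing value); actual file sizes depend on the format used.

```python
# Back-of-the-envelope size of a GWAS genotype matrix.
# Assumption: each genotype is packed into 2 bits (AA, AB, BB, or missing).

def genotype_matrix_size(n_individuals, n_snps, bits_per_genotype=2):
    """Return (number of genotypes, bytes needed at the given packing)."""
    n_genotypes = n_individuals * n_snps
    n_bytes = n_genotypes * bits_per_genotype / 8
    return n_genotypes, n_bytes


n_genotypes, n_bytes = genotype_matrix_size(5_000, 1_000_000)
print(f"{n_genotypes:,} genotypes, about {n_bytes / 1e9:.2f} GB at 2 bits each")
```

Storing the same matrix as plain text, at several bytes per genotype, would be many times larger, which is one reason compact representations matter at this scale.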
One approach is to use general statistical packages such as R, Stata, and SAS, which offer extensive statistical tests and models, but more limited genetic analyses (e.g., support for family-based studies or haplotype analysis). An R package, snpMatrix, is available to handle GWAS data and perform basic tests [Clayton and Leung, 2007]. A number of computational tools have been developed specifically for large-scale or whole-genome association studies: PLINK (www.pngu.mgh.harvard.edu/∼purcell/plink) [Purcell et al., 2007]; PBAT (http://www.biostat.harvard.edu/∼clange/default.htm) [Lange et al., 2004; Van Steen and Lange, 2005]; snptest (http://www.stats.ox.ac.uk/∼marchini/software/gwas/snptest.html) [Marchini et al., 2007; Wellcome Trust Case Control Consortium, 2007]; and EIGENSTRAT/EIGENSOFT (http://genepath.med.harvard.edu/∼reich/eigenstrat.htm) [Patterson et al., 2006; Price et al., 2006]. Haploview 4.0 (http://www.broad.mit.edu/mpg/haploview/) [Barrett et al., 2005] has been extended to provide a browser for GWAS results integrated with PLINK; it can also download the HapMap data to generate LD and tagging information for a specific region of the genome.
ANALYTIC CONSIDERATIONS
Study Design
Many aspects of WGAS study design are similar to candidate gene association analysis. Both case–control and family-based association study designs can be employed. Thus far, most WGAS are case–control, primarily because of the increased power per genotype compared to family-based designs [McGinnis et al., 2002]. Good experimental procedure, such as randomization of cases and controls across plates, is important to protect against bias. Matching of controls to the cases, with a particular focus on ancestry, is recommended. The magnitude of WGAS datasets brings some other study design issues into play, however. One is the utility of multi-stage designs, which have been suggested as an approach to control costs [Van den Oord, 1999; Skol et al., 2006], although the relative costs of different genotyping platforms are constantly changing. Because these datasets are expensive to collect and a fixed marker set is employed (for a given genotyping platform), the idea of using shared control datasets is both desirable and feasible. Shared controls also bring some difficult challenges, however: the ability to ensure consistency across different samples, genotype calling algorithms and/or laboratory procedures; the trade-off in terms of power and false positives when adding increasingly less well-matched controls; and the interpretation of replication if two studies use different case samples but the same control sample [Hamer and Sirota, 2000].
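As an illustration of plate randomization, the following minimal sketch shuffles cases and controls together before dealing them onto plates, so that case status is not aligned with plate. The function name, the 96-well plate size, and the sample identifiers are assumptions made for the example; real layouts usually also reserve wells for duplicates and reference samples.

```python
import random
from collections import Counter

def assign_plates(sample_ids, wells_per_plate=96, seed=2008):
    """Shuffle all samples together, then fill plates in the shuffled order.

    Returns a dict mapping sample id -> plate index (0-based).
    A fixed seed keeps the layout reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return {sid: index // wells_per_plate for index, sid in enumerate(shuffled)}


# Hypothetical study: 100 cases and 100 controls.
samples = [f"case_{i}" for i in range(100)] + [f"control_{i}" for i in range(100)]
plate_of = assign_plates(samples)

# Count cases on each plate to confirm that no plate is all cases or all controls.
cases_per_plate = Counter(p for sid, p in plate_of.items() if sid.startswith("case_"))
print(dict(cases_per_plate))
```

A stratified variant, which forces the same case to control ratio on every plate, is a reasonable alternative when plates are few.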
Quality Control
Ensuring the quality of the genotype data from GWAS is essential for drawing accurate conclusions from association analysis. Considering a dataset of a million SNPs, if only 0.5% of the SNPs are systematically biased assays, this still corresponds to 5,000 biased tests, potentially yielding an unacceptable false positive rate. To control for such pitfalls, data quality thresholds are applied. In general, the key motivation behind quality control is that, as the prior probability of a SNP showing true significance is low, discarding SNPs for reasons such as missingness, minor allele frequency, Mendelian errors, and Hardy–Weinberg disequilibrium is unlikely to remove true associations. Many of the cleaning quality metrics described below are consistent with previous WGAS [Saxena et al., 2007; Wellcome Trust Case Control Consortium, 2007] and a review of good experimental procedure for such studies [Manolio et al., 2007]. A tension between retaining genotype information and controlling for bias still exists, but with procedures such as imputation, such concerns are assuaged [Marchini et al., 2007].
A good indicator of genotype probe performance for SNP chips is the call rate across the sample. We recommend examining the distribution of missingness across the sample to identify problematic SNPs. In addition to a global missingness threshold, comparing missingness between cases and controls, via a chi-square test, is suggested. Similar considerations for the level of genotyping of each individual are also recommended, as a low genotyping rate is a marker for poor DNA quality. As an example of the problem of missingness, the second highest SNP from the AMD GWAS, rs10272438, was a false positive due to missingness [Klein et al., 2005]. Approximately 15% of the genotypes failed, and when the SNP was genotyped using another technology it showed no association. In fact, differential missing rates between cases and controls can induce false positive association [Clayton et al., 2005].
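The comparison of missingness between cases and controls can be sketched as a Pearson chi-square test on a 2×2 table of missing versus called genotypes. This is a minimal illustration under stated assumptions: the function names and the counts are hypothetical, no continuity correction is applied, and for sparse tables an exact test may be preferable.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p-value). For 1 df the upper tail of the chi-square
    distribution equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 0.0, 1.0  # a degenerate table carries no information
    stat = n * (a * d - b * c) ** 2 / math.prod(margins)
    return stat, math.erfc(math.sqrt(stat / 2))

def differential_missingness(miss_cases, n_cases, miss_controls, n_controls):
    """Test whether the missing-genotype rate of one SNP differs by case status."""
    return chi2_2x2(miss_cases, n_cases - miss_cases,
                    miss_controls, n_controls - miss_controls)


# Hypothetical SNP: 15% missing in 1,000 cases versus 5% in 1,000 controls.
stat, p = differential_missingness(150, 1000, 50, 1000)
print(f"chi-square = {stat:.2f}, p = {p:.2e}")
```

A SNP flagged by such a test is a candidate for removal or for re-genotyping on another platform before any association result is interpreted.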
Another key measure of the quality of the genotypes is reproducibility, as assessed through intentional sample duplication. For example, HapMap samples can be used to generate quality control metrics based on sample concordance with the existing genotyping. Additionally, HapMap individuals are uniquely identifiable, and so can act as positive controls for potential laboratory mishandling (e.g., plate orientation). If a family-based design is adopted, then Mendelian checks also provide a first pass at sample integrity. As some random errors are generally expected, the thresholds for Mendelian inconsistencies and sample duplication mismatch tend to be less conservative, and are set so that the observed number of errors is unlikely to have arisen by chance.
Potential batch effects are also important to examine. Often, not all samples are genotyped with the same product at the same time, raising the possibility of batch effects. Considering the availability of shared control sets, such phenomena are commonplace for WGAS. Other lines of enquiry for batch effects include: different DNA sources (e.g., blood vs. buccal vs. saliva), different extraction techniques, different centers contributing DNA, different technical procedures, or plate effects. A look at the data in chronological genotyping order may also yield insight into potential sources of error, as stock changes in the lab may prove important.
Minor allele frequency (MAF) thresholds are also recommended, as many studies do not have the power to detect significant association for very rare variation. Of course, the MAF threshold is dependent on the sample size, but a decent rule of thumb is observing at least 20–30 copies of the minor allele in the total sample. Current genotype calling algorithms rely on clustering points on an intensity scale, and so rare genotypes are also more prone to error (e.g., it is difficult to define a cluster with only one observation). Comparing observed genotype frequencies in controls against the HapMap allele frequency can also provide evidence for bias.
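The rule of thumb translates into a sample-size-dependent MAF threshold. The sketch below assumes autosomal SNPs in diploid individuals (two alleles per person); the function names and the default of 25 copies, chosen as the midpoint of the 20–30 range, are illustrative.

```python
def maf_threshold(n_individuals, min_minor_copies=25):
    """MAF below which fewer than `min_minor_copies` minor alleles are expected.

    Assumes 2 alleles per individual (autosomal, diploid).
    """
    return min_minor_copies / (2 * n_individuals)

def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts of one SNP."""
    n_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (2 * n_bb + n_ab) / n_alleles
    return min(freq_b, 1 - freq_b)


for n in (500, 2000, 5000):
    print(f"N = {n}: 20 copies -> MAF {maf_threshold(n, 20):.4f}, "
          f"30 copies -> MAF {maf_threshold(n, 30):.4f}")
```

For a sample of 2,000 individuals, for example, 20 copies of the minor allele correspond to a MAF of 0.5%.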
Testing for deviation from Hardy–Weinberg equilibrium (HWE) may provide further information about the validity of the genotypes from a SNP. However, such endeavors are confounded by both population stratification and true association signal. Therefore, markers passing all criteria except for HWE ought to be considered carefully rather than discarded out of hand. Another approach is to define a more stringent threshold, such as 0.000001 for deviation from HWE. HWE tests can be calculated on only the controls or in the entire sample. The justification for considering only the controls for HWE is that positive association may confound the HWE test [Sham, 1997].
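A test for deviation from HWE can be sketched as a 1 degree of freedom chi-square goodness-of-fit test on the genotype counts of a single SNP. This is an illustration under stated assumptions: the function name and counts are hypothetical, and for rare alleles, where expected counts are small, an exact test is generally considered more appropriate than this approximation.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one SNP.

    Returns (statistic, p-value). Monomorphic or empty SNPs return (0.0, 1.0).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0, 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical control genotype counts for one SNP.
stat, p = hwe_chi2(298, 489, 213)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
print("flag for review" if p < 1e-6 else "passes the stringent HWE threshold")
```

Passing only the control counts to the function corresponds to the controls-only variant of the test described above.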
Beyond these initial cleaning techniques, further checks for family structure are suggested in the case of family-based data. Non-paternity is a potential problem for trio and sibship designs, which can be easily detected by looking at identity-by-state (IBS). For case–control designs, the same IBS information can be used to determine identity-by-descent (IBD) information across the sample (see Purcell et al. 2007 for more details on the relationship between IBS and IBD at a population level). Examining both IBS and IBD information can identify sample mix-up (via different IBD patterns), cryptic relatedness (high IBD sharing), and sample contamination (excess heterozygosity and IBD).
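Pairwise IBS sharing can be computed directly from genotype calls. The sketch below assumes genotypes coded as the count of one allele (0, 1, or 2) with `None` for a missing call; the function name and example vectors are hypothetical. It covers IBS only: estimating IBD from IBS additionally requires allele frequencies and is not shown.

```python
def ibs_proportion(genotypes_1, genotypes_2):
    """Average proportion of alleles shared identical-by-state by two individuals.

    Genotypes are allele counts (0, 1, 2); None marks a missing call.
    At each SNP the pair shares 2 - |g1 - g2| alleles IBS.
    """
    shared = 0
    compared = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if g1 is None or g2 is None:
            continue  # skip SNPs where either call is missing
        shared += 2 - abs(g1 - g2)
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs with both genotypes called")
    return shared / (2 * compared)


sample_a = [0, 1, 2, 1, 0, 2, None, 1]
sample_b = [0, 1, 2, 1, 0, 2, 2, 1]     # matches sample_a wherever both are called
sample_c = [2, 1, 0, 0, 1, 2, 1, None]

print(ibs_proportion(sample_a, sample_b))
print(ibs_proportion(sample_a, sample_c))
```

A pair with IBS close to 1 across the genome suggests a duplicated sample or monozygotic twins, whereas a parent and offspring should never show an IBS of 0 at any SNP in the absence of genotyping error.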
For case–control and population-based quantitative analysis, population stratification is a key potential confounding factor. With whole-genome association data, however, the ability to identify population structure is dramatically improved. PLINK includes routines to cluster individuals based on IBS sharing for population classification. Aside from assigning individuals to clusters, a correction to the inflation of the association statistic can be applied by principal components analysis [Price et al., 2006].
For a given associated SNP, it is worthwhile to see whether nearby SNPs or haplotypes that are correlated with the variant also show association with disease; if the associated SNP is rare or has a high missing rate, confirming that the association is also seen with haplotypes formed by common, high genotyping SNPs is, whenever possible, desirable. A SNP that shows a strong association but for which all the correlated, neighboring variants are not associated, is more likely to represent an artifact.
As a final check, the distribution of association test statistics is a useful indicator for sources of bias. Gross enrichment of the distribution of the association evidence is a hallmark sign of bias. Furthermore, extremely significant P-values, such as 10⁻⁶⁰, are more likely than not due to batch effects, non-random missingness or data-handling errors. For further information about data cleaning considerations, we recommend a recent feature in Nature from NCI-NHGRI [Chanock et al., 2007] and the WTCCC manuscript [Wellcome Trust Case Control Consortium, 2007].
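One common way to summarize the distribution of test statistics, not described in detail here and offered only as an illustration, is a genomic inflation factor: the median of the observed 1 degree of freedom chi-square statistics divided by the median expected under the null (approximately 0.455). The function name and the simulated data are assumptions for the example.

```python
import random
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Ratio of the observed median 1-df chi-square statistic to its null median.

    Values near 1 are consistent with the null distribution; values well
    above 1 suggest systematic inflation (e.g., stratification or batch effects).
    """
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


# Simulated null statistics: squares of standard normal deviates.
rng = random.Random(1)
null_stats = [rng.gauss(0, 1) ** 2 for _ in range(20_000)]
inflated_stats = [1.3 * s for s in null_stats]  # uniformly inflated by 30%

print(f"null: {genomic_inflation(null_stats):.3f}")
print(f"inflated: {genomic_inflation(inflated_stats):.3f}")
```

A quantile-quantile plot of observed against expected statistics conveys the same information graphically and also shows whether any inflation is confined to the tail.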
ANALYSIS
The three most common analytic techniques for case–control analysis are the χ² test of allele counts, trend tests (where a multiplicative model is assumed for the regression based on genotype category, coded as 0, 1, and 2), and a 2 degree of freedom genotypic model (where one genotype category is assumed as baseline and the effects of the other two categories are modeled). For family-based analysis, the TDT [Spielman et al., 1993] for trios and the sib-TDT [Spielman and Ewens, 1998] (using siblings discordant for disease) are obvious choices. Quantitative methods include regression models for population data, following a parameterization similar to that of the case–control tests, while quantitative approaches have also been developed for families [Rabinowitz, 1997; Fulker et al., 1999; Lange et al., 2004].
As well as testing directly genotyped SNPs, consideration of haplotype structure enables one to test ungenotyped variation. One approach would be to specify haplotypes based on sliding windows of SNPs, or on haplotype blocks based on the LD structure of the observed data. An alternate approach is to use information from the HapMap to specify more precise haplotype tests specifically for the HapMap SNPs that were not directly genotyped in the study. For example, for a fixed genotyping platform, Pe'er et al. 2006 compiled lists of single SNPs and two and three SNP haplotypes that are in strong LD with ungenotyped HapMap SNPs.
Beyond these initial tests, a number of other techniques are frequently employed. Based on the tagging information from HapMap, tests of two and three marker haplotypes which are proxies for known variants can be conducted. Furthermore, different imputation methods are being developed to generate genotypes at untyped loci jointly with information from a reference panel based on LD patterns in the HapMap and further untyped variation based on ancestral recombination graphs [Marchini et al., 2007]. The benefit of imputation is still to be fully evaluated: with increasing chip densities, the majority of common variation may well be directly captured. Perhaps one particularly useful application of imputation will be to reconcile results and merge data for WGAS studies that have used different genotyping platforms. Finally, multi-marker tests that consider whole pathways and genes simultaneously, instead of single variants, are another area of promise.
All of the above methods fall broadly under traditional association analyses and are targeted at the common disease/common variant (CDCV) hypothesis, that variation predisposing to disease will be common within the population and of modest effect. In contrast, the multiple rare variant (MRV) hypothesis states that variation predisposing to disease is rare and of small to modest effect (with the extreme example being that every case for a given disease has a set of private mutations). In all likelihood, both the CDCV and the MRV hypotheses hold to some degree for the etiology of common disease within the population. How WGAS of common SNPs will fare when the MRV hypothesis is true for a substantial proportion of the genetic variation for a particular disease is unclear. New methods and models are being developed that might partially address this problem. For example, comparing LD information between cases and controls may shed light on rare variation [Zaykin et al., 2006]. Alternatively, using WGAS data, one might look for regions of increased ancestral sharing between cases, as individuals sharing the same rare variant are also likely to share an extended, surrounding region [Purcell et al., 2007]. Homozygosity and admixture analyses are additional lines of enquiry for the mapping of risk-conferring variation [Lander and Green, 1987; Reich and Patterson, 2005]. Ultimately, sequence data will likely become routinely available, to complement common polymorphism data and drive the investigation of rare variation.
Multiple Testing
The number of association tests for WGAS is staggering. Standard approaches for multiple testing, including the Bonferroni correction and the False Discovery Rate (FDR), can be used to control the error rate of the study. For family-based association, one analytic possibility is to condition on the between-family information to select SNPs for the within-family test, to reduce the multiple testing burden [Lange et al., 2004]. By selecting SNPs in this way, the number of tests to be corrected for under Bonferroni is reduced to the number of SNPs analyzed in the within-family test. However, as the between and within information are independent, it may be more efficient to combine the evidence for association [Skol et al., 2006]. Risch and Merikangas [1996] proposed a threshold of 10⁻⁶ based on the number of known SNPs at the time, though a more realistic threshold is perhaps on the order of 10⁻⁷ or even 10⁻⁸, assuming approximately a million testable variants and using the Šidák correction [Šidák, 1967]. Permutation analysis is also an avenue for generating appropriate experiment-wide P-values, but such efforts may not appropriately control for all SNPs potentially tested.
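The per-test thresholds implied by these corrections can be worked out directly. The sketch below assumes a study-wide error rate of 0.05 and one million independent tests; the function names and the example P-values are illustrative, and the FDR procedure shown is the Benjamini–Hochberg step-up rule, one common FDR method.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def sidak_threshold(alpha, n_tests):
    """Per-test threshold under the Sidak correction: 1 - (1 - alpha)^(1/m).

    Written with log1p/expm1 to stay accurate when n_tests is very large.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)

def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests declared significant at FDR level q (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


m = 1_000_000
print(f"Bonferroni: {bonferroni_threshold(0.05, m):.3e}")
print(f"Sidak:      {sidak_threshold(0.05, m):.3e}")

example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("FDR-significant indices:", benjamini_hochberg(example_p))
```

With a million tests the Bonferroni and Šidák thresholds are nearly identical, both close to 5 × 10⁻⁸, which falls within the range quoted above.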
Replication and follow-up studies are essential for determining whether identified variants are true or false positives (although it is worth remembering that if hundreds or even thousands of SNPs are followed up, then a predictable proportion will replicate purely by chance also). With replication and follow-up come the difficulties of meta-analysis. Ideally, data sharing is encouraged to provide maximal information about the association evidence. If this is not possible, then combining evidence based on the direction and magnitude of the effect is encouraged. As a last resort, Fisher's combination of P-values can be utilized.
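Two of the options for combining evidence across studies can be sketched as follows. Both functions and their inputs are illustrative assumptions: the first implements Fisher's combination of P-values, and the second uses fixed-effect inverse-variance weighting of per-study effect estimates (e.g., log odds ratios with standard errors) as one common way to combine direction and magnitude.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under the null.

    For even degrees of freedom the upper tail has a closed form:
    exp(-x/2) * sum_{i<k} (x/2)^i / i!
    """
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals x / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect meta-analysis; returns (pooled beta, pooled SE, two-sided p)."""
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1 / total_weight)
    z = beta / se
    return beta, se, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical results for one SNP in three studies.
print(f"Fisher combined p = {fisher_combined_p([0.01, 0.04, 0.20]):.4f}")
beta, se, p = inverse_variance_meta([0.20, 0.15, 0.25], [0.10, 0.08, 0.12])
print(f"pooled beta = {beta:.3f}, SE = {se:.3f}, p = {p:.2e}")
```

Fisher's method ignores the direction of effect, so two studies with opposite effects can still combine to a small P-value, which is one reason it is best kept as a last resort.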
FUTURE DIRECTIONS
Recently, a coalition of clinicians, geneticists, and analysts has formed the Psychiatric GWAS Consortium (PGC) [The Psychiatric GWAS Consortium, submitted], which aims to encourage data-sharing and perform a comprehensive meta-analysis of genome-wide association studies of psychiatric disease. The current focus is on ADHD, autism, bipolar disorder, major depression, and schizophrenia, looking both within and across disorders. In total, there will be in excess of 25 billion genotypes for meta-analysis, representing the largest genetic study in psychiatry ever conducted.
Genetics as a field continues to develop technologies for studying the human genome at finer and finer scales. The most recent SNP chip technologies provide some insight into CNVs. CNVs are loosely defined as regions of the genome, approximately 1 kb or longer, that vary in copy number relative to a given reference sequence [Feuk et al., 2006]. Already a handful of studies have been published on the effects of CNVs on gene expression and phenotypes [McCarroll et al., 2006; Sebat et al., 2007; Stranger et al., 2007; Wong et al., 2007].
Complete coverage of the genome with respect to LD, and eventually full sequence information, will become available for analysis. Such extensive information will require even more careful data management. However, many of the existing tools for analysis can be applied to such data. Sequencing enables examination of rarer variation as a potential cause of disease. The analysis of such variation will likely require the development of new statistical models. In addition to sequence data, expression and epigenetic information will also become cheaper to obtain. For an excellent review of global gene expression see Rockman and Kruglyak [2006], encompassing the genetics of global gene expression thus far, features of regulatory sequence variation, and genomic effects such as cis-acting, trans-acting, cis-regulatory, and protein-coding on gene expression. Epigenetics examines DNA structure (e.g., histone placement) and methylation patterns; for a review see van Vliet et al. [2007].
CONCLUSIONS
WGAS promise the most extensive look at the genome for uncovering variation predisposing to disease. Technology will continue to develop yielding a wealth of data for identifying the etiology of disease. While WGAS will not identify all of the genetic factors, new biochemical pathways will be identified for investigation. For many diseases, which are known to be strongly heritable, finding even one or two true disease genes could potentially transform the research in that disease area, even if the majority of genetic determinants elude detection in that particular study. Given the difficulty of mapping genetic variation for neuropsychiatric disease, even greater care is necessary for successful association mapping.