Assessing the Enrichment Performance in Targeted Resequencing Experiments†
Communicated by Graham R. Taylor
Abstract
Target enrichment strategies are a very common approach to sequence a predefined part of an individual's genome using second-generation sequencing technologies. While highly dependent on the technology and the target sequences selected, the performance of the various assays is also variable between samples and is influenced by the way the libraries are handled in the laboratory. Here, we show how to obtain detailed information about the enrichment performance using a novel software package called NGSrich, which we developed as a part of a whole-exome resequencing pipeline in a medium-sized genomics center. Our software is suitable for high-throughput use and the results can be shared using HTML and a web server. Finally, we have sequenced exome-enriched DNA libraries of 18 human individuals using three different enrichment products and used our new software for a comparative analysis of their performance. Hum Mutat 33:635–641, 2012. © 2012 Wiley Periodicals, Inc.
Introduction
Since next-generation sequencing technologies were launched just a few years ago, systematic resequencing of the human genome in larger cohorts has found its way into small-sized and medium-sized research centers. As whole-genome sequencing is still comparatively expensive, target enrichment strategies are currently a very common technique in ongoing resequencing studies. In traditional Sanger sequencing, the target regions on the genome used to be enriched by PCR, which is not feasible in terms of cost and labor input when larger genomic regions are to be selected. Microdroplet-based PCR enrichment [Tewhey et al., 2009] enables a larger panel of DNA target regions to be selected, although this technique is still far from being suitable for exome-scale studies. Higher throughput can be achieved, for example, through enrichment by hybridization on high-density microarrays [Albert et al., 2007] or in-solution hybridization [Gnirke et al., 2009], enabling whole-exome capture and varying degrees of automation. Each of these technologies has its issues and technical peculiarities, which have been reviewed by Mamanova et al. (2010). Array-based and in-solution capture methods of the previous generation were compared in a recent study [Kiialainen et al., 2011].
As the performance of each of the aforementioned assays is also highly variable from sample to sample, enrichment parameters are very important quality criteria to be assessed after sequencing. However, there is no out-of-the-box software solution for this computationally intensive task, so laboratories define their own standards and invest effort in developing software that often remains at an immature stage. To fill this gap, we have developed a methodological standard for the evaluation of the performance of target enrichment in next-generation sequencing experiments. We provide a software package named NGSrich to apply our methods in a high-throughput fashion as a part of an integrated laboratory pipeline (Fig. 1). As the near future will also bring sequencers of the second generation into smaller laboratories, the rapid analysis of the enrichment performance for different samples and enrichment techniques also needs to become achievable without the intensive help of bioinformatics experts.

Figure 1. Output reports produced by the NGSrich software after the analysis of 18 whole-exome-enriched human DNAs using three different technologies: (A) report for sample IL1 (Illumina TruSeq Exome) with per chromosome bar plots for a quick overview of the coverage across the genome and a pie chart giving the fractions of target regions covered to a particular average depth and (B) summary report across the total six exomes enriched with TruSeq technology.
As an intriguing application of our software, we show the results of a comparison study of the quality and features of three enrichment products, which have recently been launched by leading companies and established in many sequencing laboratories around the globe (Fig. 2).

Figure 2. The target enrichment performance for 18 human exomes compared across three different technologies using (A) the TPKM statistic, (B) the percentage of reads in the target regions or 100 bp upstream or downstream among those that can be aligned anywhere on the genome, and (C) the number of genes that are covered less than 2×.
Materials and Methods
Statistical Framework


To quantify the success of the enrichment procedure, we evaluate the percentage of target bases m ∈ Gj for which cm exceeds a certain threshold and compare the mean coverage C(Gj) across the target regions. In this perspective, genes with very low or very high coverage, that is, those which are most affected by the unevenness of the enrichment procedure across the exome or target region, are of particular interest. Furthermore, to give evidence on the fraction of data that is usable at all, we calculate the percentage of reads that align to any target region Gj. As many on-target reads usually map across the target boundaries, we include the reads aligning 100 bp upstream or downstream of the target region.
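The summary statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NGSrich implementation (which is written in Java): the function names, the per-base depth list, and the tuple representation of reads and targets are all hypothetical, and the threshold is applied as "at least" to match the ≥30× convention used in the tables.

```python
from statistics import mean

def coverage_summary(depths, threshold=30):
    """Mean coverage and percentage of target bases with depth >= threshold.

    `depths` holds one per-base depth c_m for every base m of a target G_j.
    """
    if not depths:
        raise ValueError("target region has no bases")
    covered = sum(1 for c in depths if c >= threshold)
    return mean(depths), 100.0 * covered / len(depths)

def is_on_target(read_start, read_end, target_start, target_end, flank=100):
    """True if the read overlaps the target extended by `flank` bp per side.

    Coordinates are assumed to be 0-based and half-open, as in BED files.
    """
    return read_start < target_end + flank and read_end > target_start - flank

def percent_on_target(reads, targets, flank=100):
    """Percentage of aligned reads overlapping any flank-extended target.

    `reads` and `targets` are lists of (chrom, start, end) tuples. The
    quadratic scan is acceptable for a toy example, not for real data.
    """
    hits = sum(
        1 for chrom, start, end in reads
        if any(chrom == t_chrom and is_on_target(start, end, t_start, t_end, flank)
               for t_chrom, t_start, t_end in targets)
    )
    return 100.0 * hits / len(reads)
```

For real data sets an interval index or a sorted sweep would replace the nested scan; the sketch only fixes the meaning of "on target ±100 bp".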
For a target interval Gj, we define an enrichment factor as the ratio C(Gj)/E between the average coverage in the respective target region [C(Gj) as defined above] and the average coverage expected without any enrichment (E = LR/g, where L is the read length, R is the number of reads, and g is the genome size in base pairs). We assess the uniformity over a particular target region graphically by histogram plots.
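As a worked illustration of the enrichment factor, the following Python sketch evaluates E = LR/g and the ratio C(Gj)/E. It assumes that R denotes the number of reads; the function names and the numbers in the usage note are hypothetical and chosen only for easy arithmetic.

```python
def expected_coverage(read_length, n_reads, genome_size):
    """E = L * R / g: mean depth if the reads were spread evenly over the genome."""
    return read_length * n_reads / genome_size

def enrichment_factor(mean_target_coverage, read_length, n_reads, genome_size):
    """Ratio C(G_j) / E between observed target coverage and expected coverage."""
    return mean_target_coverage / expected_coverage(read_length, n_reads, genome_size)
```

With 100 million reads of 100 bp and an assumed genome size of 3 × 10^9 bp, E is about 3.3×, so a target region with a mean coverage of 150× would have an enrichment factor of about 45.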

Implementation
The NGSrich pipeline was mainly written in Java and requires an installation of the Java Runtime Environment and the R environment for statistical computing. Details on the parameters are accessible at the Linux command line and in manual files distributed together with the package. The starting point for the NGSrich pipeline is the specification of the target regions as a BED file as typically provided by the kit manufacturers and a read alignment file in binary alignment/map (BAM) or sequence alignment/map (SAM) format, which now form a de facto standard for the storage of aligned DNA sequencing data [Li et al., 2009]. The output is given as an HTML-based summary report for convenient access with a standard web browser. In addition, target-specific coverage information given in BED format and wiggle files appropriate for visualization in a genome browser are placed in a subdirectory. As sequencing experiments on multiple samples are a very common application, a project summary report providing summary statistics and links to the sample-specific reports can be created with an additional module.
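To make the BED input concrete, here is a minimal Python sketch that reads target intervals and computes the total target size. It does not reflect how NGSrich parses its input; the function names are hypothetical, and merging overlapping intervals before summing is an assumption made so that bases listed twice are counted once.

```python
from collections import defaultdict

def read_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}, skipping header lines."""
    targets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        targets[chrom].append((int(start), int(end)))
    return targets

def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def target_size(targets):
    """Total number of target bases; BED is 0-based, half-open (end - start)."""
    return sum(end - start
               for intervals in targets.values()
               for start, end in merge_intervals(intervals))
```

A typical call would be `target_size(read_bed(open("targets.bed")))`, where `targets.bed` stands for a vendor-provided target file.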
Integration into a Computational Pipeline
We developed the NGSrich software as a part of a computational pipeline used for high-throughput analysis of target-enriched resequencing experiments (Fig. 3). A compute cluster and the ELANDv2 software are used to run fully automated reference alignments operated by a relational database supplied with sample and species information. Variants called with CASAVA v1.8, SAMtools v0.1.17, and Dindel v1.01 [Albers et al., 2011] are also organized in this database and evaluated with further (yet unpublished) analysis scripts. Detailed reports on the target enrichment performance produced by NGSrich and reports from basic quality checks of the raw data using FastQC (available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) are copied to a local web server. Through user management, access to these reports is restricted to a group of partners defined by the pipeline operators. Our analysis system has been successfully used for processing of more than 400 exomes and targeted resequencing studies [e.g., Huebner et al., 2011].

Figure 3. The NGSrich software forms an essential step in a whole-exome resequencing pipeline tailored to an individual hardware infrastructure of a medium-sized genomics center. The alignment and variant calling steps are run on a compute cluster using information from a relational database, and NGSrich and FastQC are used to assess quality parameters of the experiments. All data can be accessed over an integrated web server.
Software Availability
The NGSrich 64-bit binaries and source code can be accessed at http://sourceforge.net/projects/ngsrich/files under the General Public License (GPL3).
Next-Generation DNA Sequencing and Alignment
DNA was extracted from peripheral blood of 18 individuals and randomly fragmented using sonication technology (Covaris, Inc., Woburn, MA). The fragments were end repaired and adapters were ligated. After library preparation, the samples were submitted to the enrichment process. For six samples (AG1 to AG6), we used SureSelect Human AllExon 50Mb technology following the SureSelect Target Enrichment for Illumina Paired End Multiplexed Sequencing Protocol, version 1.2, January 2011 (Agilent Technologies, Santa Clara, CA). For another six samples (NG1 to NG6), we used the SeqCap EZ Human ExomeLibrary v2.0 kit following the NimbleGen SeqCap EZ Exome Library SR User's Guide, the Targeted Sequencing with NimbleGen SeqCap EZ Libraries, and Illumina TruSeq DNA Sample Preparation Kits Technical Note 2011 (Roche NimbleGen Inc., Madison, WI), whereas the last six samples (IL1 to IL6) were enriched using the TruSeq Exome Enrichment Kit following the TruSeq Enrichment Guide, Part #15013230 Rev. E June 2011 (Illumina Inc., San Diego, CA). For the Agilent SureSelect data, samples AG5 and AG6 were sequenced as part of a 6-plex library, which was loaded to four lanes of a flow cell, and the remaining samples AG1 to AG4 were sequenced on one lane each. Samples NG1 and NG2 were sequenced as part of a 4-plex library on three lanes, whereas the remaining samples enriched with the NimbleGen technology were sequenced as part of a 6-plex library on four lanes. The Illumina-enriched samples IL1 to IL6 were sequenced as a 6-plex library and loaded to four lanes. For all libraries, 100 bp were sequenced from both ends of the fragments on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA) sequencing instrument. The data were filtered according to signal purity by the Illumina real-time analysis software v1.8. Subsequently, the reads were mapped to the Human Genome reference build version hg19 using the ELANDv2 alignment algorithm on a multinode compute cluster.
Using CASAVA v1.8, PCR duplicates were filtered out and the output was converted to BAM format.
Results
Initially, we used our statistical framework and the NGSrich software on a high-performance compute cluster to process data from one human exome enriched by Illumina TruSeq technology. From this, we got a browsable, HTML-based output including target level details for convenient use in a server-client environment (Fig. 1). We also successfully used the software for evaluation of customized microarray-based capture assays (unpublished data).
Subsequently, we employed our software to do a systematic evaluation of three human exome enrichment techniques. Our comparison study comprises 18 exomes prepared using (1) the Agilent SureSelect Human AllExon 50Mb kits (Agilent Technologies), (2) Roche NimbleGen SeqCap EZ Human ExomeLibrary v2.0 technology (Roche NimbleGen Inc.), and (3) the Illumina TruSeq Exome Enrichment Kit. Sequencing of the respective libraries on an Illumina HiSeq 2000 sequencing instrument resulted in between 109.8 and 225.0 million reads after Illumina signal purity filtering, whereas the number of reads that could be aligned to the hg19 reference genome ranged between 64.0 and 180.5 million (Table 1).
| (A) | AG1 | AG2 | AG3 | AG4 | AG5 | AG6 |
|---|---|---|---|---|---|---|
| # Reads (purity filtered) | 225,009,312 | 211,175,642 | 213,538,894 | 217,112,592 | 194,748,944 | 187,923,012 |
| # Reads aligned | 179,770,884 | 174,950,182 | 177,255,570 | 180,457,086 | 135,669,836 | 127,382,464 |
| Percentage on target ±100 bp | 62.36 | 59.15 | 58.99 | 57.72 | 70.95 | 69.85 |
| Coverage mean | 181.19 | 167.42 | 169.35 | 167.67 | 154.72 | 142.52 |
| Coverage SD | 141.91 | 132.83 | 130.79 | 131.71 | 105.65 | 95.29 |
| Percentage covered ≥30× | 86.28 | 85.94 | 86.44 | 86.18 | 88.08 | 87.48 |
| TPKM | 12.05 | 11.43 | 11.40 | 11.15 | 13.71 | 13.50 |
| # Genes covered ≤2× | 142 | 138 | 131 | 130 | 130 | 134 |
| # Genes covered ≥200× | 7,141 | 5,181 | 5,263 | 5,046 | 3,081 | 2,124 |

| (B) | NG1 | NG2 | NG3 | NG4 | NG5 | NG6 |
|---|---|---|---|---|---|---|
| # Reads (purity filtered) | 213,433,760 | 184,253,150 | 140,951,508 | 150,525,670 | 133,251,350 | 154,577,208 |
| # Reads aligned | 153,953,510 | 124,548,136 | 102,485,108 | 95,384,682 | 69,631,082 | 127,825,842 |
| Percentage on target ±100 bp | 76.85 | 78.28 | 77.85 | 75.92 | 76.50 | 79.01 |
| Coverage mean | 211.36 | 175.11 | 143.54 | 129.76 | 95.94 | 183.05 |
| Coverage SD | 128.95 | 102.32 | 86.82 | 74.02 | 53.31 | 119.29 |
| Percentage covered ≥30× | 93.86 | 93.46 | 93.04 | 93.15 | 90.32 | 93.41 |
| TPKM | 17.37 | 17.70 | 17.60 | 17.16 | 17.30 | 17.86 |
| # Genes covered ≤2× | 131 | 132 | 117 | 130 | 134 | 125 |
| # Genes covered ≥200× | 9,147 | 5,072 | 1,312 | 581 | 106 | 5,981 |

| (C) | IL1 | IL2 | IL3 | IL4 | IL5 | IL6 |
|---|---|---|---|---|---|---|
| # Reads (purity filtered) | 198,899,062 | 174,905,824 | 118,359,918 | 111,670,074 | 122,712,392 | 109,805,482 |
| # Reads aligned | 121,015,490 | 107,662,514 | 72,118,600 | 64,008,828 | 73,591,594 | 64,613,284 |
| Percentage on target ±100 bp | 69.67 | 71.24 | 69.61 | 70.85 | 71.10 | 70.44 |
| Coverage mean | 111.35 | 101.05 | 66.48 | 59.75 | 68.98 | 60.15 |
| Coverage SD | 63.77 | 59.85 | 38.37 | 32.78 | 38.72 | 33.72 |
| Percentage covered ≥30× | 84.11 | 82.81 | 76.90 | 76.77 | 78.54 | 75.53 |
| TPKM | 11.19 | 11.44 | 11.18 | 11.37 | 11.42 | 11.31 |
| # Genes covered ≤2× | 50 | 47 | 71 | 63 | 62 | 73 |
| # Genes covered ≥200× | 568 | 411 | 118 | 53 | 116 | 79 |

- In the aligned reads count, PCR duplicate reads are already filtered out, which explains the big difference in count between raw and aligned reads.
Regarding the enrichment performance (Table 1), the average target coverage ranged from 141.64 to 178.48 (Agilent Technologies), from 93.03 to 205.23 (Roche NimbleGen Inc.), and from 59.16 to 110.25 (Illumina Inc.). The analysis also shows that the target is covered at least 30× in between 61.32% and 85.63% (Agilent Technologies), between 88.52% and 92.95% (Roche NimbleGen Inc.), and between 75.26% and 83.97% (Illumina Inc.) of the total target region length. The latter statistics highly depend on the total amount of sequence data available and the target region specifications and size, which are provided by the respective vendor. To address the variability in the number of sequences, we repeated the evaluation with a random selection of 60 million reads from each sample, which essentially confirmed that the Roche NimbleGen product performs better than the other assays (Supp. Table S1). However, as the read count generated for resequencing actually needs to account for the target size, this analysis obviously favors a product with smaller regions for target capture.
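The text does not say how the random selection of 60 million reads was drawn. One common way to draw a fixed-size uniform sample from a read stream of unknown length is reservoir sampling, sketched below in Python as an assumption rather than as the procedure used in this study. The function name is hypothetical; for paired-end data each item would have to be a read pair so that mates stay together.

```python
import random

def reservoir_sample(items, k, seed=None):
    """Uniform random sample of k items from an iterable, in one pass.

    Keeps the first k items, then replaces a random slot with
    probability k / (i + 1) for the item at 0-based index i.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = item
    return sample
```

Fixing `seed` makes the subsample reproducible, which helps when the same reduced data set is evaluated repeatedly.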
On the other hand, the percentage of reads that align in the target regions or 100 bp upstream or downstream ranged between 57.72% and 70.95% (Agilent Technologies), between 75.92% and 79.01% (Roche NimbleGen Inc.), and between 69.61% and 71.24% (Illumina Inc.). Using the TPKM statistic for a normalized comparison between different samples and library preparation techniques (see Materials and Methods), the average and standard deviation of the values are 12.2 ± 1.12 (Agilent Technologies), 17.5 ± 0.27 (Roche NimbleGen Inc.), and 11.32 ± 0.11 (Illumina Inc.). A gene-level exploration of the data shows that between 141 and 143 (Agilent Technologies), between 115 and 135 (Roche NimbleGen Inc.), and between 44 and 72 (Illumina Inc.) genes are covered below 2× (Fig. 2). Moreover, 34 genes were poorly enriched (covered below 2×) in all six samples prepared with Illumina technology, 101 in those prepared with Roche NimbleGen technology, and 101 in those prepared with the Agilent technique, again using average gene-wise coverage below 2× as a cutoff (Supp. Table S2). Taking together the three exome specifications, there are 20,925 RefSeq genes represented in total. Among these are 737 genes that are not included in the Agilent target (but in the other two specifications), 1,737 that are not on the Roche NimbleGen target, and 599 that are not on the Illumina target (Supp. Table S3). To rule out that the results are biased by low-quality alignments, which could be due to repetitiveness or structural peculiarities of the target sequences, we repeated our analysis using only those reads that have a mapping quality value of 30 or higher. We observed that the statistics are actually very robust against this (Supp. Table S4) and conclude that on a whole-exome level, the length of unmappable regions is rather small compared with the total length of the target sequences.
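The TPKM averages and standard deviations quoted above can be recomputed from the per-sample TPKM rows of Table 1. The short Python check below does this with the sample standard deviation (n − 1 denominator); that choice is an assumption, made because it reproduces the reported figures to rounding.

```python
from statistics import mean, stdev

# Per-sample TPKM values copied from Table 1.
tpkm_values = {
    "Agilent": [12.05, 11.43, 11.40, 11.15, 13.71, 13.50],
    "NimbleGen": [17.37, 17.70, 17.60, 17.16, 17.30, 17.86],
    "Illumina": [11.19, 11.44, 11.18, 11.37, 11.42, 11.31],
}

# Mean and sample standard deviation per enrichment product.
summary = {name: (mean(vals), stdev(vals)) for name, vals in tpkm_values.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} ± {sd:.2f}")
```

The small standard deviations for the NimbleGen and Illumina samples relative to their means are what the Discussion refers to when it calls TPKM robust across samples enriched by the same technology.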
The exome coverage is often considered sufficient when it is 30× or higher in at least 80% of the target sequence. Our data allow the conclusion that the amount of sequence needed to achieve this is 14 Gb for the Agilent technology, 5 Gb for Roche NimbleGen, and 9 Gb for the Illumina protocol (Table 2), implying that in a multiplex library on one HiSeq 2000 lane, two (Agilent Technologies), five (Roche NimbleGen Inc.), or three (Illumina Inc.) human exomes can be processed, depending on the enrichment product that is being used.
| Enrichment product | Target size (Mb) | Preparation time (days) | Sequence needed to cover 80% of target at least 30× (Gb) | Exomes per lane |
|---|---|---|---|---|
| Agilent Technologies | 50 | 3 | 14 | 2 |
| NimbleGen | 44 | 5 | 5 | 5 |
| Illumina | 62 | 4 | 9 | 3 |
- Using the Agilent SureSelect protocol, more sequence is needed to achieve sufficient coverage than for the other products. On the other hand, the Roche NimbleGen approach not only covers the smallest number of genes but also has the longest library preparation time and is less flexible than the other techniques.
To benchmark the results produced by our NGSrich software against a different approach, we generated a pileup version of the respective sample BAM files using the mpileup command of the SAMtools, version 0.1.17 [Li et al., 2009]. Subsequently, we evaluated the coverage information from the pileup file using basic statistical functions in R. Strikingly, this rather direct evaluation of coverage parameters matches the results produced by the software very closely (Table 3).
| (A) | AG1 | AG2 | AG3 | AG4 | AG5 | AG6 |
|---|---|---|---|---|---|---|
| Coverage mean | 177.23 | 162.63 | 164.37 | 162.88 | 151.66 | 139.68 |
| Coverage SD | 154.71 | 143.47 | 141.53 | 143.01 | 113.95 | 103.95 |
| Percentage covered ≥10× | 92.63 | 92.58 | 92.62 | 92.79 | 93.78 | 93.69 |
| Percentage covered ≥30× | 85.89 | 85.49 | 86.27 | 86.12 | 87.72 | 87.10 |

| (B) | NG1 | NG2 | NG3 | NG4 | NG5 | NG6 |
|---|---|---|---|---|---|---|
| Coverage mean | 207.16 | 171.48 | 140.49 | 127.21 | 93.73 | 179.37 |
| Coverage SD | 133.71 | 106.40 | 90.42 | 76.74 | 56.07 | 124.78 |
| Percentage covered ≥10× | 96.12 | 96.15 | 96.21 | 96.34 | 95.54 | 96.19 |
| Percentage covered ≥30× | 93.47 | 92.98 | 92.82 | 93.05 | 89.94 | 92.92 |

| (C) | IL1 | IL2 | IL3 | IL4 | IL5 | IL6 |
|---|---|---|---|---|---|---|
| Coverage mean | 106.14 | 96.16 | 63.19 | 56.59 | 65.47 | 57.21 |
| Coverage SD | 80.45 | 73.91 | 48.85 | 39.51 | 48.59 | 42.88 |
| Percentage covered ≥10× | 90.84 | 90.19 | 88.69 | 88.73 | 89.17 | 88.72 |
| Percentage covered ≥30× | 83.78 | 82.42 | 76.09 | 76.78 | 77.87 | 75.62 |

- The coverage statistics generated by this direct approach essentially confirm those produced by NGSrich. Using the software, mean and standard deviation are calculated over the mean values per target region, so a slightly aberrant value is to be expected.
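The direct evaluation from a pileup file can be sketched as follows. The authors used basic statistical functions in R; this Python version is an illustrative stand-in, with hypothetical function names. It assumes the pileup lines have already been restricted to target positions, that depth is the fourth tab-separated column (as in SAMtools pileup output), and that target positions missing from the pileup have depth 0. `stdev` uses the n − 1 denominator, like R's `sd()`.

```python
from statistics import mean, stdev

def depths_from_pileup(lines, target_length):
    """Per-base depths from pileup lines (chrom, pos, ref, depth, ...).

    Target positions absent from the pileup are padded with depth 0.
    """
    depths = [int(line.split("\t")[3]) for line in lines if line.strip()]
    if len(depths) > target_length:
        raise ValueError("more pileup positions than target bases")
    return depths + [0] * (target_length - len(depths))

def coverage_stats(depths, thresholds=(10, 30)):
    """Mean, sample SD, and percentage of bases covered at each threshold."""
    stats = {"mean": mean(depths), "sd": stdev(depths)}
    for t in thresholds:
        stats[f"pct_ge_{t}"] = 100.0 * sum(d >= t for d in depths) / len(depths)
    return stats

def mean_of_region_means(regions):
    """Average of per-region mean depths; short regions weigh as much as long ones."""
    return mean(mean(region) for region in regions)
```

The last function illustrates why the two approaches differ slightly: for two regions with per-base depths `[10, 10]` and `[40, 40, 40, 40]`, the per-base mean is 30, whereas the mean over region means is 25.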
Discussion
A fair cross-platform comparison needs to take into account that the definition of the exome and the total size of the target regions are variable across platforms. We have, therefore, proposed to normalize the amount of on-target sequence to a per-kilobase value of the target reference. To address variability in sequence cluster densities and other issues, we have also proposed to normalize by the total amount of sequence and express both in a statistic called TPKM (see Materials and Methods). This statistic measures the amount of sequence over the specified target region just as the RPKM statistic does for a particular transcript [Mortazavi et al., 2008]. The larger the target region, the smaller the TPKM is for a fixed percentage of reads aligning on target. Conversely, the larger the target region, the more on-target reads are needed to achieve the same TPKM value for a fixed total number of reads. The results produced by our software for the 18 exomes show that the TPKM value is very robust across samples enriched by the same technology and is therefore suitable as a characteristic measure of a particular technique.
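The defining formula for TPKM is not reproduced in this text. The Python sketch below assumes, by analogy with RPKM and with the two normalizations described above, that TPKM is the number of on-target reads per kilobase of target per million aligned reads. The function name and argument names are hypothetical.

```python
def tpkm(on_target_reads, total_aligned_reads, target_size_bp):
    """On-target reads per kilobase of target per million aligned reads.

    Assumed form, by analogy with RPKM:
        10^9 * on_target_reads / (total_aligned_reads * target_size_bp)
    """
    return on_target_reads * 1e9 / (total_aligned_reads * target_size_bp)
```

Under this assumption, 76.85% of reads on target with a 44 Mb target gives `tpkm(768_500, 1_000_000, 44_000_000)`, about 17.5, close to the 17.37 reported for sample NG1; the remaining difference would be expected from using the rounded target size of Table 2. Doubling the target size at a fixed on-target percentage halves the value, which is the dependence on target size described above.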
The observation that the TPKM values are clearly larger for the NimbleGen technology than for the other two approaches reflects in part that this protocol has the smallest total length of target regions. However, the percentage of on-target reads and other statistics also show that this approach actually performs better than the other techniques under consideration. In contrast, the smallest number of genes covered insufficiently (below 2×) is achieved by the Illumina TruSeq exome, although this technology has the largest total target region length. On the other hand, the per-sample time and cost needed for library preparation will always contribute to a decision for using a specific technology. Although the Agilent SureSelect assay (Agilent Technologies) has the poorest and least robust enrichment performance, it is comparatively fast, and the handling and training for the workflow are easy, especially for small projects. For the NimbleGen technology, the performance is superior to the other assays, but this method has the longest preparation time and is therefore less flexible than the others. Also, the size of the target region and thus the gene content is smaller than for the other assays. The Illumina technology provides the easiest workflow for large-scale projects because the samples of a multiplex library can be pooled beforehand and, consequently, the enrichment process has to be run only once. Finally, the number of exomes that can be processed on one lane has direct impact on the sequencing cost, regardless of the price paid per gigabase.
The performance of the enrichment process also depends on the structure of the target sequence, for example, repetitiveness, GC content, and other issues. The general observation that a nonnegligible fraction of the genes covered insufficiently in a specific sample is poorly enriched also in most of the other samples processed by the same product (Supp. Table S2) suggests that failure of a specific subset of the target can be considered an assay-specific characteristic. Because the starting point of our software tool is the vendor-provided target specification file, our study can only measure the assay performances against the description of the vendors rather than using the exact coordinates of the hybridization targets. The way target regions are declared is highly variable between the products. Regarding this, a reanalysis with respect to a reference exome downloaded from the current University of California at Santa Cruz (UCSC) annotation database showed that due to its more comprehensive capture regions, the Illumina TruSeq exome has the largest fraction of this reference exome covered to at least 30× coverage (Supp. Table S5), whereas many genes are missed with the NimbleGen in-solution assay (Roche NimbleGen Inc.). This also suggests that the target definition is a very important point to consider when choosing a particular enrichment product. In addition, the way in which the performance depends on structural peculiarities of the underlying sequences could be subjected to further studies. The sequence capture methods reported here have recently been compared in several other studies, all of which concluded that the Roche NimbleGen SeqCap (Roche NimbleGen Inc.) produces a higher percentage of reads that can be aligned on target than the other assays do [Asan et al., 2011; Clark et al., 2011; Parla et al., 2011; Sulonen et al., 2011]. Clark et al. (2011) also mentioned that the Illumina platform is the only one to contain untranslated regions (UTRs) and hypothesized that the superior performance of the Roche NimbleGen assay (Roche NimbleGen Inc.) is due to its higher bait density.
Our NGSrich software meets the very common demand for a detailed, summarized, and exon-wise analysis of the target enrichment performance of next-generation sequencing libraries. Combined with a scriptable operation mode and HTML-based output reports for efficient integration into a web server, our software is capable of being used with extremely high throughput. With the publication of our software, we have closed an essential gap in the availability of software tools needed to build fast and efficient computational pipeline systems for resequencing experiments.