Informatics

Assessing the Enrichment Performance in Targeted Resequencing Experiments^†

Peter Frommolt (corresponding author)

Cologne Cluster of Excellence on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Universität zu Köln, Cologne, Germany

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Ali T. Abdallah

Cologne Cluster of Excellence on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Universität zu Köln, Cologne, Germany

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Janine Altmüller

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Susanne Motameny

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Holger Thiele

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Christian Becker

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Kathryn Stemshorn

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Matthias Fischer

Klinik und Poliklinik für Kinderheilkunde, Abteilung für Pädiatrische Onkologie und Hämatologie, Klinikum der Universität zu Köln, Cologne, Germany

Zentrum für Molekulare Medizin Köln (ZMMK), Universität zu Köln, Cologne, Germany

Tobias Freilinger

Neurologische Klinik und Poliklinik, Klinikum Großhadern, München, Germany

Peter Nürnberg

Cologne Cluster of Excellence on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Universität zu Köln, Cologne, Germany

Cologne Center for Genomics (CCG), Universität zu Köln, Cologne, Germany

Zentrum für Molekulare Medizin Köln (ZMMK), Universität zu Köln, Cologne, Germany

First published: 30 January 2012

https://doi.org/10.1002/humu.22036

^† Communicated by Graham R. Taylor

Abstract

Target enrichment strategies are a very common approach to sequence a predefined part of an individual's genome using second-generation sequencing technologies. While highly dependent on the technology and the target sequences selected, the performance of the various assays also varies between samples and is influenced by the way the libraries are handled in the laboratory. Here, we show how to obtain detailed information about the enrichment performance using a novel software package called NGSrich, which we developed as part of a whole-exome resequencing pipeline in a medium-sized genomics center. Our software is suitable for high-throughput use and the results can be shared using HTML and a web server. Finally, we have sequenced exome-enriched DNA libraries of 18 human individuals using three different enrichment products and used our new software for a comparative analysis of their performance. Hum Mutat 33:635–641, 2012. © 2012 Wiley Periodicals, Inc.

Introduction

Since next-generation sequencing technologies were launched just a few years ago, systematic resequencing of the human genome in larger cohorts has found its way into small-sized and medium-sized research centers. As whole-genome sequencing is still comparatively expensive, target enrichment strategies are currently a very common technique in ongoing resequencing studies. In traditional Sanger sequencing, the target regions on the genome used to be enriched by PCR, which is not feasible in terms of cost and labor input when larger genomic regions are to be selected. Microdroplet-based PCR enrichment [Tewhey et al., 2009] enables a larger panel of DNA target regions to be selected, although this technique is still far from being suitable for exome-scale studies. Higher throughput can be achieved, for example, through enrichment by hybridization on high-density microarrays [Albert et al., 2007] or in-solution hybridization [Gnirke et al., 2009], enabling whole-exome capture and varying degrees of automation. Each of these technologies has its issues and technical peculiarities, which have been reviewed by Mamanova et al. (2010). Array-based and in-solution capture methods of the previous generation were compared in a recent study [Kiialainen et al., 2011].

As the performance of each of the aforementioned assays is also highly variable from sample to sample, enrichment parameters are very important quality criteria to be assessed after sequencing. However, there is no out-of-the-box software solution to this computationally intensive task, so laboratories define their own standards and invest effort in developing software that often remains at an immature stage. To fill this gap, we have developed a methodological standard for the evaluation of the performance of target enrichment in next-generation sequencing experiments. We provide a software package named NGSrich to apply our methods in a high-throughput fashion as part of an integrated laboratory pipeline (Fig. 1). As the near future will also bring sequencers of the second generation into smaller laboratories, the rapid analysis of the enrichment performance for different samples and enrichment techniques also needs to become achievable without the intensive help of bioinformatics experts.

Figure 1. Output reports produced by the NGSrich software after the analysis of 18 whole-exome-enriched human DNAs using three different technologies: (A) report for sample IL1 (Illumina TruSeq Exome) with per chromosome bar plots for a quick overview of the coverage across the genome and a pie chart giving the fractions of target regions covered to a particular average depth and (B) summary report across the total six exomes enriched with TruSeq technology.

As an intriguing application of our software, we show the results of a comparison study of the quality and features of three enrichment products, which have recently been launched by leading companies and established in many sequencing laboratories around the globe (Fig. 2).

Materials and Methods

Statistical Framework

A target region specification is usually given as a browser extensible data (BED) file containing multiple lines or target regions per gene. The initial step of our approach to target enrichment evaluation is to scan this file for overlapping regions and merge any two of them into one contiguous target line. For a formal description, we first denote the total number of reads aligned to the genome by R and the number of reads covering a genomic position k by c_k. Considering a particular gene G represented by multiple target intervals G₁, …, G_n each of which typically comprises one exon, we denote the average coverage of a particular target region G_i by

$$C(G_i) = \frac{1}{|G_i|} \sum_{k \in G_i} c_k$$

and the average coverage of gene G by

$$C(G) = \frac{1}{\sum_{i=1}^{n} |G_i|} \sum_{i=1}^{n} \sum_{k \in G_i} c_k,$$

where |G_i| denotes the length of target interval G_i in base pairs.

To quantify the success of the enrichment procedure, we evaluate the percentage of target bases m ∈ G_j for which c_m exceeds a certain threshold and compare the mean coverage C(G_j) across the target regions. In this perspective, genes with very low or very high coverage, that is, those which are most affected by the unevenness of the enrichment procedure across the exome or target region, are of particular interest. Furthermore, to give evidence on the fraction of data that is usable at all, we calculate the percentage of reads that align to any target region G_j. As many on-target reads usually map across the target boundaries, we include the reads aligning 100 bp upstream or downstream of the target region.
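As a minimal sketch of the quantities described above (not the NGSrich implementation, which is written in Java), the following Python snippet merges overlapping target intervals and computes the average coverage per target region, the average coverage per gene, and the fraction of target bases reaching a coverage threshold. The function names, the half-open BED-style intervals, the dictionary `depth` standing in for c_k, and the per-base averaging at the gene level are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome, as in BED


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into contiguous target regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous target: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mean_coverage(depth: Dict[int, int], region: Interval) -> float:
    """Average coverage C(G_i) of one target region; missing positions count as 0."""
    start, end = region
    return sum(depth.get(k, 0) for k in range(start, end)) / (end - start)


def gene_coverage(depth: Dict[int, int], regions: List[Interval]) -> float:
    """Average coverage C(G) of a gene, taken per base over all its regions (assumption)."""
    total_bases = sum(end - start for start, end in regions)
    total_depth = sum(depth.get(k, 0) for s, e in regions for k in range(s, e))
    return total_depth / total_bases


def fraction_covered(depth: Dict[int, int], regions: List[Interval], threshold: int) -> float:
    """Fraction of target bases with coverage of at least `threshold`."""
    bases = [depth.get(k, 0) for s, e in regions for k in range(s, e)]
    return sum(d >= threshold for d in bases) / len(bases)


# Toy example: two overlapping exon lines and one separate exon.
regions = merge_intervals([(0, 4), (2, 6), (10, 12)])
depth = {0: 10, 1: 10, 2: 40, 3: 40, 4: 40, 5: 40, 11: 20}
print(regions)
print(mean_coverage(depth, regions[0]), gene_coverage(depth, regions))
print(fraction_covered(depth, regions, 30))
```

The threshold test uses "at least" (>=) to match the "≥30×" rows reported in the tables; a strict comparison would be an equally defensible reading of "exceeds".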

For a target interval G_j, we define an enrichment factor as the ratio C(G_j)/E between the average coverage in the respective target region [C(G_j) as defined above] and the average coverage expected without any enrichment (E = LR/g, where L is the read length, and g is the genome size in base pairs). We assess the uniformity over a particular target region graphically by histogram plots.
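The enrichment factor can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the genome size of 3.1 × 10⁹ bp is a rough figure for hg19 that is not given in the text, and the exome-wide mean coverage of sample AG1 from Table 1 is used in place of the coverage of a single target interval.

```python
def expected_coverage(read_length: int, aligned_reads: int, genome_size: int) -> float:
    """Average coverage expected without enrichment, E = L * R / g."""
    return read_length * aligned_reads / genome_size


def enrichment_factor(mean_target_coverage: float, read_length: int,
                      aligned_reads: int, genome_size: int) -> float:
    """Enrichment factor C(G_j) / E for one target interval."""
    return mean_target_coverage / expected_coverage(read_length, aligned_reads, genome_size)


# Sample AG1 (Table 1): 100-bp reads, 179,770,884 aligned reads, mean coverage 181.19.
GENOME_SIZE = 3_100_000_000  # assumed approximate hg19 size in bp
e_ag1 = expected_coverage(100, 179_770_884, GENOME_SIZE)
factor_ag1 = enrichment_factor(181.19, 100, 179_770_884, GENOME_SIZE)
print(round(e_ag1, 2), round(factor_ag1, 1))
```

Under these assumptions the expected coverage without enrichment is just under 6×, so the observed mean coverage corresponds to an enrichment of roughly 30-fold.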

To compare the performance of different samples and technologies, we propose calculating the number of reads aligning on the target region per kilobase of target sequence normalized by the total number of reads aligning to the genome, that is,

$$\mathrm{TPKM} = \frac{10^{9} \cdot R_t}{g_t \cdot R},$$

where R_t is the number of reads on target, and g_t is the length of the target region in base pairs. As before, R_t includes the reads aligning 100 bp upstream or downstream of the target regions. Following the terminology of an approach to transcript quantification in RNA-Seq (Reads per Kilobase and Million mappable reads, “RPKM”) proposed by Mortazavi et al. (2008), we call this ratio TPKM (on-Target reads Per Kilobase target region and Million mappable reads).
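The TPKM statistic can be illustrated with a short Python sketch. It assumes that the definition reads as on-target reads divided by the target length in kilobases and by the aligned reads in millions; the function name `tpkm` is hypothetical, and the on-target read counts are back-calculated from the rounded percentages and target sizes in Table 1, so the results only approximate the tabulated values.

```python
def tpkm(on_target_reads: float, target_length_bp: float, aligned_reads: float) -> float:
    """On-target reads per kilobase of target region and million mappable reads."""
    return on_target_reads / (target_length_bp / 1e3) / (aligned_reads / 1e6)


# (aligned reads, fraction on target +/-100 bp, target size in bp) from Table 1
samples = {
    "AG1": (179_770_884, 0.6236, 51.8e6),
    "NG1": (153_953_510, 0.7685, 44.2e6),
    "IL1": (121_015_490, 0.6967, 62.3e6),
}

results = {
    name: tpkm(fraction * aligned, target_bp, aligned)
    for name, (aligned, fraction, target_bp) in samples.items()
}
for name, value in results.items():
    print(name, round(value, 2))
```

Because the total number of aligned reads cancels once the on-target fraction is fixed, the statistic depends only on that fraction and on the target size, which is why it is comparable across samples sequenced to different depths.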

Implementation

The NGSrich pipeline was mainly written in Java and requires an installation of the Java Runtime Environment and the R package for statistical computing. Details on the parameters are accessible at the Linux command line and in manual files distributed together with the package. The starting point for the NGSrich pipeline is the specification of the target regions as a BED file as typically provided by the kit manufacturers and a read alignment file in binary alignment/map (BAM) or sequence alignment/map (SAM) format, which now form a de facto standard for the storage of aligned DNA sequencing data [Li et al., 2009]. The output is given as an HTML-based summary report for convenient access with a standard web browser. In addition, target-specific coverage information given in BED format and wiggle files appropriate for visualization in a genome browser are placed in a subdirectory. As sequencing experiments on multiple samples are a very common application, a project summary report providing summary statistics and links to the sample-specific reports can be created with an additional module.

Integration into a Computational Pipeline

We developed the NGSrich software as a part of a computational pipeline used for high-throughput analysis of target-enriched resequencing experiments (Fig. 3). A compute cluster and the ELANDv2 software are used to run fully automated reference alignments operated by a relational database supplied with sample and species information. Variants called with CASAVA v1.8, SAMtools v0.1.17, and Dindel v1.01 [Albers et al., 2011] are also organized in this database and evaluated with further (yet unpublished) analysis scripts. Detailed reports on the target enrichment performance produced by NGSrich and reports from basic quality checks of the raw data using FastQC (available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) are copied to a local web server. Using an efficient user management, these can be accessed by a predetermined group of partners defined by the pipeline operators. Our analysis system has been successfully used for processing of more than 400 exomes and targeted resequencing studies [e.g., Huebner et al., 2011].

Software Availability

The NGSrich 64-bit binaries and source code can be accessed at http://sourceforge.net/projects/ngsrich/files under the General Public License (GPL3).

Next-Generation DNA Sequencing and Alignment

DNA was extracted from peripheral blood of 18 individuals and randomly fragmented using sonication technology (Covaris, Inc., Woburn, MA). The fragments were end repaired and adapters were ligated. After library preparation, the samples were submitted to the enrichment process. For six samples (AG1 to AG6), we used SureSelect Human AllExon 50Mb technology following the SureSelect Target Enrichment for Illumina Paired End Multiplexed Sequencing Protocol, version 1.2, January 2011 (Agilent Technologies, Santa Clara, CA). For another six samples (NG1 to NG6), we used the SeqCap EZ Human ExomeLibrary v2.0 kit following the NimbleGen SeqCap EZ Exome Library SR User's Guide, the Targeted Sequencing with NimbleGen SeqCap EZ Libraries, and Illumina TruSeq DNA Sample Preparation Kits Technical Note 2011 (Roche NimbleGen Inc., Madison, WI), whereas the last six samples (IL1 to IL6) were enriched using the TruSeq Exome Enrichment Kit following the TruSeq Enrichment Guide, Part #15013230 Rev. E June 2011 (Illumina Inc., San Diego, CA). For the Agilent SureSelect data, samples AG5 and AG6 were sequenced as part of a 6-plex library, which was loaded to four lanes of a flow cell, and the remaining samples AG1 to AG4 were sequenced on one lane each. Samples NG1 and NG2 were sequenced as part of a 4-plex library on three lanes, whereas the remaining samples enriched with the NimbleGen technology were sequenced as part of a 6-plex library on four lanes. The Illumina-enriched samples IL1 to IL6 were sequenced as a 6-plex library and loaded to four lanes. For all libraries, 100 bp were sequenced from both ends of the fragments on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA) sequencing instrument. The data were filtered according to signal purity by the Illumina realtime analysis software v1.8. Subsequently, the reads were mapped to the Human Genome reference build version hg19 using the ELANDv2 alignment algorithm on a multinode compute cluster. Using CASAVA v1.8, PCR duplicates were filtered out and the output was converted to BAM format.

Results

Initially, we used our statistical framework and the NGSrich software on a high-performance compute cluster to process data from one human exome enriched by Illumina TruSeq technology. From this, we got a browsable, HTML-based output including target level details for convenient use in a server-client environment (Fig. 1). We also successfully used the software for evaluation of customized microarray-based capture assays (unpublished data).

Subsequently, we employed our software to do a systematic evaluation of three human exome enrichment techniques. Our comparison study comprises 18 exomes prepared using (1) the Agilent SureSelect Human AllExon 50Mb kits (Agilent Technologies), (2) Roche NimbleGen SeqCap EZ Human ExomeLibrary v2.0 technology (Roche NimbleGen Inc.), and (3) the Illumina TruSeq Exome Enrichment Kit. Sequencing of the respective libraries on an Illumina HiSeq 2000 sequencing instrument resulted in between 109.8 and 225.0 million reads after Illumina signal purity filtering, whereas the number of reads that could be aligned to the hg19 reference genome ranged between 64.0 and 180.5 million (Table 1).

Table 1. Performance of the Target Enrichment Using (A) Agilent SureSelect 50 Mb (51.8 Mb Target Size), (B) NimbleGen In-Solution Version 2 (44.2 Mb Target Size), and (C) Illumina TruSeq Kit (62.3 Mb Target Size)

		AG1	AG2	AG3	AG4	AG5	AG6
(A)	# Reads (purity filtered)	225,009,312	211,175,642	213,538,894	217,112,592	194,748,944	187,923,012
	# Reads aligned	179,770,884	174,950,182	177,255,570	180,457,086	135,669,836	127,382,464
	Percentage on target ±100 bp	62.36	59.15	58.99	57.72	70.95	69.85
	Coverage mean	181.19	167.42	169.35	167.67	154.72	142.52
	Coverage SD	141.91	132.83	130.79	131.71	105.65	95.29
	Percentage covered ≥30×	86.28	85.94	86.44	86.18	88.08	87.48
	TPKM	12.05	11.43	11.40	11.15	13.71	13.50
	# Genes covered ≤2×	142	138	131	130	130	134
	# Genes covered ≥200×	7,141	5,181	5,263	5,046	3,081	2,124
		NG1	NG2	NG3	NG4	NG5	NG6
(B)	# Reads (purity filtered)	213,433,760	184,253,150	140,951,508	150,525,670	133,251,350	154,577,208
	# Reads aligned	153,953,510	124,548,136	102,485,108	95,384,682	69,631,082	127,825,842
	Percentage on target ±100 bp	76.85	78.28	77.85	75.92	76.50	79.01
	Coverage mean	211.36	175.11	143.54	129.76	95.94	183.05
	Coverage SD	128.95	102.32	86.82	74.02	53.31	119.29
	Percentage covered ≥30×	93.86	93.46	93.04	93.15	90.32	93.41
	TPKM	17.37	17.70	17.60	17.16	17.30	17.86
	# Genes covered ≤2×	131	132	117	130	134	125
	# Genes covered ≥200×	9,147	5,072	1,312	581	106	5,981
		IL1	IL2	IL3	IL4	IL5	IL6
(C)	# Reads (purity filtered)	198,899,062	174,905,824	118,359,918	111,670,074	122,712,392	109,805,482
	# Reads aligned	121,015,490	107,662,514	72,118,600	64,008,828	73,591,594	64,613,284
	Percentage on target ±100 bp	69.67	71.24	69.61	70.85	71.10	70.44
	Coverage mean	111.35	101.05	66.48	59.75	68.98	60.15
	Coverage SD	63.77	59.85	38.37	32.78	38.72	33.72
	Percentage covered ≥30×	84.11	82.81	76.9	76.77	78.54	75.53
	TPKM	11.19	11.44	11.18	11.37	11.42	11.31
	# Genes covered ≤2×	50	47	71	63	62	73
	# Genes covered ≥200×	568	411	118	53	116	79

In the aligned reads count, PCR duplicate reads are already filtered out, which explains the big difference in count between raw and aligned reads.

Regarding the enrichment performance (Table 1), the average target coverage ranged from 141.64 to 178.48 (Agilent Technologies), from 93.03 to 205.23 (Roche NimbleGen Inc.), and from 59.16 to 110.25 (Illumina Inc.). The analysis also shows that the target is covered at least 30× in between 61.32% and 85.63% (Agilent Technologies), between 88.52% and 92.95% (Roche NimbleGen Inc.), and between 75.26% and 83.97% (Illumina Inc.) of the total target region length. The latter statistics highly depend on the total amount of sequence data available and the target region specifications and size, which are provided by the respective vendor. To address the variability in the number of sequences, we repeated the evaluation with a random selection of 60 million reads from each sample, which essentially confirmed that the Roche NimbleGen product performs better than the other assays (Supp. Table S1). However, as the read count generated for resequencing actually needs to account for the target size, this analysis obviously favors a product with smaller regions for target capture.

On the other hand, the percentage of reads that align in the target regions or 100 bp upstream or downstream ranged between 57.72% and 70.95% (Agilent Technologies), between 75.92% and 79.01% (Roche NimbleGen Inc.), and between 69.61% and 71.24% (Illumina Inc.). Using the TPKM statistic for a normalized comparison between different samples and library preparation techniques (see Materials and Methods), the average and standard deviation of the values are 12.2 ± 1.12 (Agilent Technologies), 17.5 ± 0.27 (Roche NimbleGen Inc.), and 11.32 ± 0.11 (Illumina Inc.). A gene-level exploration of the data shows that between 141 and 143 (Agilent Technologies), between 115 and 135 (Roche NimbleGen Inc.), and between 44 and 72 (Illumina Inc.) genes are covered below 2× (Fig. 2). Moreover, 34 genes were poorly enriched (covered below 2×) in all six samples prepared with Illumina technology, 101 in those prepared with Roche NimbleGen technology, and 101 in those prepared with Agilent technology, again using average gene-wise coverage below 2× as a cutoff (Supp. Table S2). Taking together the three exome specifications, there are 20,925 RefSeq genes represented in total. Among these are 737 genes that are not included in the Agilent target (but in the other two specifications), 1,737 that are not on the Roche NimbleGen target, and 599 that are not on the Illumina target (Supp. Table S3). To exclude that the results are biased by low-quality alignments, which could be due to repetitiveness or structural peculiarities of the target sequences, we repeated our analysis using only those reads that have a mapping quality value of 30 or higher. We observed that the statistics are actually very robust against this (Supp. Table S4) and conclude that on a whole-exome level, the length of unmappable regions is rather small compared with the total length of the target sequences.
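The per-technology summaries of the TPKM statistic can be recomputed from the per-sample values in Table 1. The following Python sketch does this with the standard library; using the sample standard deviation (n − 1 in the denominator) is an assumption, chosen because it reproduces the reported figures.

```python
import statistics

# Per-sample TPKM values as listed in Table 1.
tpkm_values = {
    "Agilent": [12.05, 11.43, 11.40, 11.15, 13.71, 13.50],
    "NimbleGen": [17.37, 17.70, 17.60, 17.16, 17.30, 17.86],
    "Illumina": [11.19, 11.44, 11.18, 11.37, 11.42, 11.31],
}

summary = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in tpkm_values.items()
}
for name, (mean, sd) in summary.items():
    print(f"{name}: {mean:.2f} ± {sd:.2f}")
```

The small standard deviations for the NimbleGen and Illumina samples, compared with the Agilent samples, are what the text refers to when it describes the statistic as robust within a technology.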

The exome coverage is often considered sufficient when it is 30× or higher in at least 80% of the target sequence. Our data allow the conclusion that the amount of sequence needed to achieve this is 14 Gb for the Agilent technology, 5 Gb for Roche NimbleGen, and 9 Gb for the Illumina protocol (Table 2), implying that in a multiplex library on one HiSeq 2000 lane, two (Agilent Technologies), five (Roche NimbleGen Inc.), or three (Illumina Inc.) human exomes can be processed, depending on the enrichment product that is being used.

Table 2. Some Characteristics of the Enrichment Assays

	Target size (Mb)	Preparation time (days)	Sequence needed to cover 80% of target at ≥30× (Gb)	Exomes per lane
Agilent Technologies	50	3	14	2
NimbleGen	44	5	5	5
Illumina	62	4	9	3

Using the Agilent SureSelect protocol, more sequence is needed to achieve sufficient coverage than for the other products. On the other hand, the Roche NimbleGen approach not only covers the smallest number of genes but also has the longest library preparation time and is less flexible than the other techniques.

To benchmark the results produced by our NGSrich software against a different approach, we generated a pileup version of the respective sample BAM files using the mpileup command of the SAMtools, version 0.1.17 [Li et al., 2009]. Subsequently, we evaluated the coverage information from the pileup file using basic statistical functions in R. Strikingly, this rather direct evaluation of coverage parameters matches the results produced by the software very closely (Table 3).

Table 3. Repeated Analysis of the Target Enrichment Performance Using the Pileup Command of the SAMtools, Version 0.1.17, with Subsequent Evaluation of the Coverage Column by Standard Commands in R for (A) the Agilent SureSelect 50 Mb, (B) the NimbleGen In-Solution Version 2, and (C) for the Illumina TruSeq Samples

		AG1	AG2	AG3	AG4	AG5	AG6
(A)	Coverage mean	177.23	162.63	164.37	162.88	151.66	139.68
	Coverage SD	154.71	143.47	141.53	143.01	113.95	103.95
	Percentage covered ≥10×	92.63	92.58	92.62	92.79	93.78	93.69
	Percentage covered ≥30×	85.89	85.49	86.27	86.12	87.72	87.10
		NG1	NG2	NG3	NG4	NG5	NG6
(B)	Coverage mean	207.16	171.48	140.49	127.21	93.73	179.37
	Coverage SD	133.71	106.40	90.42	76.74	56.07	124.78
	Percentage covered ≥10×	96.12	96.15	96.21	96.34	95.54	96.19
	Percentage covered ≥30×	93.47	92.98	92.82	93.05	89.94	92.92
		IL1	IL2	IL3	IL4	IL5	IL6
(C)	Coverage mean	106.14	96.16	63.19	56.59	65.47	57.21
	Coverage SD	80.45	73.91	48.85	39.51	48.59	42.88
	Percentage covered ≥10×	90.84	90.19	88.69	88.73	89.17	88.72
	Percentage covered ≥30×	83.78	82.42	76.09	76.78	77.87	75.62

The coverage statistics generated by this direct approach essentially confirm those produced by NGSrich. Using the software, mean and standard deviation are calculated over the mean values per target region, so a slightly aberrant value is to be expected.
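The direct evaluation was carried out in R; as a rough Python equivalent, the sketch below parses pileup text (of which only the first four tab-separated columns, chromosome, 1-based position, reference base, and depth, are read) and summarizes coverage over a set of BED-style targets. The function names and the toy data are hypothetical. Positions absent from the pileup are counted as zero coverage, and the last entry illustrates why a mean taken over per-region means can differ from the per-base mean.

```python
import statistics
from typing import Dict, Iterable, List, Tuple

Target = Tuple[str, int, int]  # chromosome, 0-based start, end (BED convention)


def read_pileup_depth(lines: Iterable[str]) -> Dict[Tuple[str, int], int]:
    """Map (chromosome, 1-based position) to the depth given in column 4."""
    depth: Dict[Tuple[str, int], int] = {}
    for line in lines:
        chrom, pos, _ref, cov = line.rstrip("\n").split("\t")[:4]
        depth[(chrom, int(pos))] = int(cov)
    return depth


def depths_per_target(depth: Dict[Tuple[str, int], int],
                      targets: List[Target]) -> List[List[int]]:
    """Collect the depth of every target base; BED start+1..end are the 1-based positions."""
    return [[depth.get((chrom, pos), 0) for pos in range(start + 1, end + 1)]
            for chrom, start, end in targets]


def summarize(per_target: List[List[int]]) -> Dict[str, float]:
    """Per-base coverage statistics plus the mean over per-region means."""
    per_base = [d for region in per_target for d in region]
    region_means = [statistics.mean(region) for region in per_target]
    return {
        "mean_per_base": statistics.mean(per_base),
        "sd_per_base": statistics.stdev(per_base),
        "pct_ge_10": 100.0 * sum(d >= 10 for d in per_base) / len(per_base),
        "pct_ge_30": 100.0 * sum(d >= 30 for d in per_base) / len(per_base),
        "mean_of_region_means": statistics.mean(region_means),
    }


# Toy pileup: a well-covered 4-bp target and a poorly covered 2-bp target.
pileup = [
    "chr1\t1\tA\t40\t.\tI",
    "chr1\t2\tC\t40\t.\tI",
    "chr1\t3\tG\t40\t.\tI",
    "chr1\t4\tT\t40\t.\tI",
    "chr1\t11\tA\t10\t.\tI",
]
targets = [("chr1", 0, 4), ("chr1", 10, 12)]
stats = summarize(depths_per_target(read_pileup_depth(pileup), targets))
print(stats)
```

In the toy data the short, poorly covered target pulls the mean of region means well below the per-base mean, which mirrors the direction of the small differences between Table 1 and Table 3 only in principle; the real targets are far more numerous and more even in length.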

Discussion

A fair cross-platform comparison needs to take into account that the definition of the exome and the total size of the target regions are variable across platforms. We have, therefore, proposed to normalize the amount of on-target sequence to a per-kilobase value of the target reference. To address variability in sequence cluster densities and other issues, we have also proposed to normalize by the total amount of sequence and express both in a statistic called TPKM (see Materials and Methods). This statistic measures the amount of sequence over the specified target region just as the RPKM statistic does for a particular transcript [Mortazavi et al., 2008]. The larger the target region, the smaller the TPKM is for a fixed percentage of reads aligning on target. Conversely, the larger the target region, the more on-target reads are needed to achieve the same TPKM value for a fixed total number of reads. Regarding the results produced by our software for the 18 exomes, it turns out that the TPKM value is very robust across samples enriched by the same technology and therefore suitable as a characteristic measure of a particular technique.
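The two statements about target size can be made explicit with a short worked calculation. It assumes that the TPKM definition in Materials and Methods reads as on-target reads R_t per kilobase of target (g_t in base pairs) and per million aligned reads R; the on-target fraction f and the value of 70% are introduced here purely for illustration, while the two target sizes are those of the NimbleGen and Illumina kits in Table 1.

```latex
\[
\mathrm{TPKM} = \frac{10^{9}\, R_t}{g_t\, R} = \frac{10^{9}\, f}{g_t},
\qquad f = \frac{R_t}{R}
\]
\[
f = 0.70:\quad
g_t = 44.2\ \mathrm{Mb} \;\Rightarrow\; \mathrm{TPKM} \approx 15.8,
\qquad
g_t = 62.3\ \mathrm{Mb} \;\Rightarrow\; \mathrm{TPKM} \approx 11.2
\]
\[
R_t = \frac{\mathrm{TPKM} \cdot g_t \cdot R}{10^{9}}
\]
```

The first line shows that for a fixed on-target fraction the statistic is inversely proportional to the target size; the last line shows that for a fixed TPKM and a fixed number of aligned reads the required number of on-target reads grows linearly with the target size.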

The observation that the TPKM values are clearly larger for the NimbleGen technology than for the other two approaches reflects in part that this protocol has the smallest total length of target regions. However, the percentage of on-target reads and other statistics also show that this approach actually performs better than the other techniques under consideration. In contrast, the smallest number of genes covered insufficiently (below 2×) is achieved by the Illumina TruSeq exome, although this technology has the largest total target region length. On the other hand, the per-sample time and cost needed for library preparation will always contribute to a decision for using a specific technology. Although the Agilent SureSelect assay (Agilent Technologies) has the poorest and least robust enrichment performance, it is comparatively fast and the handling and training for the workflow are easy especially for small projects. For the NimbleGen technology, the performance is superior to the other assays but this method has the longest preparation time and is therefore less flexible than the others. Also, the size of the target region and thus the gene content is smaller than for the other assays. The Illumina technology provides the easiest workflow for large-scale projects because the samples of a multiplex library can be pooled beforehand and, consequently, the enrichment process has to be run only once. Finally, the number of exomes that can be processed on one lane has direct impact on the sequencing cost, regardless of the price paid per gigabase.

The performance of the enrichment process also depends on the structure of the target sequence, for example, repetitiveness, GC content, and other issues. The general observation that a nonnegligible fraction of the genes covered insufficiently in a specific sample is poorly enriched also in most of the other samples processed by the same product (Supp. Table S2) suggests that failure of a specific subset of the target can be considered an assay-specific characteristic. Because the starting point of our software tool is the vendor-provided target specification file, our study can only measure the assay performances against the description of the vendors rather than using the exact coordinates of the hybridization targets. The way target regions are declared is highly variable between the products. Regarding this, a reanalysis with respect to a reference exome downloaded from the current University of California at Santa Cruz (UCSC) annotation database showed that due to its more comprehensive capture regions, the Illumina TruSeq exome has the largest fraction of this reference exome covered to at least 30× coverage (Supp. Table S5), whereas many genes are missed with the NimbleGen in-solution assay (Roche NimbleGen Inc.). This also suggests that the target definition is a very important point to consider when choosing a particular enrichment product. In addition, the way in which the performance depends on structural peculiarities of the underlying sequences could be subjected to further studies. The sequence capture methods reported here have recently been compared in several other studies, all of which concluded that the Roche NimbleGen SeqCap (Roche NimbleGen Inc.) produces a higher percentage of reads that can be aligned on target than the other assays do [Asan et al., 2011; Clark et al., 2011; Parla et al., 2011; Sulonen et al., 2011]. Clark et al. (2011) also mentioned that the Illumina platform is the only one to contain untranslated regions (UTRs) and hypothesized that the superior performance of the Roche NimbleGen assay (Roche NimbleGen Inc.) is due to its higher bait density.

Our NGSrich software meets the very common demand for a detailed, summarized, and exon-wise analysis of the target enrichment performance of next-generation sequencing libraries. Combined with a scriptable operation mode and HTML-based output reports for efficient integration into a web server, our software is capable of being used with extremely high throughput. With the publication of our software, we have closed an essential gap in the availability of software tools needed to build fast and efficient computational pipeline systems for resequencing experiments.

Supporting Information

References

Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. 2011. Dindel: accurate indel calls from short-read data. Genome Res 21: 961–973. https://doi.org/10.1101/gr.112326.110
Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA. 2007. Direct selection of human genomic loci by microarray hybridization. Nat Methods 4: 903–905. https://doi.org/10.1038/nmeth1111
Asan NF, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, Wang J, Wu M, Liu X, Tian G, Wang J, Yang H, Zhang X. 2011. Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol 12: R95. https://doi.org/10.1186/gb-2011-12-9-r95
Clark MJ, Chen R, Lam HYK, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M. 2011. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 29: 908–914. https://doi.org/10.1038/nbt.1975
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27: 182–189. https://doi.org/10.1038/nbt.1523
Huebner AK, Gandia M, Frommolt P, Maak A, Wicklein EM, Thiele H, Altmüller J, Wagner F, Viñuela A, Aguirre LA, Moreno F, Maier H, Rau I, Gießelmann S, Nürnberg G, Gal A, Nürnberg P, Hübner CA, del Castillo I, Kurth I. 2011. Nonsense mutations in SMPX, encoding a protein responsive to physical force, result in X-chromosomal hearing loss. Am J Hum Genet 88: 1–7. https://doi.org/10.1016/j.ajhg.2011.04.007
Kiialainen A, Karlberg O, Ahlford A, Sigurdsson S, Lindblad-Toh K, Syvänen AC. 2011. Performance of microarray and liquid based capture methods for target enrichment for massively parallel sequencing and SNP discovery. PLoS One 6: e16486. https://doi.org/10.1371/journal.pone.0016486
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25: 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. 2010. Target enrichment strategies for next-generation sequencing. Nat Methods 7: 111–118. https://doi.org/10.1038/nmeth.1419
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5: 621–628. https://doi.org/10.1038/nmeth.1226
Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR. 2011. A comparative analysis of exome capture. Genome Biol 12: R97. https://doi.org/10.1186/gb-2011-12-9-r97
Sulonen AM, Ellonen P, Almusa H, Lepistö M, Eldfors S, Hannula S, Miettinen T, Tyynismaa H, Salo P, Heckman C, Joensuu H, Raivio T, Suomalainen A, Saarela J. 2011. Comparison of solution-based exome capture methods for next-generation sequencing. Genome Biol 12: R94. https://doi.org/10.1186/gb-2011-12-9-r94
Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, Kotsopoulos SK, Samuel ML, Hutchison JB, Larson JW, Topol EJ, Weiner MP, Harismendy O, Olson J, Link DR, Frazer KA. 2009. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 27: 1025–1031. https://doi.org/10.1038/nbt.1583

Filename	Description
humu_22036_sm_SuppInfo.pdf (147.8 KB)	Supporting Information
humu_22036_sm_SuppInfo.xlsx (864 KB)	Supporting Information