Volume 15, Issue 5 pp. 1145-1152

Resource Article

Open Access

Less is more: extreme genome complexity reduction with ddRAD using Ion Torrent semiconductor technology

Lilian Pukk (Corresponding Author)
Department of Aquaculture, Estonian University of Life Sciences, Tartu, 51006 Estonia
Correspondence: Lilian Pukk, Fax: +372 7313 489; E-mail: [email protected]

Freed Ahmad
Department of Information Technology, University of Turku, Turku, 20014 Finland

Shihab Hasan
Bioinformatics Lab, QIMR Berghofer Medical Research Institute, Brisbane, Qld, Australia
School of Medicine, The University of Queensland (UQ), Brisbane, Qld, 4006 Australia

Veljo Kisand
Institute of Technology, University of Tartu, Tartu, 50411 Estonia

Riho Gross
Department of Aquaculture, Estonian University of Life Sciences, Tartu, 51006 Estonia

Anti Vasemägi
Department of Biology, University of Turku, Turku, 20520 Finland

First published: 21 February 2015

https://doi.org/10.1111/1755-0998.12392

Citations: 20

Abstract

Massively parallel sequencing of a small proportion of the whole genome at high coverage enables answering a wide range of questions from molecular evolution and evolutionary biology to animal and plant breeding and forensics. In this study, we describe the development of a restriction-site associated DNA (RAD) sequencing approach for the Ion Torrent PGM platform. Our protocol results in extreme genome complexity reduction using two rare-cutting restriction enzymes and strict size selection of the library, allowing sequencing of a relatively small number of genomic fragments with high sequencing depth. We applied this approach to a common freshwater fish species, the Eurasian perch (Perca fluviatilis L.), generated over 2.2 Mb of novel sequence data consisting of ~17 000 contigs and identified 1259 single nucleotide polymorphisms (SNPs). We also estimated genetic differentiation between the DNA pools from freshwater (Lake Peipus) and brackish water (the Baltic Sea) populations and identified SNPs with the strongest signal of differentiation that could be used for robust individual assignment in the future. This work represents an important step towards developing genomic resources and genetic tools for the Eurasian perch. We expect that our ddRAD sequencing protocol for semiconductor sequencing technology will be a useful alternative to currently available RAD protocols.

Introduction

Since the introduction of next-generation sequencing (NGS) technologies, whole genome sequencing has been carried out on an increasing number of species, triggering a major breakthrough in genetics. For many studies, however, obtaining the complete genome sequence from a large number of individuals still remains prohibitively expensive and is not even necessary. Therefore, several genome complexity reduction methods which allow only a subset of the genome to be sequenced have been recently described (Davey et al. 2011). These methods rely on the generation of reduced representation libraries (RRL), generated by restriction enzymes (Altshuler et al. 2000), selective amplification (van Orsouw et al. 2007), targeted amplicon sequencing (Hyten et al. 2010), capture probes (Shen et al. 2013), methylation filtering (Palmer et al. 2003), high-C₀t selection (Yuan et al. 2003) or by sequencing the transcribed proportion of the genome, that is, the transcriptome (Barbazuk et al. 2007).

One of the most widely used genome complexity reduction methods that have been applied in both model (e.g. humans, laboratory mice and Drosophila) and nonmodel species is restriction-site associated DNA (RAD) sequencing. This method targets DNA sequences flanking specific restriction enzyme cutting sites throughout the genome (Baird et al. 2008; Hohenlohe et al. 2010). To date, RAD sequencing has been applied in various research fields ranging from molecular evolution and evolutionary biology (Emerson et al. 2010) to animal and plant breeding (Yang et al. 2012), conservation (Sharma et al. 2012) and forensics (Ogden et al. 2013). Importantly, RAD sequencing is highly suitable for generating genome-wide genotype data in situations where not much is known about the target genome. As a result, RAD sequencing has been increasingly used in population genetic, biogeographic and phylogenetic studies of nonmodel organisms providing a genome-wide view on the role of different evolutionary forces during population differentiation, adaptation and speciation (Baxter et al. 2011; Bruneaux et al. 2013; Keller et al. 2013).

Earlier RAD library preparation protocols typically consisted of the following steps: restriction enzyme digestion, ligation of the first adapter, physical shearing, end-repair, ligation of the second adapter and size selection. However, more recent studies have simplified and further developed the library preparation procedure by eliminating physical shearing and introducing digestion with two restriction enzymes followed by strict size selection (termed double RAD (Bruneaux et al. 2013) or double-digest RAD (Peterson et al. 2012)). These modifications enable greater flexibility in targeting the optimal number of genomic regions, ranging from thousands of consensus sequences required for QTL mapping to hundreds of thousands of contigs necessary for association studies and hitchhiking mapping.

Most of the RAD studies to date have been carried out using Illumina GAII and HiSeq NGS platforms, but to fully exploit the potential of RAD sequencing, other NGS technologies can be used. For example, the new semiconductor sequencing technology (Ion Torrent PGM) represents a simple, fast (hours instead of days) and flexible solution for small laboratories and individual research groups, but currently, only a few studies have reported the utility of Ion Torrent PGM technology for RAD sequencing (Mascher et al. 2013; Kai et al. 2014). This can be partially explained by the relatively small sequencing throughput of PGM technology, which produces tens of millions of short reads compared to, for example, hundreds of millions of sequences generated by the Illumina HiSeq platform. However, for RAD projects which require sequencing of a smaller number of genomic regions, such as QTL mapping and basic population genetic analysis, Ion Torrent PGM technology is more accessible to smaller research groups, making it a faster and simpler alternative to other NGS platforms. Therefore, it is important to further develop and evaluate RAD sequencing protocols adapted for Ion Torrent PGM that enable high genome complexity reduction by targeting a relatively small number of genomic regions while retaining the high coverage necessary for robust SNP discovery and reliable estimation of various population genetic parameters.

In this study, we describe the development of a two-enzyme RAD sequencing approach for Ion Torrent PGM platform by carrying out extreme genome complexity reduction using rare-cutting restriction enzymes and strict size selection of the library while using DNA pooling as a cost-effective approach to estimate allele frequencies on a genome-wide scale. We applied this approach to the common freshwater fish species, the Eurasian perch (Perca fluviatilis L.). Eurasian perch contributes significantly to both commercial and recreational fisheries in the coastal Estonian waters of the Baltic Sea and in Lake Peipus, the fifth largest lake in Europe (Moran 2003; Pukk et al. 2013). Because of different fishing regulations in these water bodies, the ability to use genetic markers for assigning individuals back to their source population would be of considerable importance for the identification of illegal fishing and fish trade. To develop genomic resources and genetic tools for the Eurasian perch, which could be used in a court of law, we: i) developed a restriction-site associated DNA (RAD) sequencing approach for Ion Torrent PGM platform which results in extreme genome complexity reduction allowing sequencing of a relatively small number of genomic fragments with high sequencing depth; ii) characterized a small proportion of the genome and identified more than a thousand single nucleotide polymorphisms (SNPs) in perch; iii) estimated genetic differentiation between the DNA pools constructed based on individuals collected from freshwater (Lake Peipus) and brackish water (the Baltic Sea) environments; and iv) identified SNPs with the strongest signal of differentiation which could be used for individual assignment in the future.

Materials and methods

DNA isolation and ddRAD library preparation

Seventy-six wild Eurasian perch (Perca fluviatilis L.) were used for ddRAD library preparation. Genomic DNA was isolated from fin clips and dried scale samples as described by Pukk et al. (2013). These samples were divided into two pools: the Baltic Sea pool (Turku Bay, n = 33; Matsalu Bay, n = 6 and Pärnu Bay, n = 11) and Lake Peipus pool (n = 26). DNA quality and concentration were assessed by agarose gel electrophoresis and with a NanoDrop spectrophotometer (Thermo Fisher, Inc.). RAD library preparation protocol broadly followed the methods described by Bruneaux et al. (2013), with some modifications as outlined by Pukk et al. (2014). In short, pooled DNA (800 ng) was digested for 2 h at 37 °C in 20 μL reaction volume, simultaneously with two restriction enzymes, 20 U of PstI (restriction site 5′ CTGCAG 3′) and BamHI (restriction site 5′ GGATCC 3′) (New England Biolabs) and heat-inactivated for 15 min at 75 °C (Fig. 1). For ligation, forward (0.016 pmol) and reverse (0.004 pmol) adapters were added to 20 μL of restriction reaction together with 5 μL 10 × T4 DNA Ligase buffer and 1200 U of T4 DNA ligase (M0202S; NEB) (forward adapter, top: 5′ CCATCTCATCCCTGCGTGTCTCCGACTCAGXXXXXTGCA 3′, forward adapter, bottom: 5′ XXXXXCTGAGTCGGAGACACGCAGGGATGAGATGG 3′, where XXXXX is a barcode sequence; reverse adapter, top: 5′ GATCATCACCGACTGCCCATAGAGAGG 3′, reverse adapter, bottom: 5′ CCTCTCTATGGGCAGTCGGTGAT 3′). To differentiate between the pools, the following two 5-bp barcodes were used (Sea pool: 5′ AGAAC 3′ and Lake pool: 5′ TCGTT 3′) (Fig. 1). The 50 μL ligation reaction was carried out at 22 °C for 1 h and heat-inactivated for 30 min at 65 °C. Each library was subsequently loaded onto E-Gel® SizeSelect 2% Agarose Gel (Life Technologies) to extract DNA fragments of approximately 300 bp length (ranging between 200 and 351 bp). 
Adapter-ligated products were nick-translated and PCR-amplified in 67.5 μL volume containing 14.1 μL of E-Gel extraction product, 50 μL of Platinum PCR SuperMix High Fidelity (Invitrogen) and 0.25 pmol of Ion Torrent primers A and P1. PCR consisted of 72 °C for 20 min, 95 °C for 5 min followed by 18 cycles of 95 °C for 15 s, 62 °C for 15 s with a final extension step at 68 °C for 1 min. Libraries were then purified twice using a 1.8-fold volume of solid-phase reversible immobilization (SPRI) bead solution (Meyer & Kircher 2010) to remove fragments smaller than 100 bp.
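The double-digest and size-selection logic described above can be illustrated in silico. The following is a minimal sketch, not part of the published protocol: the function names are hypothetical, fragment length is measured on the top strand only, and the default size window simply reuses the 200–351 bp gel range even though that range applies to adapter-ligated fragments. Only fragments flanked by one PstI and one BamHI cut are kept, because only those can carry both the forward and the reverse adapter.

```python
import re

# Recognition sites from the text; cut offsets are the standard ones
# (PstI: CTGCA^G, BamHI: G^GATCC), counted on the top strand.
PSTI_SITE, PSTI_CUT = "CTGCAG", 5
BAMHI_SITE, BAMHI_CUT = "GGATCC", 1


def cut_positions(seq, site, offset):
    """Return 0-based indices of the first base downstream of each cut."""
    # A lookahead is used so that overlapping site occurrences are also found.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, min_len=200, max_len=351):
    """Return (start, end, length) for PstI/BamHI-flanked fragments in the size window.

    The size window is an illustrative assumption; adapters add length to
    real library fragments, so the genomic insert window would be narrower.
    """
    cuts = [(p, "PstI") for p in cut_positions(seq, PSTI_SITE, PSTI_CUT)]
    cuts += [(p, "BamHI") for p in cut_positions(seq, BAMHI_SITE, BAMHI_CUT)]
    cuts.sort()
    kept = []
    # Adjacent cut pairs define the fragments of a complete digest.
    for (start, enz_a), (end, enz_b) in zip(cuts, cuts[1:]):
        length = end - start
        if enz_a != enz_b and min_len <= length <= max_len:
            kept.append((start, end, length))
    return kept


if __name__ == "__main__":
    toy = "A" * 10 + PSTI_SITE + "A" * 250 + BAMHI_SITE + "A" * 10
    print(double_digest(toy))
```

Running such a sketch on a related reference genome with different enzyme pairs or size windows gives a rough idea of how many loci a given design would target before any library is prepared.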

Figure 1. The adapters and primers for Ion Torrent sequencing technology. Genomic DNA (gDNA) is simultaneously digested with two restriction enzymes (PstI and BamHI) producing fragments with 5′ TGCA 3′ and 5′ GATC 3′ overhangs, respectively. The 5-bp barcode in forward adapter is marked XXXXX.

In the final RAD libraries, the majority of the fragments (65.4% and 67.9% for Sea and Lake pools, respectively) were between 262 and 299 bp (Agilent 2100 Bioanalyzer). The concentration of the libraries was measured using a Qubit 2.0 Fluorometer (Invitrogen) and then diluted to 28 pM. Sample emulsion PCR, emulsion breaking and enrichment were performed using an Ion Xpress Template Kit, according to the manufacturer's instructions. The libraries (100 μL) were loaded on Ion 314 (10 μL) and 318 (90 μL) chips and sequenced with an Ion Torrent PGM.

Quality control, de novo assembly, mapping and SNP calling

The sequence data acquired using Ion Torrent semiconductor technology were analysed as follows. All the raw sequence reads from the Ion 314 and 318 chips were merged into a single FASTQ file and subsequently split into two (Sea and Lake), based on barcode sequences using fastx_barcode_splitter.pl implemented in fastx-toolkit version 0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/index.html). All three files were then trimmed using the fastq/a trimmer from the fastx-toolkit. First, one bp was removed from the 3′ end of all merged raw reads to avoid integration of an extra nucleotide at the end of the restriction enzyme (BamHI) cutting site. Second, after dividing all the reads into Sea and Lake data sets, 10 bp were trimmed from the 5′ end to remove the barcodes. Subsequently, three filtering criteria were used: (i) reads with lengths of 30–300 bp were retained; (ii) reads were removed if 80% or more of a sequence had a quality score (QS) < 13 and (iii) reads were trimmed using a sliding window method with the quality threshold set to 20 and the minimum read length to 20 bp. Custom Python scripts (available on Dryad) and the computer cluster at the Computing Centre of Finland (CSC; http://www.csc.fi/) were used for all analyses. Following filtering, data quality was checked using prinseq-lite v.0.17.3 (Schmieder & Edwards 2011), after which de novo assembly was performed using mira version 3.9.15 (Chevreux et al. 1999) with job specified as genome, de novo and accurate in manifest file using Ion Torrent specific settings. After de novo assembly, the following filtering was carried out to exclude contigs that most likely contain multiple loci. First, using tablet version 1.13.07.31 (Milne et al. 2013), all the contigs with more than a 2.5% mismatch were discarded. This threshold was set based on the overall distribution of mismatch percentage from the whole data set. 
Similarly to mismatch percentage, we evaluated the distribution of sequencing depth of all contigs and excluded contigs with more than 500× coverage from the subsequent analysis to avoid repetitive sequences. To avoid mis-assembled contigs due to chimeric reads, the remaining contigs were split into two or three shorter contigs, depending upon the number of uncut restriction sites found. The contigs consisting of 40 or more nucleotides were selected as reference for mapping in mira. For mapping, two files containing population-specific reads were used. The longer reads containing uncut motifs were split from the point of restriction site into shorter (> 20 bp) reads. The job mentioned in the manifest file for mira was genome, mapping and accurate, and the chimeric read clipping option was also used. The BAM file produced by mira was then used as an input for samtools v.0.1.18 (Li et al. 2009) for SNP discovery. Potential SNPs were accepted only if they met the following stringent selection criteria: read alignment depth of ≤ 500×; ≥ 5 reads per SNP allele; ≤ 2% of polymorphic sites per contig; quality threshold ≥ 20; and ≤ 2% of mismatch. To avoid incorrect SNP calling caused by frequent homopolymer sequencing errors of the Ion Torrent PGM platform, SNPs associated with three or more consecutive homopolymers were discarded.
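The SNP acceptance criteria listed above can be expressed as a single predicate. The sketch below is an illustration only and is not the authors' script (their Python code is on Dryad): the function and parameter names are hypothetical, the per-SNP values are assumed to have been extracted beforehand from the samtools output, and the homopolymer rule is interpreted here as "the SNP lies within, or directly next to, a run of three or more identical bases", which is one plausible reading of the text.

```python
import re


def in_homopolymer(contig_seq, pos, min_run=3):
    """True if 0-based pos is inside or directly adjacent to a run of >= min_run identical bases."""
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    for m in re.finditer(pattern, contig_seq):
        # m.start()-1 is the base just before the run, m.end() the base just after it.
        if m.start() - 1 <= pos <= m.end():
            return True
    return False


def passes_filters(depth, allele_counts, snp_quality, n_snps_in_contig,
                   contig_len, mismatch_pct, contig_seq, pos):
    """Apply the thresholds stated in the text to one candidate SNP."""
    return (
        depth <= 500                                        # alignment depth <= 500x
        and all(c >= 5 for c in allele_counts)              # >= 5 reads per allele
        and snp_quality >= 20                               # quality threshold >= 20
        and 100.0 * n_snps_in_contig / contig_len <= 2.0    # <= 2% polymorphic sites
        and mismatch_pct <= 2.0                             # <= 2% mismatch
        and not in_homopolymer(contig_seq, pos)             # homopolymer exclusion
    )


if __name__ == "__main__":
    contig = "ACGTACGTAC" * 10  # toy contig without homopolymer runs
    print(passes_filters(80, (50, 30), 30, 1, len(contig), 1.0, contig, 10))
```

Keeping every threshold as an explicit line makes it easy to see which criterion removes most candidates when the filter is applied to a full candidate list.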

Genetic differentiation

G_ST values for SNPs were calculated using formulae described by Bruneaux et al. (2013). To visualize the effect of coverage on genetic differentiation, a LOESS smoothing model in PAST (Hammer et al. 2001) was used (smoothing value was set at 0.15). As G_ST values were strongly affected by coverage, we also used a chi-square test to identify loci showing the strongest genetic differentiation signal between Sea and Lake pools (Günther & Coop 2013). For the 25 top SNP loci that showed the strongest genetic differentiation between Sea and Lake pools based on the chi-square test, we also calculated the 95% confidence limits of G_ST estimates by resampling with replacement (1000 times) the observed allele counts and recalculating G_ST values using poptools 3.2.5. (Hood 2011). This allowed quantification of the effect of coverage (number of reads) on genetic differentiation estimates; small and/or an uneven number of reads per contig and per population were expected to provide relatively inaccurate G_ST estimates, while genetic differentiation estimates are more precise for SNPs, showing high coverage in both population pools.
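To make the three quantities in this section concrete, the sketch below computes a per-SNP G_ST, a 2 × 2 chi-square test on allele read counts, and a bootstrap confidence interval from pooled read counts. It is an illustration under stated assumptions rather than a reproduction of the analysis: the exact formulae of Bruneaux et al. (2013) are not given here, so Nei's unweighted G_ST = (H_T − H_S)/H_T for two pools is used as a stand-in; the study used poptools for resampling, whereas this sketch uses the Python standard library; all function names and the example counts are hypothetical.

```python
import math
import random


def gst(a1, b1, a2, b2):
    """Nei's G_ST for one biallelic SNP from allele read counts in two pools (assumed formula)."""
    p1 = a1 / (a1 + b1)
    p2 = a2 / (a2 + b2)
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-pool heterozygosity
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                       # total heterozygosity
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def chi_square_2x2(a1, b1, a2, b2):
    """Pearson chi-square (1 d.f., no continuity correction) and its p-value."""
    n = a1 + b1 + a2 + b2
    denom = (a1 + b1) * (a2 + b2) * (a1 + a2) * (b1 + b2)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a1 * b2 - b1 * a2) ** 2 / denom
    # For 1 d.f. the chi-square survival function equals erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def bootstrap_gst_ci(a1, b1, a2, b2, n_boot=1000, seed=0):
    """95% percentile interval of G_ST from resampling reads with replacement within each pool."""
    rng = random.Random(seed)

    def resample(a, b):
        n = a + b
        k = sum(rng.random() < a / n for _ in range(n))
        return k, n - k

    values = sorted(
        gst(*resample(a1, b1), *resample(a2, b2)) for _ in range(n_boot)
    )
    return values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    # Hypothetical read counts: (allele A, allele B) in Sea and Lake pools.
    print(gst(40, 10, 10, 40))
    print(chi_square_2x2(40, 10, 10, 40))
    print(bootstrap_gst_ci(40, 10, 10, 40))
```

With low read counts the bootstrap interval becomes wide, which is the coverage effect on G_ST described in the paragraph above.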

SNP validation

For SNP validation, primers for PCR amplification were developed for 24 putative SNP markers, which were selected from the pre-reanalysed data set according to their chi-square value (12 markers showing high differentiation between Lake and Sea pools) or the presence of a PstI restriction site to test the potential presence of chimeras (n = 12). For primer development, the online primer3 v.0.4.0 (Koressaar & Remm 2007; Untergasser et al. 2012) program was used and primers were selected only if they met the following selection criteria: 'any' and '3′' self-complementarity scores close to 0; melting temperature 57–60 °C and SNP position at least 40 bp from left or right primer to ensure successful sequencing results (primer sequences available on Dryad). Sanger sequencing of seven individuals (four from Lake Peipus and three from Pärnu Bay) was carried out at Estonian Biocentre (http://www.ebc.ee) using a 3130xl Genetic Analyzer (Applied Biosystems). For sequence alignment and analysis, the clustalw multiple alignment option in bioedit v7.2.5 (Hall 1999) was used when possible, otherwise sequences were aligned manually.

Contig annotation and categorization

RAD contigs were annotated using blast v2.2.28 (Altschul et al. 1990) (blastx against the Uniprot database and blastn against a nucleotide database) with the E-value threshold set at < 1 × 10⁻⁴. Gene names and Gene Ontology (GO) terms were retrieved from the Uniprot database according to the blastx top hit results. Repetitive genomic elements were detected using repeatmasker v4.0.3 software (Smit et al. 1996–2010).

Results

RAD tag sequencing, de novo assembly and mapping

The sequencing of the RAD libraries with 314 and 318 chips generated 126 327 and 4 045 509 raw reads, respectively. After filtering, 16 886 contigs covering 2 134 779 bp were retained (N50 contig size 82.56 bp; mean coverage 45.19×) and used as a reference for mapping. From the combined reads of both pools (1 377 041 for Sea and 1 158 807 for Lake pool), MIRA used 1 259 177 sequences (2 208 684 bp; 711 194 and 547 983 reads from the Sea and Lake pool, respectively) for mapping, with an N50 contig size of 148 bp and a mean coverage of 74.6× (Table S1, Supporting information).

Contig annotation and categorization

A total of 4369 (25.9%) consensus sequences gave significant blastn (3242 contigs; 19.2%) and/or blastx (2518 contigs; 14.9%) hits. The largest proportion of significant blastx hits came from Oreochromis niloticus (549 contigs; 21.8%), Gasterosteus aculeatus (424 contigs; 16.8%), Takifugu rubripes (233 contigs; 9.3%) and Tetraodon nigroviridis (196 contigs; 7.8%). From all of the blastx hits, GO terms were assigned to 1821 contigs (72.3%).

repeatmasker identified repetitive genomic elements in 1295 contigs. Most of these (1024) were simple repeats (2.16% of the consensus sequence length). Retroelements were found in 160 contigs (0.75% of the sequence), including L2/CR1/Rex (117 contigs; 0.60%), RTE/Bov-B (16 contigs; 0.07%) and SINEs (27 contigs; 0.08%). Transposable DNA element footprints (Tc1-IS630-Pogo) accounted for a further 0.01% of the sequence (3 contigs).

SNP identification and classification

When all available sequence reads (1 259 177) were used against the assembled reference sequence for SNP discovery, 7325 potential SNPs were identified. After stringent filtering, a total of 1259 putative SNPs remained (Table S2, Supporting information), corresponding to a SNP density of one per 1754 bp. The most common SNP variant consisted of G/A or C/T. The distribution of the SNPs along the contigs showed that the frequency of identified polymorphisms decreased after the first 100–150 bp (Fig. 2), indicating that an elevated sequencing error rate towards the end of the reads most likely does not affect the identified 1259 SNPs. The mean overall coverage of the called SNPs was 81.7× (mean coverage of Sea and Lake pools: 46.1× and 35.5×, respectively), and 842 SNPs (67%) had ten or more reads per DNA pool. In total, 32.9% of the filtered SNP-containing contigs had significant blastn (23.3%; 293) and/or blastx (18.8%; 237) hits. Based on the blastx results, 168 and 69 SNPs were located in noncoding and coding regions, respectively, the latter corresponding to 40 synonymous and 29 nonsynonymous substitutions.

Genetic differentiation between Sea and Lake pools

Genetic differentiation, measured as G_ST, varied considerably (0–0.8933) among 1228 SNP loci, which contained alleles counted for both pooled samples (Fig. 3). G_ST values were strongly affected by coverage – most of the SNPs with very high levels of differentiation showed rather low coverage (< 50×), while loci with higher coverage showed similar or slightly higher levels of differentiation than individually genotyped microsatellite loci (for 17 microsatellites average G_ST is 0.016; 95% CI: 0.007–0.025; 24 individuals analysed per population; Pukk et al. 2014). The chi-square test identified loci with high levels of differentiation and low sampling error (e.g. high coverage in both pools) between Sea and Lake pools.

SNP validation

From 24 putative SNP markers selected for validation from pre-reanalysed data set (12 highly differentiated loci and 12 loci containing seemingly uncut PstI restriction site), twelve yielded positive PCR amplification products. Eleven of them represented highly differentiated markers while only a single fragment containing seemingly undigested PstI restriction site was successfully amplified. After cleaning, trimming and aligning the sequences, the presence of both homozygote and heterozygote genotypes could be confirmed in 6 of 12 markers while only homozygotes were observed at four SNP loci and the sequence quality was too low for reliable SNP calling at two loci (Table S2).

Discussion

In this study, we described an efficient genome complexity reduction protocol that uses two rare-cutting restriction enzymes and stringent size selection to make a RAD library compatible with sequencing on Ion Torrent semiconductor technology. We obtained a total of 4 M reads, which were de novo assembled into ~17 000 contigs predicted to cover 0.18% of the P. fluviatilis genome (assuming a genome size of 1.19 Gb, based on Animal Genome Size Database; http://www.genomesize.com). Compared to earlier RAD sequencing studies carried out on a wide range of species, our RAD protocol enables very efficient genome complexity reduction (Fig. 4; Table S3, Supporting information). Therefore, our RAD protocol for the Ion Torrent is expected to be useful for studies which require sequencing of a smaller number of genomic regions, such as mapping QTL loci influencing variation in traits of interest, phylogenetic and population genomic analysis. Using two 6-cutter restriction enzymes (PstI and BamHI) in combination with stringent size selection, we were able to target a small proportion of the genome while retaining high sequencing depth, allowing reliable SNP calling despite the higher sequencing error rate of the Ion Torrent platform compared to currently available light-based sequencing approaches (Loman et al. 2012; Merriman et al. 2012; Bragg et al. 2013). Unlike earlier RAD sequencing studies, which typically run the risk of high DNA loss due to physical shearing and purification steps during agarose gel size selection, our protocol included a fast and simple, yet reproducible, narrow size selection strategy using an E-Gel® SizeSelect™ Agarose Gel. This enabled quick separation and collection of size fractions suitable for a particular sequencing platform, significantly reducing the overall library preparation time.

Using our RAD protocol, the mean distance between two contigs in the Eurasian perch is more than 70 kb (1.19 Gb divided by 16 886 contigs), and only a small number of published RAD studies have resulted in a similar level of genome complexity reduction (Fig. 4; Gagnaire et al. 2013; Hale et al. 2013; Ogden et al. 2013; Richards et al. 2013). All these studies have used a single restriction enzyme (SbfI) with an 8-bp recognition site (5′ CCTGCAGG 3′) and size selection of DNA fragments from 300 bp to 600–700 bp. Importantly, the 8-cutter SbfI creates the same four base pair overhang (5′ TGCA 3′) as the 6-cutter PstI used in this study and therefore, the same forward adapters can be used with both restriction enzymes to target the optimal number of RAD loci to address a particular research question. As such, our protocol enables fast and flexible RAD sequencing and efficient genome complexity reduction with high coverage using semiconductor sequencing technology.

Eurasian perch, as the most common and widely distributed member of the Percidae family, contributes significantly to commercial and recreational fisheries in Estonia. Because of different fishing regulations, the ability to use genetic markers for assigning individuals back to their source population would be of considerable importance. Also, perch has a great potential for freshwater aquaculture (Rossi et al. 2007), where the use of molecular markers could be beneficial. However, not much is known about the P. fluviatilis genome as the genomic resources of the perch currently include a small number of mitochondrial and nuclear sequences. In this study, we obtained over 2.2 Mb of novel sequence data for perch and fifteen per cent of all the consensus sequences gave significant blastx hits. As expected, the largest proportion of those hits came from the Nile tilapia O. niloticus (Perciformes). In total, over 1200 putative SNPs were identified and loci with large allele frequency differences between freshwater (Lake Peipus) and brackish water DNA pools (the Baltic Sea) were found, which could be used for the development of genotyping assays to correctly assign fish to populations of origin in the future.

Compared to the microsatellite markers (Pukk et al. 2014), DNA pooling resulted in similar or higher levels of genetic differentiation estimates for the high coverage SNPs (Fig. 3). On the other hand, the SNP markers with relatively low coverage (< 20–50×) showed inflated G_ST values, demonstrating the importance of high coverage for accurate estimation of genetic differentiation from DNA pools. However, SNP validation of a small subset of loci using Sanger sequencing indicated that not all putative SNP markers represent genuine polymorphisms and therefore, great care should be taken during development of SNP panels for identifying illegal fishing and fish trade. The other critical issue related to allelotyping is the importance of adding equal amounts of DNA from each individual to the DNA pool. However, when using a large number of individuals for generation of DNA pools, accurate allele frequency estimation can be achieved even with relatively unequal contributions of each individual forming the pool (Gautier et al. 2013a,b). The results of this study also revealed that the majority of the loci with apparently uncut PstI and/or BamHI sites most likely represent chimeric sequences. Therefore, we recommend that future studies either exclude putative chimeric sequences before assembly or alternatively carry out in silico cutting of the fragments before subsequent analysis steps. Further optimization of the library protocols and developing alternative strategies to reduce the occurrence of chimeras would be useful.

In summary, this study represents an important step towards developing genomic resources and genetic tools for the Eurasian perch, which could be used in a court of law. Our RAD protocol enables very high genome complexity reduction by targeting a small number of genomic regions while retaining the high coverage necessary for robust SNP discovery and reliable estimation of various population genetic parameters. We expect that the described RAD sequencing protocol for semiconductor sequencing technology will be particularly useful for quick SNP discovery and for testing different size selection procedures, restriction enzymes and other critical steps during library preparation. However, Ion Torrent PGM is currently not the cheapest available NGS technology on the market, which combined with rather low throughput and a relatively high sequencing error rate makes it less suitable for larger genotyping-by-sequencing projects. Therefore, it is important that both the strengths and limitations of the approach are taken into account when choosing an appropriate genome complexity reduction method and sequencing platform (Arnold et al. 2013; Andrews & Luikart 2014; Puritz et al. 2014).

Acknowledgements

We thank M. Lindqvist and O. Thalmann for technical assistance with the Ion Torrent sequencing, and two reviewers and the Subject Editor Travis Glenn for the helpful feedback on an earlier version of the manuscript. We also thank the Finnish Center for Scientific Computing for providing computational resources. The study was supported by the Academy of Finland, Estonian Science Foundation (Grant No. 8215), the Estonian Ministry of Education and Research (institutional research funding project IUT8-2), the European Social Fund's Doctoral Studies and Internationalization Programme DoRa, the Performance Computing Centre of University of Tartu, Estonia, and the European Regional Development Fund through the Center of Excellence in Chemical Biology.

Supporting Information

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410. doi:10.1016/S0022-2836(05)80360-2
Altshuler D, Pollara VJ, Cowles CR et al. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407, 513–516. doi:10.1038/35035083
Andrews KR, Luikart G (2014) Recent novel approaches for population genomics data analysis. Molecular Ecology, 23, 1661–1667. doi:10.1111/mec.12686
Arnold B, Corbett-Detig RB, Hartl D, Bomblies K (2013) RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Molecular Ecology, 22, 3179–3190. doi:10.1111/mec.12276
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers. PLoS ONE, 3, 7.
Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS (2007) SNP discovery via 454 transcriptome sequencing. Plant Journal, 51, 910–918. doi:10.1111/j.1365-313X.2007.03193.x
Baxter SW, Davey JW, Johnston JS et al. (2011) Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. PLoS ONE, 6, e19315. doi:10.1371/journal.pone.0019315
Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW (2013) Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Computational Biology, 9, e1003031. doi:10.1371/journal.pcbi.1003031
Bruneaux M, Johnston SE, Herczeg G et al. (2013) Molecular evolutionary and population genomic analysis of the nine-spined stickleback using a modified restriction-site-associated DNA tag approach. Molecular Ecology, 22, 565–582. doi:10.1111/j.1365-294X.2012.05749.x
Chevreux B, Wetter T, Suhai S (1999) Genome sequence assembly using trace signals and additional sequence information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB), 99, 45–56.
Davey JW, Hohenlohe PA, Etter PD et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics, 12, 499–510. doi:10.1038/nrg3012
Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences, USA, 107, 16196–16200. doi:10.1073/pnas.1006538107
Gagnaire PA, Normandeau E, Pavey SA, Bernatchez L (2013) Mapping phenotypic, expression and transmission ratio distortion QTL using RAD markers in the Lake Whitefish (Coregonus clupeaformis). Molecular Ecology, 22, 3036–3048. doi:10.1111/mec.12127
Gautier M, Foucaud J, Gharbi K et al. (2013a) Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Molecular Ecology, 22, 3766–3779. doi:10.1111/mec.12360
Gautier M, Gharbi K, Cezard T et al. (2013b) The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology, 22, 3165–3178. doi:10.1111/mec.12089
Günther T, Coop G (2013) Robust Identification of Local Adaptation from Allele Frequencies. Genetics, 195, 205–220. doi:10.1534/genetics.113.152462
Hale MC, Thrower FP, Berntson EA, Miller MR, Nichols KM (2013) Evaluating adaptive divergence between migratory and nonmigratory ecotypes of a Salmonid Fish, Oncorhynchus mykiss. G3-Genes Genomes Genetics, 3, 1273–1285. doi:10.1534/g3.113.006817
Hall TA (1999) bioedit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series, 41, 95–98. doi:10.1007/s00299-001-0399-7
Hammer Ø, Harper DAT, Ryan PD (2001) past: paleontological statistics software package for education and data analysis. Palaeontologia Electronica, 4, 9.
Hohenlohe PA, Bassham S, Etter PD et al. (2010) Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics, 6, e1000862. doi:10.1371/journal.pgen.1000862
Hood GM (2011) poptools version 3.2.5. Available on the internet. URL http://www.poptools.org.
Hyten DL, Cannon SB, Song Q et al. (2010) High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics, 11, 38. doi:10.1186/1471-2164-11-38
Kai W, Nomura K, Fujiwara A et al. (2014) A ddRAD-based genetic map and its integration with the genome assembly of Japanese eel (Anguilla japonica) provides insights into genome evolution after the teleost-specific genome duplication. BMC Genomics, 15, 16. doi:10.1186/1471-2164-15-233
Keller I, Wagner CE, Greuter L et al. (2013) Population genomic signatures of divergent adaptation, gene flow and hybrid speciation in the rapid radiation of Lake Victoria cichlid fishes. Molecular Ecology, 22, 2848–2863. doi:10.1111/mec.12083
Koressaar T, Remm M (2007) Enhancements and modifications of primer design program Primer3. Bioinformatics, 23, 1289–1291. doi:10.1093/bioinformatics/btm091
Li H, Handsaker B, Wysoker A et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079. doi:10.1093/bioinformatics/btp352
Loman NJ, Misra RV, Dallman TJ et al. (2012) Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, 30, 434–439. doi:10.1038/nbt.2198
Mascher M, Wu S, Amand PS, Stein N, Poland J (2013) Application of genotyping-by-sequencing on semiconductor sequencing platforms: a comparison of genetic and reference-based marker ordering in Barley. PLoS ONE, 8, e76925. doi:10.1371/journal.pone.0076925
Merriman B, Rothberg JM, Ion Torrent R&D Team (2012) Progress in ion torrent semiconductor chip based sequencing. Electrophoresis, 33, 3397–3417. doi:10.1002/elps.201200424
Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols, 2010. doi:10.1101/pdb.prot5448
Milne I, Stephen G, Bayer M et al. (2013) Using tablet for visual exploration of second-generation sequencing data. Briefings in Bioinformatics, 14, 193–202. doi:10.1093/bib/bbs012
Moran MD (2003) Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos, 100, 403–405. doi:10.1034/j.1600-0706.2003.12010.x
Ogden R, Gharbi K, Mugue N et al. (2013) Sturgeon conservation genomics: SNP discovery and validation using RAD sequencing. Molecular Ecology, 22, 3112–3123. doi:10.1111/mec.12234
van Orsouw NJ, Hogers RCJ, Janssen A et al. (2007) Complexity reduction of polymorphic sequences (CRoPS (TM)): a novel approach for large-scale polymorphism discovery in complex genomes. PLoS ONE, 2, 10.
Palmer LE, Rabinowicz PD, O'Shaughnessy AL et al. (2003) Maize genome sequencing by methylation filtration. Science, 302, 2115–2117. doi:10.1126/science.1091265
Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE (2012) Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE, 7, 11. doi:10.1371/journal.pone.0037135
Pukk L, Kuparinen A, Järv L, Gross R, Vasemägi A (2013) Genetic and life-history changes associated with fisheries-induced population collapse. Evolutionary Applications, 6, 749–760. doi:10.1111/eva.12060
Pukk L, Kisand V, Ahmad F, Gross R, Vasemägi A (2014) Double-restriction-site-associated DNA (dRAD) approach for fast microsatellite marker development in Eurasian perch (Perca fluviatilis L.). Conservation Genetics Resources, 6, 183–184. doi:10.1007/s12686-013-0042-2
Puritz JB, Matz MV, Toonen RJ et al. (2014) Demystifying the RAD fad. Molecular Ecology, 23, 5937–5942. doi:10.1111/mec.12965
Richards PM, Liu MM, Lowe N et al. (2013) RAD-Seq derived markers flank the shell colour and banding loci of the Cepaea nemoralis supergene. Molecular Ecology, 22, 3077–3089. doi:10.1111/mec.12262
Rossi F, Chini V, Cattaneo AG et al. (2007) EST-based identification of genes expressed in perch (Perca fluviatilis, L.). Gene Expression, 14, 117–127. doi:10.3727/105221607783417600
Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27, 863–864. doi:10.1093/bioinformatics/btr026
Sharma R, Goossens B, Kun-Rodrigues C et al. (2012) Two different high throughput sequencing approaches identify thousands of de novo genomic markers for the genetically depleted Bornean elephant. PLoS ONE, 7, e49533. doi:10.1371/journal.pone.0049533
Shen PD, Wang WY, Chi AK et al. (2013) Multiplex target capture with double-stranded DNA probes. Genome Medicine, 5, 8. doi:10.1186/gm454
Smit AFA, Hubley R, Green P (1996–2010) RepeatMasker Open-3.0. URL http://www.repeatmasker.org.
Untergasser A, Cutcutache I, Koressaar T et al. (2012) Primer3 - new capabilities and interfaces. Nucleic Acids Research, 40, 12. doi:10.1093/nar/gks596
Yang HA, Tao Y, Zheng ZQ et al. (2012) Application of next-generation sequencing for rapid marker development in molecular plant breeding: a case study on anthracnose disease resistance in Lupinus angustifolius L. BMC Genomics, 13, 11. doi:10.1186/1471-2164-13-318
Yuan YN, SanMiguel PJ, Bennetzen JL (2003) High-Cot sequence analysis of the maize genome. Plant Journal, 34, 249–255. doi:10.1046/j.1365-313X.2003.01716.x

A.V. designed the RAD protocol; L.P. and A.V. designed the study. L.P. conducted the laboratory work. A.V., L.P., F.A., S.H., R.G. and V.K. carried out the bioinformatics analysis. L.P. wrote the first draft of the manuscript. All authors contributed to the writing of the manuscript, and read and approved the final manuscript.

Data Accessibility

Raw sequence data, Sanger sequence alignments, Python scripts and primer sequences are available on Dryad, doi:10.5061/dryad.s2405.

Table 1 SuppInfo.xlsb contains the annotated assembled contigs.

Table 2 SuppInfo.xlsb contains the filtered SNP data set.

Volume 15, Issue 5, September 2015, Pages 1145-1152

Filename	Description
men12392-sup-0001-TableS1.xlsb (MS Excel, 2.9 MB)	Table S1 Mapped consensus sequences with BLASTN, BLASTX and RepeatMasker results.
men12392-sup-0002-TableS2.xlsb (MS Excel, 387.6 KB)	Table S2 Filtered putative SNPs with allele frequencies, Gst, BLASTN, BLASTX and RepeatMasker results.
men12392-sup-0003-TableS3.xlsb (MS Excel, 22.7 KB)	Table S3 Next-generation sequencing studies using the restriction-site-associated DNA approach that cite Baird et al. (2008).
