Optimizing ddRAD sequencing for population genomic studies with ddgRADer
Abstract
Double-digest Restriction-site Associated DNA sequencing (ddRADseq) is widely used to generate genomic data for non-model organisms in evolutionary and ecological studies. Along with affordable paired-end sequencing, this method makes population genomic analyses more accessible. However, multiple factors should be considered when designing a ddRADseq experiment, which can be challenging for new users. The generated data often suffer from substantial read overlaps and adaptor contamination, severely reducing sequencing efficiency and affecting data quality. Here, we analyse diverse datasets from the literature and carry out controlled experiments to understand the effects of enzyme choice and size selection on sequencing efficiency. The empirical data reveal that size selection is imprecise and has limited efficacy. In certain scenarios, a substantial proportion of short fragments pass below the lower size-selection cut-off resulting in low sequencing efficiency. However, enzyme choice can considerably mitigate inadvertent inclusion of these shorter fragments. A simple model based on these experiments is implemented to predict the number of genomic fragments generated after digestion and size selection, number of SNPs genotyped, number of samples that can be multiplexed and the expected sequencing efficiency. We developed ddgRADer – http://ddgrader.haifa.ac.il/ – a user-friendly webtool and incorporated these calculations to aid in ddRADseq experimental design while optimizing sequencing efficiency. This tool can also be used for single enzyme protocols such as Genotyping-by-Sequencing. Given user-defined study goals, ddgRADer recommends enzyme pairs and allows users to compare and choose enzymes and size-selection criteria. ddgRADer improves the accessibility and ease of designing ddRADseq experiments and increases the probability of success of the first population genomic study conducted in labs with no prior experience in genomics.
1 INTRODUCTION
Restriction-site Associated DNA sequencing (RADseq; Baird et al., 2008) is a widely used method to generate reduced representation genomic libraries. This approach dramatically reduces the genotyping cost per sample compared to whole genome sequencing, thereby allowing a limited budget project to achieve larger sample sizes, which are the most important factor determining statistical power of population genomic analyses. RADseq has been particularly beneficial in SNP discovery and genotyping of non-model organisms for population, ecological, evolutionary and conservation genomic studies (Clugston et al., 2019; Davey et al., 2011; Eaton & Ree, 2013; Luikart et al., 2003; Wagner et al., 2013). The key advantage of this method is the flexibility it offers in the number of cut-sites that can be targeted across the genome, through the choice of appropriate restriction enzymes. This allows for a wide range of applications, from demographic surveys and population structure analyses that guide conservation efforts (e.g. Natesh et al., 2017; Zecherle et al., 2021), to high-resolution scans looking for the genomic basis of local adaptation (e.g. Magalhaes et al., 2021).
Numerous methods of generating reduced representation libraries are available that offer different advantages and help deal with different biases in data (reviewed in Andrews et al., 2016). Double-digest RAD sequencing (ddRADseq) is one of the most commonly used variations of RADseq, where genomic DNA is digested with two different restriction enzymes, generating genomic fragments in a reproducible manner (Peterson et al., 2012). Typically, a combination of a rare- and a frequent-cutting enzyme is used. The rare cutter determines the number of fragments sequenced and the frequent cutter determines the average length of these fragments. While the general rule of thumb is that enzymes that recognize shorter target sequences cut more often, the GC content in the recognition sequence also affects the number of cut sites. When choosing enzymes for a ddRADseq experiment, the size and GC content of the genome, and the polymorphism expected in the study population should be taken into consideration as they determine the number of polymorphic loci that will be genotyped, the depth per locus, and hence the number of samples that can be multiplexed in a single lane of sequencing. Taking these numerous factors into account can make designing a ddRADseq study challenging for beginners and can easily result in a suboptimal study design even for experienced researchers. This can lead to substantial waste of sequencing effort. Moreover, it can generate sequencing data that are ill-suited to the downstream analyses and study goals (Puritz et al., 2014). Given the wide applicability of this method, it is crucial to make designing ddRADseq experiments more accessible and aid researchers in optimal experimental design.
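The rule of thumb about recognition-site length and GC content can be illustrated with a rough calculation. The sketch below assumes that bases occur independently with frequencies set only by the genome-wide GC content, which is a simplification; the function names, the 100 Mb genome size and the 33% GC value are illustrative assumptions, and the recognition sequences are those of EcoRI, MseI and BfaI.

```python
def site_probability(site: str, gc: float) -> float:
    """Probability that a random position matches `site`, assuming
    independent bases whose frequencies depend only on GC content."""
    freq = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in site.upper():
        prob *= freq[base]
    return prob


def expected_sites(site: str, gc: float, genome_size: int) -> float:
    """Expected number of recognition sites in a genome of `genome_size` bp."""
    return site_probability(site, gc) * genome_size


if __name__ == "__main__":
    genome_size = 100_000_000  # hypothetical genome size (bp)
    gc = 0.33                  # illustrative GC-poor genome
    for name, site in [("EcoRI", "GAATTC"), ("MseI", "TTAA"), ("BfaI", "CTAG")]:
        print(name, site, round(expected_sites(site, gc, genome_size)))
```

Under these assumptions, a GC-poor genome is expected to contain more sites for an AT-only 4-base recognition sequence than for a 4-base sequence with 50% GC, even though both are the same length.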
Two of the biggest challenges in using ddRADseq are lack of appropriate training in designing experiments and the overall high cost of the project (compared to Sanger sequencing or microsatellite genotyping), despite the relatively low per-sample cost (compared to whole genome sequencing). Given the large capacity of a single lane of Illumina sequencing, most projects tend to conduct only one sequencing run on all samples together, multiplexed in one or a few sequencing lanes. Thus, unlike Sanger sequencing experiments, most researchers cannot afford sequential improvements to the experimental design. These challenges are particularly consequential for researchers in developing regions, including the tropics where most of the biodiversity lies and conservation efforts are needed. Several tools are available for processing of ddRADseq data and SNP calling (e.g. Catchen et al., 2011, 2013; Eaton & Overcast, 2020; Nadukkalam Ravindran et al., 2019; Puritz et al., 2014; Sovic et al., 2015). However, these applications are meaningful only when the sequence data are generated with an appropriate experimental design. Relatively few tools help in optimizing the experimental design (but see Lepais & Weir, 2014; Mora-Márquez et al., 2017; Rivera-Colón et al., 2021). While these tools may help in enzyme choice and size selection, they are based only on predictions generated in silico that may be unrealistic.
With the advent of Illumina paired-end sequencing, where reads are sequenced from both ends of a DNA fragment, twice the amount of data can be generated at a relatively low additional cost compared to single-end sequencing. Thus, paired-end sequencing is nowadays the common practice. When carrying out paired-end sequencing of 150 bp reads, if the DNA fragment (insert size) is less than 300 bp (two read lengths), the ends of the reads overlap, and some genomic positions are sequenced twice. If the insert size is less than 150 bp (one read length), the adaptors are inadvertently sequenced leading to adaptor contamination at the end of each read, in addition to complete overlap. Any sequenced fragments shorter than two read lengths lead to considerable loss of sequencing effort apart from a dataset contaminated with adaptors. RADseq protocols include measures to remove short fragments using magnetic beads and gel-based size selection (e.g. the BluePippin instrument). The magnetic beads selectively bind to DNA fragments typically between 200 and 500 bp, but they do not completely exclude short fragments. It is possible to alter the bead-to-sample ratio to shift the minimum size threshold up or down, but this flexibility is rather limited compared to gel-based size selection. When using only magnetic beads for size selection, a previous study in our lab (Inbar et al., in prep) resulted in 74% of the sequenced read pairs having overlaps (i.e. fragments shorter than two read lengths). As a result, 27% of sequenced bases were wasted, that is, sequencing efficiency of only 73%. While the data were adequate to address the question in hand, this massive waste of sequencing effort motivated us to conduct a methodological study to optimize our sequencing efforts.
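The relationship between insert size, read length and wasted sequencing can be written down compactly. The following is a minimal sketch, assuming that every base sequenced beyond the unique genomic insert (overlap or adaptor) counts as wasted; the function names are hypothetical and the exact accounting used elsewhere may differ.

```python
def classify_fragment(insert_size: int, read_length: int = 150) -> str:
    """Classify a paired-end fragment by its insert size."""
    if insert_size < read_length:
        # Both reads run past the insert into the adaptor.
        return "adaptor contamination"
    if insert_size < 2 * read_length:
        # Reads stay within the insert but cover some positions twice.
        return "read overlap"
    return "no overlap"


def wasted_bases(insert_size: int, read_length: int = 150) -> int:
    """Bases (out of 2 * read_length sequenced) that add no new genomic
    information: positions read twice plus adaptor sequence."""
    return max(0, 2 * read_length - insert_size)


def sequencing_efficiency(insert_sizes, read_length: int = 150) -> float:
    """Fraction of sequenced bases that is not wasted, over all fragments."""
    sequenced = 2 * read_length * len(insert_sizes)
    wasted = sum(wasted_bases(s, read_length) for s in insert_sizes)
    return (sequenced - wasted) / sequenced


if __name__ == "__main__":
    for size in (100, 200, 300, 450):
        print(size, classify_fragment(size), wasted_bases(size))
```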
In this paper, we describe a user-friendly webtool that we developed based on our methodological insights, to assist researchers in choosing enzyme pairs and size-selection criteria while minimizing waste due to short fragments. The tool allows researchers to digest their genome of interest in silico, compare enzyme combinations and size-selection cut-offs, calculate the number of samples to be multiplexed and predict the extent of adaptor contamination and read overlaps. The development of this tool was based on our work to optimize ddRADseq experimental design by minimizing adaptor contamination and overlapping reads. We conducted a literature survey to look at the common choices of enzyme combinations, the size-selection methods and cut-offs used in ddRADseq studies. We re-analysed 12 of these datasets to understand the relationship between empirical adaptor contamination and read overlaps, and the theoretical expectations from in silico digestion. We conducted controlled experiments to test switching the frequent-cutting enzyme to an enzyme with fewer cut sites, to increase the fragment sizes and hence reduce the waste and improve sequencing efficiency. This led to insights into the efficacy of size selection, allowing more realistic predictions for the implications of alternative enzyme choices and size-selection cut-offs. We hope that the insights from this study and their implementation in a user-friendly webtool will help inexperienced researchers maximize their ddRADseq sequencing efficiency and optimize their experimental design.
1.1 Webtool: ddgRADer
ddgRADer (http://ddgrader.haifa.ac.il/) was built to optimize sequencing efficiency by aiding enzyme and size-selection choices. It was implemented as a webtool to make designing ddRADseq experiments more accessible. This tool can also be used for other RADseq protocols that involve one or two enzymes and no random shearing such as Genotyping-by-Sequencing (GBS; Elshire et al., 2011) or ezRAD (Toonen et al., 2013).
ddgRADer works on a user-provided reference genome along with some additional basic information. The genome is digested in silico (Python scripts ‘HandleFastafile.py’ and ‘DigestSequence.py’) with either a predetermined list of 14 commonly used enzyme pairs or enzyme combinations chosen by the user from a comprehensive list (see the two usage modes ‘I have no idea!’ and ‘Lemme try out!’ in Figure 1). Depending on the question being addressed, the user specifies either the desired number of SNPs to be genotyped (e.g. to carry out demographic analysis) or the desired SNP density (for a genomic scan analysis). The user also enters an estimate of the expected SNP density in the study population, which is needed to predict the number of SNPs that will be genotyped by ddRADseq (the user guide provides some examples for reference). Additional input information includes read length, single- or paired-end sequencing, expected yield from one sequencing lane, and the desired average sequencing depth for each sample. Details of the ddgRADer code are provided in the README file at https://github.com/felixglinka/ddRadSeqWebTool/blob/development/README.md

The in silico digestion yields the expected number of fragments in each size bin that are flanked by a cut site of one enzyme at one end and a cut site of the other enzyme at the other end. The predicted fragment size distributions from digestion by different enzyme pairs are plotted. Further outputs are given (by JavaScript code ‘dataFrame.js’) as an interactive table where the user can try out different size-selection cut-offs for each of the enzyme combinations to compare the number of SNPs in the digested sample, the number of samples that can be multiplexed and the expected sequencing efficiency (see Figure 1). Sequencing efficiency is calculated as the percentage of sequenced bases that is not wasted on overlaps or adaptor contamination.
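To make the idea of an in silico double digest concrete, the following is a minimal sketch and not the code in ‘DigestSequence.py’. It assumes palindromic recognition sites (so one strand suffices), ignores adaptor length, and keeps only fragments bounded by a cut from each of the two enzymes; the function names and the toy sequence are hypothetical.

```python
import re


def cut_positions(seq: str, site: str, offset: int) -> list:
    """Cut coordinates for one enzyme; `offset` is the cut position
    within the recognition site (e.g. 1 for G|AATTC)."""
    # A lookahead pattern also finds overlapping occurrences of the site.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq: str, enzyme_a: tuple, enzyme_b: tuple) -> list:
    """Lengths of fragments flanked by one cut from each enzyme.

    Each enzyme is given as (recognition_site, cut_offset)."""
    cuts = [(pos, "A") for pos in cut_positions(seq, *enzyme_a)]
    cuts += [(pos, "B") for pos in cut_positions(seq, *enzyme_b)]
    cuts.sort()
    lengths = []
    # Adjacent cuts define the fragments of a complete digest.
    for (left, left_enzyme), (right, right_enzyme) in zip(cuts, cuts[1:]):
        if left_enzyme != right_enzyme:
            lengths.append(right - left)
    return lengths


if __name__ == "__main__":
    toy = "AAGAATTCCCCCTTAAGGGGGAATTCAA"  # made-up sequence
    ecori = ("GAATTC", 1)  # 5'-G|AATTC-3'
    msei = ("TTAA", 1)     # 5'-T|TAA-3'
    print(double_digest(toy, ecori, msei))
```

The resulting list of lengths can then be binned to obtain a predicted fragment size distribution for a given enzyme pair.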
In addition to these theoretical values, the table also lists values predicted by models based on empirical data from the meta-analysis and our controlled experiment (see Figure 3 in Section 3), incorporating the effect of incomplete size selection. This is done using two linear models based on a linear regression analysis, one for read overlaps and one for adaptor contamination (Figure 3; details provided in Sections 2 and 3). These estimates cannot be taken as precise predictions because of the large variation we observed across the datasets included in the meta-analysis. They are meant to be suggestive of the incomplete effectiveness of size selection, and therefore, the expected sequencing efficiency. They should only be taken as a cautionary indication of how unrealistic the theoretical predictions might be.
ddgRADer can also be used to visualize in silico digestion by a single enzyme by choosing the same enzyme twice. This might be relevant for methods such as genotyping-by-sequencing (GBS) that involve digestion by a single enzyme. However, it cannot be used for protocols that involve further random shearing of the DNA fragments generated after digestion, such as single-digest RAD sequencing. The workflow of ddgRADer is shown in Figure 1 and full details of the internal calculations and predictions are given in Table S3. The scripts used in ddgRADer are available on GitHub at https://github.com/felixglinka/ddRadSeqWebTool.
2 MATERIALS AND METHODS
2.1 Literature survey
A literature survey was first carried out to review the diversity of enzyme pairs and size-selection cut-offs used in ddRADseq studies and estimate the proportion of short fragments resulting from these choices. To keep the number of papers manageable, we limited our search to papers published in Molecular Ecology and Molecular Ecology Resources as these journals routinely publish studies involving ddRADseq. The keywords ‘ddRAD’ and ‘paired-end’ were used for the search in the software Publish or Perish (PoP; Harzing, 2007), which allowed us to get the data in a tabular format. The enzymes used, size-selection method and cut-offs, the stage in which size selection was performed (before or after PCR) and read length were noted. We then downloaded the raw sequencing data from NCBI or Dryad and re-analysed them for these studies whenever possible. Most of the published datasets could not be used for one of the following reasons: no closely related reference genome was available for the study species, the sequencing data were not made public, only forward reads were published, only filtered sequencing data were published, or the numbers of reads in the forward and reverse files did not match. This resulted in 12 datasets which were further analysed (Baiz et al., 2019; Combs et al., 2018; de Jong et al., 2020; Farleigh et al., 2021; Fritz et al., 2018; Ivanov et al., 2018; Maigret et al., 2020; Portnoy et al., 2015; Ryan et al., 2017; Schley et al., 2020; Termignoni-García et al., 2017; Trense et al., 2021) where size selection was carried out using Pippin Prep or BluePippin before the PCR step.
We used FastQC (Andrews, 2010) to estimate the level of adaptor contamination in each of the datasets. Adaptors were then trimmed using Trimmomatic (Bolger et al., 2014) with the default settings, which require a minimum of 8 bp of adaptor sequence for identification. Sequences were not filtered or trimmed for low quality. FLASH (Magoč & Salzberg, 2011) was used to combine the trimmed overlapping paired reads and get a fragment size distribution of the combined reads. The output of FLASH was a histogram of all the combined reads, which was then used to find the proportion of read overlap and adaptor contamination out of the total number of reads in each empirical dataset. We obtained the theoretically predicted fragment size distribution from the in silico digestion of the closest available reference genome using custom Python scripts (‘HandleFastafile.py’ and ‘DigestSequence.py’, available in the ddgRADer GitHub repository mentioned above). Based on this, we calculated the proportion of fragments (out of the total number of fragments shorter than 1000 bp) that were below two read lengths for the respective dataset.
To test whether in silico prediction and size-selection criteria significantly predicted the empirical outcome of adaptor contamination and overlapping reads, we performed two multiple linear regressions, one for adaptor contamination (fragments shorter than one read length) and another for read overlaps (between one and two read lengths). The empirical proportion of adaptor contamination (or overlaps) was the response variable, with two predictor variables: in silico proportion of adaptor contamination (or overlaps) and the width of the size-selection range. Note that the in silico proportion is the proportion of short fragments in the in silico digestion of the genome, before any size selection. We also tested for an interaction between these two predictors. Datasets where the size selection cut-off overlapped with the adaptor contamination range or read overlap range were excluded for the respective analyses. This was necessary because the cut-off reported in the papers often did not match with the one observed in the empirical data, and it was difficult to determine the precise cut-off based on visualizing the fragment size distributions. We further excluded the study in Figure S1l as the empirical results suggested that size-selection failed entirely. This resulted in a total of 12 datasets that were used for the adaptor contamination regression (10 of 12 datasets from the literature review along with two datasets from our controlled experiment; Figure S1l and Figure 2d were excluded), and eight datasets were used for the read overlap model (eight of 12 datasets from the literature review; Figure 2b–f and Figure S1l were excluded).
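The structure of such a regression (two predictors plus their interaction) can be sketched as follows. This is an illustration only: it uses ordinary least squares solved through the normal equations in plain Python, the function names are hypothetical, and the data are synthetic values generated from made-up coefficients, not the datasets analysed here.

```python
def design_row(in_silico: float, width: float) -> list:
    """Intercept, both predictors, and their interaction term."""
    return [1.0, in_silico, width, in_silico * width]


def fit_ols(rows: list, y: list) -> list:
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Synthetic, noise-free example: (in silico proportion, window width in
# hundreds of bp). All values and coefficients are made up.
points = [(0.20, 4.0), (0.30, 1.0), (0.50, 3.7), (0.70, 0.36),
          (0.85, 5.2), (0.45, 2.0), (0.60, 1.5), (0.25, 5.0)]
true_coefs = [0.01, 0.10, 0.005, 0.03]
rows = [design_row(p, w) for p, w in points]
response = [sum(c * x for c, x in zip(true_coefs, row)) for row in rows]

if __name__ == "__main__":
    print([round(c, 4) for c in fit_ols(rows, response)])
```

With real data, a statistics package would additionally provide standard errors and significance tests for each term, which this sketch omits.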
2.2 Controlled experiment to study the effect of enzyme choice
We conducted a controlled experiment on Camponotus carpenter ants to understand the effect of changing the frequent-cutting enzyme on the proportion of short fragments and consequently on sequencing efficiency.
2.2.1 In silico digestion and size-selection cut-offs
The Camponotus floridanus reference genome (NCBI accession QANI01000000) was digested in silico to compare the fragment size distributions generated by both pairs of enzymes as described below. To estimate the number of genomic sites that were expected to be sequenced at different size-selection cut-offs, taking into account the waste due to adaptor contamination and read overlaps, an Excel spreadsheet calculator was created (Appendix S1). Additionally, we estimated the number of samples that could be multiplexed with a given yield of an Illumina sequencing lane and depth requirement of the study. These calculations were subsequently implemented in the webtool ddgRADer and are described in Table S3.
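The kind of arithmetic involved in the multiplexing estimate can be sketched as below. This is an assumed simplification, not the calculation in Appendix S1 or Table S3: it treats every retained fragment as one locus sequenced by one read pair per unit of depth, and all input values in the example are hypothetical.

```python
def samples_per_lane(lane_read_pairs: int, loci_per_sample: int, depth: int) -> int:
    """Number of samples that fit in one lane if each locus in each sample
    needs `depth` read pairs (assumes perfectly even coverage)."""
    return lane_read_pairs // (loci_per_sample * depth)


def expected_snps(sequenced_sites: int, snp_density: float) -> float:
    """Expected number of SNPs given the number of genomic sites sequenced
    and the expected SNP density (SNPs per bp) in the study population."""
    return sequenced_sites * snp_density


if __name__ == "__main__":
    # Hypothetical inputs: 340 million read pairs per lane,
    # 50,000 loci per sample, 20x average depth.
    print(samples_per_lane(340_000_000, 50_000, 20))
    # Hypothetical inputs: 5 million sites, 1 SNP per 200 bp.
    print(expected_snps(5_000_000, 1 / 200))
```

In practice coverage is uneven across loci and samples, so a safety margin on the depth requirement is advisable.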
2.2.2 Sampling, library construction and sequencing
Samples of 125 male Camponotus fellah were collected from Rehovot, Israel (31°54′32.1″ N 34°48′45.7″ E) from a single nest and stored at −80°C. DNA was extracted from the abdomen of the ants using the Qiagen DNeasy blood and tissue kit followed by ethanol precipitation of the DNA extracts. Genomic libraries were constructed following Brelsford et al. (2016), a modified double-digest Restriction-site Associated DNA sequencing (ddRADseq) protocol based on Parchman et al. (2012) and Peterson et al. (2012) (Appendix S2). To understand the effect of enzyme choice on the resulting fragment size distribution and the proportion of waste due to short fragments, we compared libraries built using different restriction enzyme pairs. One way to switch enzymes while keeping the same Illumina sequencing adaptors is to choose an enzyme that generates the same 3′ overhang. One set of libraries was constructed using the commonly used combination of EcoRI and MseI. EcoRI is a 6-base rare cutter (recognition sequence 5′-G|AATTC-3′) while MseI is a 4-base frequent cutter (recognition sequence 5′-T|TAA-3′). Given that the recognition site for MseI is GC poor, for a genome with <50% GC content, MseI cuts frequently, generating a large proportion of small fragments. BfaI is a 4-base frequent cutter generating an overhang identical to MseI, yet its recognition site contains more GC than that of MseI (recognition sequence 5′-C|TAG-3′). Therefore, the GC-poor genome of C. fellah contains fewer BfaI than MseI sites, which should increase the fragment sizes and improve sequencing efficiency (in general, ant genomes contain between 30% and 50% GC (Simola et al., 2013), and the genome of C. floridanus has only 33% GC).
Briefly, genomic DNA was digested using two restriction enzymes and adaptors containing unique barcodes were ligated to each sample for multiplexing (using CutSmart buffer® NEB). The ligated products were PCR amplified in replicates of four using Q5 Hot Start Polymerase® NEB for 20 cycles with a starting DNA volume increased to 7 µl per reaction. Additionally, primers and dNTPs were added to a final thermal cycle step to reduce single-stranded or heteroduplex PCR products. All the samples for each pair of enzymes were pooled and the two sets of libraries were size selected using BluePippin (Sage Science). One hundred twenty-five samples were used for MseI/EcoRI and 117 of them were also used for building BfaI/EcoRI libraries. The BfaI/EcoRI libraries were selected for a range of 330–700 bp and MseI/EcoRI for a range of 300–700 bp (this size includes 128 bp of the adaptors and primers). The rationale behind choosing different size-selection ranges is discussed in detail in the Results and discussion section below. The libraries produced with the two sets of enzymes were sequenced with paired-end 150 bp reads on a single lane of an Illumina HiSeq X sequencer to generate 340 million paired-end reads (NCBI Bioproject ID: PRJNA999052).
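Because the size-selection ranges above are stated for the full library molecule, the corresponding genomic insert sizes follow by subtracting the 128 bp of adaptors and primers. A small sketch of this conversion (the function name is hypothetical):

```python
ADAPTOR_AND_PRIMER_BP = 128  # adaptor plus primer length stated in the text


def insert_range(library_min: int, library_max: int,
                 extra: int = ADAPTOR_AND_PRIMER_BP) -> tuple:
    """Convert a size-selection range on the library molecule into the
    range of genomic insert sizes it corresponds to."""
    return library_min - extra, library_max - extra


if __name__ == "__main__":
    print("MseI/EcoRI:", insert_range(300, 700))
    print("BfaI/EcoRI:", insert_range(330, 700))
```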
2.3 ddRADseq data analyses
The quality of the raw sequence data was examined using FastQC. The mean phred quality score was 37.9 and more than 90% of the reads had a mean score ≥30. The total number of paired reads in the BfaI/EcoRI raw data was 143,893,548 (117 samples), while in the MseI/EcoRI raw data it was 143,444,471 (125 samples). We removed the excess samples from the MseI/EcoRI dataset to have a comparable set of samples in the two datasets. The BfaI raw data were then subsampled (reads were randomly excluded) to have a comparable number of reads in both datasets. These two datasets were then analysed in the same way as the literature survey datasets using Trimmomatic and FLASH. The command process_radtags.pl from the Stacks2 pipeline (Rochette et al., 2019) was used to demultiplex the data. The resulting reads were mapped to the Camponotus floridanus genome (NCBI accession number: GCA_003227725.1; Shields et al., 2018) using Bowtie2 (Langmead & Salzberg, 2012). Using Ref_map.pl from Stacks2, the mapped sequences were further processed to call SNPs. The details on the number of reads at each step are given in Table S1.
3 RESULTS AND DISCUSSION
Adaptor contamination and read overlaps are major factors reducing the efficiency of sequencing. For example, in a previous study in our lab (Inbar et al., in prep) we constructed ddRADseq libraries using samples of the ant Cataglyphis niger, with the enzyme combination EcoRI/MseI. This enzyme choice resulted in read overlaps in 74% of read pairs (fragments shorter than two read lengths), of which 16% had adaptor contamination (fragments shorter than one read length; see Figure 2a). Adaptor contamination was limited to 16% because of the bead clean-up step in the library construction protocol, which excludes a significant proportion of short fragments. And yet, 27% of the sequencing yield in this study was wasted on adaptor contamination and read overlaps. In other words, the sequencing efficiency was only 73% in this study, where efficiency is defined as the percentage of sequenced bases that is not wasted on overlaps or adaptor contamination. Many researchers try to deal with this issue using a more precise size-selection method, such as manually cutting fragments of a certain size range from a gel or using instruments like Pippin Prep, to exclude short fragments from the library.

We carried out a literature survey to understand the range of size selection criteria and commonly used enzymes in ddRADseq protocols. We reviewed a total of 66 studies published in the journals Molecular Ecology and Molecular Ecology Resources in the years 2015–2021 where ddRADseq data were generated using paired-end sequencing (Appendix S3). Of these, 38 studies (58%) used some form of automated gel electrophoresis-based size selection, such as BluePippin or Pippin Prep, 10 studies (15%) used only magnetic beads for size selection (nine Ampure XP and one Sera-mag magnetic beads), five studies (8%) used manual size selection from a gel, one study used a Chroma spin column and 12 studies (18%) did not report the size selection method used. Ten studies (15%) chose a size selection range with the lower cut-off below two read lengths, intentionally including overlapping read pairs. Most studies used a broad size selection range that was wider than 100 bp (40 studies: 61%). Among the enzyme pairs, 33 studies (50%) used a combination of a 6-base and a 4-base cutter, 14 studies (21%) used an 8-base and a 4-base cutter, 10 studies (15%) used two 6-base cutters, 7 studies (11%) used two 4-base cutters and 2 studies (3%) used an 8-base and a 6-base cutter. EcoRI/MspI was the most used enzyme pair (10), followed by EcoRI/SphI (8), MluCI/NlaIII (6), SbfI/MseI (6), SbfI/MspI (6) and SphI/MluCI (5) (Table S2). Although all studies reported the enzymes used in the ddRADseq protocol, key details necessary to evaluate the experimental design were missing in some of the studies. For example, 12 studies (18%) failed to mention the size-selection method used in the protocol, which led us to eliminate them from the subsequent meta-analysis.
We then conducted a meta-analysis, reanalysing and comparing 12 ddRADseq datasets from diverse organisms to evaluate the extent of adaptor contamination and read overlaps. These 12 studies generated highly diverse datasets demonstrating a range of possible outcomes from ddRADseq (see representatives in Figure 2d–i; full results in Figure S1), which depend on a combination of factors in the experimental design, including the genome of the study organism, the choice of restriction enzymes, size selection method and cut-offs, and other technical variations in the protocol. Therefore, we conducted a controlled comparative study to examine the effect of alternative enzyme choices on the resulting adaptor contamination, overlapping reads, and consequently, efficiency of sequencing. We generated two sets of libraries for the same set of samples with two different frequently cutting enzymes. Both sets of libraries were built using the same rare cutter (EcoRI), but the frequent cutter (MseI) was switched to an enzyme with a more GC rich recognition site (BfaI), because of which it cut the genome less frequently. Thereby, we increased the average fragment size in the digestion products (Figure 2b,c).
The first major factor differentiating all these studies was the in silico predicted distribution of fragment sizes. The results of the meta-analyses and our controlled experiment revealed high variation in the shape of the in silico distributions, dramatically differing in their skewness. Studies that used two 4-base cutters or a combination of 4- and 6- base cutters had most of the expected distribution in the size range of one read length, which would result in high levels of adaptor contamination. For example, Figure 2e (using NlaIII/MluCI, both 4-base cutters) and Figure 2b (our EcoRI/MseI experiment, 6- and 4-base cutters) had 85% and 71% of the in silico distribution below one read length, respectively. This skewness was mitigated using a 4-base cutter that had high GC content, for example, Figure 2c (EcoRI/BfaI with 50% GC in BfaI) and Figure 2f (EcoRI/MspI with 100% GC in MspI) had 28% and 47% in silico adaptor contamination, respectively. Using two 6-base cutters (Figure 2h) or an 8- and 4- base cutter (Figure S1k) also resulted in similarly lower proportions of fragments below one read length, that is, 25% and 21%, respectively. These results show that the initial fragment size distribution, before any size selection, is highly dependent on the length and the GC content of the enzyme recognition site.
We further observed considerable variation in the efficacy of size selection in eliminating short fragments. In theory, one might expect no adaptor contamination and read overlaps if the size selection cut-off is larger than two read lengths. However, in practice, we observed a large proportion of sequencing reads with overlap and adaptor contamination in some of the studies. For example, the study in Figure 2g chose a size selection cut-off that was bigger than two read lengths, and yet 10% of the sequencing output had read overlaps and 2.1% had adaptor contamination. Some studies intentionally allowed read overlaps but had a size selection cut-off just bigger than one read length. In these cases, although adaptor contamination was not expected, large proportions of adaptors were observed (e.g. 13% in Figure 2f). These results demonstrate that a large proportion of short fragments might pass through the size selection procedure, even when using the most precise instruments available. This issue is worse when the in silico distribution is more skewed, that is, having a larger proportion of short fragments before size selection. This effect explains the difference observed in our controlled experiment, with 22% adaptor contamination in Figure 2b (EcoRI/MseI) and only 5% in Figure 2c (EcoRI/BfaI), corresponding to the difference in their in silico proportions of 71% and 28%, respectively.
The second major factor differentiating the ddRADseq studies was the width of the size selection window (ranging from 36 to 520 bp). Half of the studies used a narrow size selection range (the minimum width possible under ‘tight mode’, which corresponds to a target size ±16% according to the 2018 BluePippin manual). Unlike the studies that used a broad range, this approach effectively resulted in negligible waste of sequencing due to adaptor contamination and read overlap (Figure 2g–i, Figure S1k,n,o). Using a narrow size selection window also mitigated the effect of poor enzyme choice. For example, using a combination of 6- and 4- base cutters with low GC content in Figure 2g (311–347 bp) and Figure 2i (205–265 bp) resulted in 2% and 0% adaptor contamination and 10% and 1% read overlap, respectively. However, its drawback was that a relatively small number of genomic loci were sequenced, typically resulting in a large proportion of sequence duplication, which is another kind of waste (data not shown).
Size selection not only leads to variable outcomes, but it also appears to be imprecise. In our controlled experiments (Figure 2b,c), we see a shift of about 50 bp between the intended size selection range and the outcome in the empirical data. A similar shift of about 40 bp can be observed in Figure 2d. Despite these shortcomings of size selection, we see a considerable effect of size selection in enriching fragments within the desired range. For studies shown in Figure 2b–f, size selection cut-offs were below two read lengths, which allows us to observe the proportion of the empirical distribution within the size selection range. Size selection in these studies resulted in a much larger proportion of fragments in the selected range, higher than that expected based on the in silico distribution, indicating that size selection was successful in enriching target fragment sizes.
We also observed an effect of the distance between the size selection cut-off and the read length on the efficacy of size selection. For example, both studies in Figure 2f and Figure 2g had a similar proportion of predicted fragments below one read length (46% vs. 47%). In Figure 2g, the size selection range begins above two read lengths while in Figure 2f the lower cut-off is just larger than one read length. This resulted in only 2% adaptor contamination in Figure 2g compared to 13% in Figure 2f, suggesting that as the lower cut-off comes closer to the one read length line, adaptor contamination increases. This might be due to the tendency of some short fragments to migrate with slightly longer fragments in gel electrophoresis. However, choosing a high selection cut-off might not necessarily be the optimal choice, because it might dramatically reduce the number of loci in the sequencing output.

One of the differences between our two controlled experiments, other than the frequently cutting enzyme, was the size selection cut-off. In the EcoRI/MseI experiment, the in silico distribution predicted close to 90% of the fragments to be below 300 bp compared to only 48% in EcoRI/BfaI. While selecting a cut-off above 300 bp would result in efficient sequencing with low wastage, this would also exclude ~70% of the genomic sites from being sequenced (only 5 million out of a potential 18 million genomic sites would be sequenced for EcoRI/MseI; Figure 4). Therefore, we considered the trade-off between genomic sites sequenced and sequencing efficiency, defined as the proportion of sequenced bases not wasted due to adaptor contamination or read overlaps (see Section 2 for details). As the lower cut-off moves below 300 bp, we allow some read overlaps and sequencing efficiency drops. This drop is more pronounced for MseI than BfaI (dashed lines in Figure 4) in line with their different in silico distributions. Hence, different cut-offs were chosen for the two experiments to optimize this trade-off: 170 bp for EcoRI/MseI and 200 bp for EcoRI/BfaI.
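The trade-off between the number of fragments retained and sequencing efficiency can be illustrated with a short calculation. This is a sketch under simplifying assumptions that are not the model described in Section 2: size selection is treated as perfect, every selected fragment is sequenced equally with paired-end reads, and the bases that a pair reads beyond the fragment or reads twice are counted as waste. All function names are hypothetical.

```python
def wasted_bases(fragment_length, read_length):
    """Bases in one read pair lost to adaptor read-through or read overlap."""
    sequenced = 2 * read_length
    informative = min(fragment_length, sequenced)
    return sequenced - informative


def sequencing_efficiency(fragment_lengths, read_length):
    """Proportion of sequenced bases that are not wasted."""
    total = 2 * read_length * len(fragment_lengths)
    if total == 0:
        return float("nan")  # nothing selected, efficiency undefined
    wasted = sum(wasted_bases(n, read_length) for n in fragment_lengths)
    return 1 - wasted / total


def tradeoff(fragment_lengths, read_length, lower_cutoffs, upper_cutoff):
    """For each lower cut-off: (cut-off, fragments kept, efficiency)."""
    rows = []
    for lower in lower_cutoffs:
        kept = [n for n in fragment_lengths if lower <= n <= upper_cutoff]
        rows.append((lower, len(kept), sequencing_efficiency(kept, read_length)))
    return rows
```

Calling `tradeoff` on an in silico fragment length distribution with a range of lower cut-offs shows the same qualitative pattern as Figure 4: lowering the cut-off keeps more fragments but reduces efficiency, and the reduction is steeper when the distribution is more skewed toward short fragments. Because the sketch assumes perfect size selection, it gives an optimistic upper bound rather than the efficiency observed empirically.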

Theoretically, we expected an efficiency of 92% in EcoRI/MseI (for the observed sharp drop in the empirical distribution at 210 bp in Figure 2b) and 98% in EcoRI/BfaI (at 240 bp in Figure 2c) based on the Excel spreadsheet calculator (Appendix S1). However, because imperfect size selection allows unintended shorter fragments to pass through, we observed lower sequencing efficiency. Moreover, the considerably lower sequencing efficiency of EcoRI/MseI relative to EcoRI/BfaI reflects the larger number of short fragments produced by EcoRI/MseI, as predicted from the in silico distribution. The empirical data from the controlled experiment revealed a sequencing waste of 27% (sequencing efficiency of 73%) in EcoRI/MseI compared to 2.7% in EcoRI/BfaI (sequencing efficiency of 97%) as a result of adaptor contamination and read overlaps.
Interestingly, we observed a higher variance of sequencing depth for EcoRI/MseI than for EcoRI/BfaI (Figure 5a). Accordingly, if we do not set any threshold on the minimum depth, we have 8,003,694 unique genomic sites sequenced with EcoRI/MseI compared to 5,404,131 with EcoRI/BfaI. However, when we require a minimum depth of 10X, as is common practice (the --minDP parameter in vcftools), fewer genomic sites are usable in EcoRI/MseI (513,891) than in EcoRI/BfaI (737,766) (Figure 5b). Therefore, the enzyme choices that give a more uniform fragment size distribution ultimately result in a larger number of genomic sites that can be genotyped (for a given sequencing budget, and with a commonly used depth cut-off).
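The site counts reported above can be turned into the fraction of sequenced sites that remain usable at 10X, which works out to roughly 6% for EcoRI/MseI and 14% for EcoRI/BfaI. The sketch below performs this arithmetic and then uses two made-up depth vectors with the same total number of reads to illustrate why higher depth variance leaves fewer usable sites. The helper names and toy depths are hypothetical, and the per-site threshold is a simplification of how vcftools applies --minDP to individual genotypes.

```python
def usable_fraction(sites_total, sites_at_min_depth):
    """Fraction of sequenced sites that pass the minimum depth filter."""
    return sites_at_min_depth / sites_total


def count_sites_at_depth(depths, min_depth=10):
    """Number of sites whose depth is at least `min_depth`."""
    return sum(1 for d in depths if d >= min_depth)


# Site counts reported in the text (no threshold vs. minimum depth of 10X).
msei = usable_fraction(8_003_694, 513_891)
bfai = usable_fraction(5_404_131, 737_766)
print(f"EcoRI/MseI usable at 10X: {msei:.1%}")
print(f"EcoRI/BfaI usable at 10X: {bfai:.1%}")

# Toy illustration (hypothetical depths): both libraries receive 120 reads.
even_depths = [12] * 10               # 10 sites, uniform depth
skewed_depths = [55, 47] + [1] * 18   # 20 sites, highly variable depth
even_usable = count_sites_at_depth(even_depths)
skewed_usable = count_sites_at_depth(skewed_depths)
```

In the toy case, the skewed library covers twice as many sites but retains far fewer of them after filtering, which mirrors the contrast between the two enzyme pairs.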

The insights from the above analyses were implemented in the webtool ddgRADer to aid researchers in optimizing ddRADseq experimental design (described in the beginning of this manuscript). A few previously published tools were already available for researchers. SimRAD (Lepais & Weir, 2014) is designed to choose between different reduced representation library protocols by predicting the number of loci that can be generated. It performs in silico digestion and fragment size-selection, allowing the user to choose the enzyme and size-selection criteria. Another program that carries out a similar analysis is DDRADSEQTOOLS (Mora-Márquez et al., 2017). In addition to in silico digestion, this tool also simulates the effect of allele dropout and PCR duplicates on the coverage, taking into account potential sources of error. Both SimRAD and DDRADSEQTOOLS are R packages. RADinitio (Rivera-Colón et al., 2021) is a Python program that simulates population-level data based on a demographic model input by the user over a reference sequence. The user can implement various reduced representation protocols, in silico digestion, size selection, library amplification, and sequencing depth, while considering errors such as allelic dropout. Additionally, RADinitio carries out retrospective simulation, where an in silico library is generated based on an empirical RADseq dataset. Thus, RADinitio takes into account many more variables, particularly the inherent variability in natural populations that contributes to the quality of the data.
While some of these tools help optimize experiments by taking into account allele dropouts and PCR duplication, none of them helps make the experiment cost effective by optimizing the sequencing efficiency. Moreover, they assume perfect size-selection and do not take into account its incomplete effectiveness, which results in additional adaptor contamination and read overlaps. This is an important factor to consider when selecting the enzyme pair because in certain cases it has a dramatic effect on sequencing efficiency, the expected number of SNPs, the sequencing depth, and the number of samples that can be multiplexed per lane. Additionally, all of these programs are command-line tools and might require some skill in R, which would involve a steep learning curve, particularly for researchers without prior experience with command-based programs.
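For readers unfamiliar with what in silico digestion and size-selection involve, the core idea can be sketched in a few lines. This is an illustrative simplification and not the implementation used by any of the tools named above or by ddgRADer: it scans only one strand, measures fragments between cut positions while ignoring overhang details, and applies a perfect size filter. The recognition sites are the standard ones for EcoRI (G^AATTC), MseI (T^TAA) and BfaI (C^TAG); the function names are hypothetical.

```python
import re

# Recognition sequence and cut offset within the site for each enzyme.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "MseI": ("TTAA", 1),     # T^TAA
    "BfaI": ("CTAG", 1),     # C^TAG
}


def cut_positions(seq, enzyme):
    """Positions in `seq` at which `enzyme` cuts (lookahead finds overlapping sites)."""
    site, offset = ENZYMES[enzyme]
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, rare, frequent):
    """Lengths of fragments flanked by one cut site of each enzyme."""
    cuts = sorted(
        [(p, rare) for p in cut_positions(seq, rare)]
        + [(p, frequent) for p in cut_positions(seq, frequent)]
    )
    lengths = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right:  # discard fragments cut by the same enzyme at both ends
            lengths.append(end - start)
    return lengths


def size_select(lengths, lower, upper):
    """Idealized size selection: keep fragments within [lower, upper] bp."""
    return [n for n in lengths if lower <= n <= upper]
```

Applied to a reference genome, the output of `double_digest` is the in silico fragment length distribution discussed throughout this section. The `size_select` step makes explicit the assumption of perfect size selection that the empirical results show to be unrealistic.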
The easy-to-use graphical user interface of ddgRADer makes it accessible even to a beginner, considerably reducing the time taken to design ddRADseq experiments. Further, multiple popovers guide the user through the process and additional documentation is included to help in decision making. Thus, this tool can aid researchers who do not yet have experience with or an in-depth understanding of ddRADseq protocols and the associated issues. ddgRADer explicitly incorporates the error associated with size-selection by accounting for additional shorter fragments that result in adaptor contamination and read overlaps. This helps the user make informed enzyme choices that take sequencing efficiency into account. Although enzymes are not very expensive, synthesizing the multiplexing barcodes requires a large initial investment. These barcodes attach to the overhang generated by the restriction enzyme, making enzyme choice a consequential decision. Additionally, ddgRADer allows the user to compare several enzyme pairs and the size-selection cut-offs that best suit each of these pairs. The interactive size-selection slide bar helps the user understand the trade-off between the number of genomic sites in the digested sample and sequencing efficiency, as a function of different size-selection cut-offs.
4 CONCLUSION
Here, we demonstrate using empirical data and controlled experiments that enzyme choice greatly affects inadvertent inclusion of adaptor contamination and read overlaps, which can result in substantial waste of sequencing effort. This effect can be mitigated using a narrow size-selection range. However, we observed that the efficacy of size-selection is variable across studies and can be imprecise. Therefore, enzyme choice is crucial in determining sequencing efficiency and optimizing ddRADseq experimental design. We provide ddgRADer, a user-friendly webtool that guides beginners through considering alternative enzyme choices and optimizing size-selection. The tool provides realistic predictions about the expected outcomes from ddRADseq, based on our empirical characterization of size-selection efficacy in a meta-analysis across diverse studies from the literature as well as our controlled experiments. ddRADseq is a powerful and efficient tool to address problems in evolutionary and conservation biology, when using an appropriate experimental design. This new webtool will make ddRADseq more accessible and facilitate the smooth transition of more researchers into the field of population genomics.
ACKNOWLEDGEMENTS
We thank Jessica Purcell and Alan Brelsford for advice on ddRADseq methodology, Pnina Cohen for help in sample collection, and Viraj Torsekar for discussions and advice on statistical analyses. AL thanks Praveen Karanth for providing lab space. We are grateful to the reviewers for their insightful comments that improved the manuscript considerably. This study was funded by US-Israel Binational Science Foundation (Grant no. 2017319).
CONFLICT OF INTEREST STATEMENT
The authors have declared no conflict of interest for this article.
BENEFIT-SHARING STATEMENT
This is not applicable.
Open Research
DATA AVAILABILITY STATEMENT
Genetic data: Raw sequence reads have been deposited in the Sequence Read Archive (BioProject PRJNA999052; Lajmi et al., 2022).