Enriching barcoding markers in environmental samples utilizing a phylogenetic probe design: Insights from mock communities
Cristiano Vernesi and Laura Parducci share joint senior authorship.
Abstract
Hybridization capture is an emerging method that uses short oligonucleotide baits to enrich DNA libraries for genomic fragments of specific organisms, thus enabling detection of their presence in environmental samples. Although it offers a primer-independent alternative to metabarcoding, little empirical work has been dedicated to characterizing the underlying biases and their implications for biological interpretation. Moreover, few published bioinformatic pipelines are available for designing polynucleotide capture baits from a reference sequence collection. We designed RNA baits specifically targeting two chloroplast barcoding genes, matK and rbcL, to reveal the plant taxonomic diversity present in a given environmental sample. Our approach leverages the sensitivity of hybridization capture and the capacity of high-throughput DNA sequencing instruments. It builds on a new and universal method based on ancestral sequence reconstruction, which limits the number of baits required and reduces experimental costs while still accessing high levels of taxonomic diversity. Our bait set selectively targets four main plant orders (Fagales, Pinales, Asterales, and Poales), representing ~18% of all described vascular plants. This is achieved with only 4084 baits, each 80 nucleotides in length (80-mer), capturing ~1.0–1.6 k nucleotide sequences from each taxon. Tests on mock communities revealed important factors influencing capture efficiency and relative abundance estimates, including GC-content, the overall target length per taxon, the bait density, and the mean number of mismatches to the bait sequence. Our results show that hybridization capture, like metabarcoding, requires caution when interpreting results quantitatively within (paleo)-ecological studies. The biases detected in this work could be mitigated with bait designs that avoid extreme base compositional biases and balance bait targets across taxa. However, we strongly recommend the use of mock communities and read simulations to quantify the accuracy of taxonomic representation when using new bait designs.
1 INTRODUCTION
Over the last decade, environmental DNA (eDNA) has emerged as a new and powerful tool for cost- and time-effective characterization of the biological diversity present in a given ecosystem community (Ruppert et al., 2019; Taberlet et al., 2018). The molecular method that first revolutionized the field is metabarcoding, and still represents the most widely applied approach. Metabarcoding is based on PCR amplification and high-throughput sequencing of short regions of a gene (i.e., amplicons), whose sequence acts as a diagnostic (mini)-barcode for taxonomic identification (Meusnier et al., 2008). Taxonomic assignments are either based on sequence comparison to a reference database or alternatively, especially for microorganisms, on pre-clustering of reads into mOTUs (molecular operational taxonomic units) or ASVs (amplicon sequence variants) (Eren et al., 2013; Taberlet et al., 2012). Although no universal standards have been proposed, the underlying laboratory and bioinformatic toolkit for metabarcoding is relatively well established and streamlined, with many published laboratory protocols and bioinformatic pipelines and workflows (Taberlet et al., 2018). Important biases and limitations are fairly well characterized, and extensively studied (Nichols et al., 2018; Rodriguez-Martinez et al., 2022; Zinger et al., 2019), mostly in relation to PCR amplification, which can selectively over- or under-amplify certain alleles due to reduced polymerase fidelity, variable amplicon lengths, polymerase processivity linked to GC-content (Nichols et al., 2018) and formation of secondary structures of the targets (steric hindrance). The marker amplicon length is another potential hindrance, particularly reducing PCR success when analyzing environmental samples where DNA tends to be highly fragmented and degraded (e.g., ancient DNA). 
There is a trade-off between amplification success (and therefore taxonomic coverage) and the length and diagnostic value (i.e., taxonomic resolution) of the amplified region; that is, DNA fragments too short to accommodate both primer binding sites will not be amplified. Despite such limitations, metabarcoding remains a powerful and widely used method due to the increasing availability, and in some cases, relatively complete reference databases, low sequencing and laboratory costs, and general sensitivity (Taberlet et al., 2018).
Shotgun metagenome sequencing is an alternative method that aims to sequence all fragments in an eDNA sample using either single- or double-stranded library preparations (Gansauge et al., 2017; Meyer & Kircher, 2010). The approach is very powerful; for example, it has revealed microbial, plant, and animal diversity in samples up to 2 million years old (Fernandez-Guerra et al., 2023; Kjær et al., 2022) where PCR metabarcoding did not yield positive results. In microbiome analyses, shotgun metagenomics can provide highly resolved insights into communities where the majority of the microbial populations are unculturable (Bendall et al., 2016; Frémont et al., 2022; Nayfach et al., 2021; Richter et al., 2022). One of the key advantages of shotgun sequencing applied to ancient samples is the capacity to disentangle ancient templates from modern contaminants on the basis of typical sequence signatures of postmortem DNA damage (Briggs et al., 2007; Jónsson et al., 2013). However, shotgun sequencing is costly and requires a higher sequencing effort to detect low-abundance organisms. Additionally, in most studies conducted to date, over 90% of the reads remain unassigned due to the absence of reference sequences in relevant databases (Ahmed et al., 2018; Graham et al., 2016; Parducci et al., 2019; Pedersen et al., 2016; Wang et al., 2021). This percentage may nonetheless vary depending on the field, with some prokaryote studies reporting 82% successful mapping of shotgun metagenomic reads (Mthethwa-Hlongwa et al., 2024).
Hybridization capture (also called “target capture”) enriches target reads by focusing subsequent metagenomic sequencing efforts on nucleic acid strands captured by a predefined collection of baits covering taxonomically informative biomarkers (Armbrecht et al., 2021; Murchie, Kuch, et al., 2021; Murchie, Monteath, et al., 2021; Schulte et al., 2022; Slon et al., 2017; Vernot et al., 2021). This method requires prior knowledge of the targeted sequence diversity, as well as of non-targets, to enable rational design of complementary and specific RNA or DNA baits (also called probes) that retrieve all the targets of interest. More specifically, synthesized polynucleotide baits are biotinylated and anneal to complementary eDNA metagenomic library templates. The bait-template hybrids bind to streptavidin-coated paramagnetic beads, which are immobilized with a magnet while non-targeted library molecules are washed away. The immobilized library molecules are then eluted, amplified, and sequenced. Multiple studies have successfully applied hybridization capture to enrich DNA library content for both full mitochondrial genomes (Slon et al., 2017) and shorter barcode genes (e.g., Armbrecht et al., 2021; Foster et al., 2021; Lentz et al., 2021; Murchie, Kuch, et al., 2021; Murchie, Monteath, et al., 2021). For example, Slon et al. (2016, 2017) designed a set of baits to co-analyze the presence of 242 full mammalian mitochondrial genomes in cave sediments. Assuming sufficient target length, the approach preserves sequence signatures of ancient DNA damage patterns, which are paramount to data authentication. Moreover, the retrieval of fragments is not constrained by the availability of primer binding sites or by other PCR biases.
Despite these potential advantages, only a few bait designs are readily available, such as the “PalaeoChip Arctic-1.0 baitset,” which is designed to capture full mitochondrial genomes supplemented with commonly used barcoding genes such as COI, cytb, 12S, and 16S for ~180 vertebrate taxa and matK, rbcL, and trnL for over 2500 plant taxa occurring in Quaternary Arctic and boreal environments (Murchie, Kuch, et al., 2021). Such designs tend to be developed and validated only for particular biomes or regions since there are experimental and economic constraints on the number of baits that can be synthesized. To mitigate these issues, many bait designs reduce the number of capture sequences by clustering oligos within a set genetic distance, such as a 96% identity score (Murchie, Kuch, et al., 2021). Nevertheless, these approaches might bias the results because the retained baits are most similar to the allele that was used as cluster centroid.
A growing number of studies report factors limiting the efficacy of hybridization capture (Chilamakuri et al., 2014; Cruz-Dávalos et al., 2017; Suchan, Chauvey, et al., 2022; Suchan, Kusliy, et al., 2022), identifying GC-content, complexity, read length, and reference bias as affecting target capture outcomes. Less attention has been given to the specific biases of this method when enriching targets in environmental samples. Therefore, the extent to which the full taxonomic diversity originally present in the eDNA metagenomic library is reflected in hybridization capture datasets is presently unknown. This study was designed to fill this gap in knowledge. First, we developed a universal method for designing bait sequences using online barcode sequence repositories. We then designed a bait panel targeting the chloroplast rbcL and matK genes from four plant orders (Fagales, Pinales, Asterales, and Poales), representing approximately 18% of the total number of accepted species in the World Checklist of Vascular Plants (Govaerts et al., 2021). We experimentally tested the performance of our bait set using mock communities, and identified important biases impacting downstream abundance estimates and ecological interpretation by comparing “unbiased” shotgun and target-enriched sequencing data from these mock communities.
2 MATERIALS AND METHODS
2.1 Ancestral sequence reconstruction
All available matK and rbcL sequences, totaling 22,103 and 28,148 sequences respectively, were downloaded for the plant orders Fagales, Pinales, Asterales, and Poales from BOLD (October 18, 2019; Table 2) using the BOLD package (v0.8.6) in R (Chamberlain, 2021). Multiple sequence alignments (MSA) were made for each plant order and marker independently, using MAFFT with default settings and --adjustdirection as an additional flag (Katoh et al., 2002). Sequence alignments were visually checked to remove poorly aligned and low-quality sequences (e.g., sequences with frameshifts or non-coding triplets), and manually corrected for nucleotide calling errors when the original electropherograms were available from BOLD.
The BOLD database contains reference sequences covering either the complete or partial length of the standardized barcodes. Due to variations in primers and sequence trimming, the start- and endpoints of these records differ. For bait design, we focused on the section of the MSA that was represented by the most reference sequences, to avoid enriching for parts of the barcode gene that have little reference material. To determine the most suitable windows, we investigated the number of sequences and alignment length by sub-setting the MSA at the first and last position where a minimal percentage of the reference sequences had data (from 0 to 0.99 with 0.1 increments). Our approach aimed to maximize the number of sequences with the longest possible gene fragment size by multiplying the proportion of sequences that covered the selected window by the length of the window (Figure 1). After size trimming, rbcL and matK alignments were visually checked once again, and corrected for gaps based on the protein sequences. Finally, the alignments were concatenated into longer sequences, maximizing phylogenetic signal for downstream bait design. Taxa for which only one marker was available were excluded (leaving: N = 1721 and N = 2539, for matK and rbcL respectively, Table 2).
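The window-selection criterion described above (proportion of covering sequences multiplied by window length) can be illustrated with a minimal Python sketch. It assumes that each aligned reference sequence has been reduced to the first and last alignment column where it holds data; the function name `best_window` and its input format are hypothetical and do not reproduce the published implementation.

```python
def best_window(extents, thresholds):
    """Pick the alignment window maximizing (proportion covering) x (length).

    extents    -- list of (first, last) alignment columns (0-based, inclusive)
                  where each reference sequence has data (assumed input format)
    thresholds -- minimal fractions of sequences that must have data at the
                  window boundaries
    Returns (score, start, end, proportion_covering) or None.
    """
    n = len(extents)
    n_cols = max(last for _, last in extents) + 1
    # Number of sequences holding data at each alignment column.
    depth = [sum(first <= col <= last for first, last in extents)
             for col in range(n_cols)]
    best = None
    for t in thresholds:
        cols = [c for c, d in enumerate(depth) if d / n >= t]
        if not cols:
            continue
        start, end = cols[0], cols[-1]
        length = end - start + 1
        # Proportion of sequences that span the whole candidate window.
        covering = sum(f <= start and l >= end for f, l in extents) / n
        score = covering * length
        if best is None or score > best[0]:
            best = (score, start, end, covering)
    return best


# Toy example: four sequences with different start- and endpoints.
extents = [(0, 99), (10, 89), (10, 89), (20, 79)]
print(best_window(extents, [0.0, 0.5, 0.75, 1.0]))
```

In this toy example the full alignment (100 columns) is spanned by only one of four sequences, so a shorter window covered by more sequences scores higher.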

To avoid over-representing the most abundant sequences and to decrease redundancy, we decided to design bait sequences on reconstructed ancestral sequences. This method has been previously used to capture the mitogenomes of extinct animal species (glyptodont), for which no close relative exists (Delsuc et al., 2016). Phylogenetic trees were generated for each order using MrBayes (Ronquist et al., 2012) using the GTR substitution model (rates = invgamma). Bayesian MCMC analysis was run until the average standard deviation of split frequencies approached its lower plateau (~0.02–0.12, i.e., for 4e5 to 1e6 generations). Sample frequency was set to 10. MrBayes MCMC analysis was summarized (3e5 burn-ins, 1e6 generations), and a majority rule consensus tree was created. HyPhy was then used to fit the sequence data with the consensus tree using the GTR model (standard MG94 fit), before reconstructing ancestral sequences for all internal nodes of the consensus tree (Pond et al., 2005). The gene trees were obtained at order level, except for Poales and Asterales for which many reference sequences were available. For these orders, the reference sequences were split into clades that group families (e.g., Graminid, Cyperid, and Graminid for Poales). For families with a large number of species and sequences, only one species per genus was randomly selected.
2.2 Selecting ancestral sequences for bait design
Bait-template annealing remains effective for target enrichment despite the presence of up to 10%–13% substitutions (Mason et al., 2011; Paijmans et al., 2016; Peñalba et al., 2014). Therefore, it is not necessary to consider all phylogenetic nodes for capturing the large diversity of plant species found within the target orders. To decrease redundancy, we selected nodes where tips showed at most 9% nucleotide dissimilarity to the most distal ancestral sequence. Because the amount of nucleotide variation differs between gene regions, nodes were selected within moving windows 80 nucleotides in length, representing the bait size. This was done by sub-setting the ancestral and tip sequences into 80-base windows and calculating the distance from each internal node sequence to all connecting tips. The procedure was repeated until all tip sequences were matched to an ancestral node, and until the whole gene was covered with a one-base tiling. In cases where sequences were too distant from their closest ancestral node, the tip sequence was used for bait design. This procedure resulted in a multifasta file including sequences at those selected nodes and tips, which was used for designing 80-base long baits with 3× tiling. Finally, baits with >83% overlap and 100% identity were collapsed to a single random seed sequence with cd-hit (Li & Godzik, 2006). This was done to reduce the number of baits designed on ancestral nodes that showed high similarity. Furthermore, bait candidates showing more than 25% of base repeats were removed by myBaits®, Arbor Biosciences (USA) before bait array synthesis to filter for low complexity regions. The overall framework is presented in Figure 1, and an updated version of its computational implementation in Python can be accessed at https://github.com/Kevinnota/gotcha.
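As an illustration of the per-window selection rule, the following Python sketch picks, for one tip and one 80-nucleotide window, the most distal ancestral node that stays within 9% dissimilarity, and falls back to the tip sequence otherwise. It is a simplified sketch under stated assumptions: the helper names (`p_distance`, `pick_source`) are hypothetical, the ancestors are assumed to be supplied as a root-to-parent list, gaps are treated like any other character, and the tree traversal of the actual implementation is not reproduced.

```python
def p_distance(a, b):
    """Proportion of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pick_source(tip, ancestors, start, bait_len=80, max_dist=0.09):
    """Return the name of the sequence a bait would be designed on.

    tip       -- (name, sequence) of the tip
    ancestors -- list of (name, sequence), ordered from the root to the
                 tip's parent (assumed input format)
    start     -- first alignment column of the window
    """
    tip_name, tip_seq = tip
    tip_window = tip_seq[start:start + bait_len]
    for name, seq in ancestors:  # most distal node is tested first
        if p_distance(seq[start:start + bait_len], tip_window) <= max_dist:
            return name
    return tip_name  # no ancestor is close enough: use the tip itself


tip = ("tip1", "A" * 80)
ancestors = [("root", "C" * 10 + "A" * 70),    # 10/80 = 12.5% dissimilar
             ("parent", "C" * 5 + "A" * 75)]   # 5/80 = 6.25% dissimilar
print(pick_source(tip, ancestors, start=0))
```

Repeating this check for every window start position corresponds to the one-base tiling described above.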
To validate the bait design, we simulated a total of 250,000 reads with gargammel (Renaud et al., 2017), using a size distribution of fragments typical for a degraded sample. The simulated data were restricted to the window of the gene that was used for bait design. The simulated reads were mapped to the bait sequences using bowtie2 (Langmead & Salzberg, 2012) with the --sensitive-local setting. The bam file was parsed using pysam (https://github.com/pysam-developers/pysam) to retain the alignment length, CIGAR string, and number of mismatches. For reads mapping with indels, the CIGAR string was used to identify the longest ungapped alignment and the number of mismatches in that region. Overall, less than 0.5% of the simulated reads did not map to any of the bait sequences, while more than 99% of the mapped reads were within a 9% distance of the baits. A total of 4084 80-mer RNA baits were produced at myBaits®, Arbor Biosciences (USA; for the bait sequences see Data S1).
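Extracting the longest ungapped stretch of an alignment from its CIGAR string can be sketched in plain Python as follows. This is an illustrative example rather than the script used in the study: it parses the CIGAR text directly with a regular expression instead of going through pysam, and the function name `longest_ungapped_block` is hypothetical.

```python
import re

# One CIGAR operation: a count followed by an operation letter (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_ungapped_block(cigar):
    """Return (read_offset, length) of the longest run of aligned bases.

    Aligned bases are the M, = and X operations; any other operation
    (insertion, deletion, clipping, ...) ends the current run.
    """
    best_start, best_len = 0, 0
    pos = 0                      # current position in the read
    run_start, run_len = 0, 0
    for count, op in CIGAR_RE.findall(cigar):
        count = int(count)
        if op in "M=X":
            if run_len == 0:
                run_start = pos
            run_len += count
            pos += count
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
            if op in "IS":       # insertions and soft clips consume read bases
                pos += count
    return best_start, best_len


print(longest_ungapped_block("5S30M2D45M"))
```

The returned offset and length can then be used to slice the read and the matching reference segment and count mismatches within that block only.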
2.3 DNA extraction, mock community preparation, and sequencing
DNA was extracted from fresh leaf tissues collected from nine selected plant taxa (see Table 1). These taxa were selected based on their availability in the laboratory and to encompass a wide range of GC-content. Each plant tissue sample was extracted three times using the DNeasy Plant Mini Kit (Qiagen), following the manufacturer's instructions but with a single elution of DNA in 100 μL Elution Buffer. For each extraction batch, two extraction blanks were run alongside the samples. To generate a high number of on-target template molecules, and to make mixtures of equal concentration of each taxon for investigating capture biases, we amplified the targeted loci for each species and sheared the PCR product before capture. The rbcLa gene was amplified using the primers rbcL-aF and rbcL-aR described in Kress & Erickson (2007). Each PCR reaction contained 1.25 U of GoTaq G2 Hot Start Taq Polymerase (Promega), 10 μL of 5× Green GoTaq Flexi Buffer, 0.4 μM of each primer, 0.2 mM of each dNTP, 2 mM of MgCl2, and 2 μL of template DNA, in a 50 μL reaction. The PCR cycling protocol was as follows: 95°C for 2 min, followed by 40 cycles of 95°C for 30 s, annealing at 50°C (except for Nymphoides peltata, Juniperus communis, and Betula pubescens, for which the annealing temperature was set to 52°C), and 72°C for 90 s, and a final extension of 5 min at 72°C. The matK locus was amplified with the modified primers Plant_matK413f-2 (5′-TAATTTACGATCYATTCATTCAATATTTYC-3′) and matK-1227r-1 (5′-GARGATCCRCTRTRATAATGAGAAAGATTT-3′) from Heckenhauer et al. (2016). Two Pinales species were amplified using the matK-F and matK-R primers described in Kusumi et al. (2000). The PCR mixture and cycling conditions were similar to those used for the rbcLa locus, with the following modifications: the MgCl2 concentration was increased to 3 mM, and the annealing temperature was reduced to 48°C. For the Pinales species, the MgCl2 concentration remained at 2 mM, and the annealing temperature was set to 52°C.
Order | Taxa | rbcL GC-content | rbcL total length^a | matK GC-content | matK total length^a |
---|---|---|---|---|---|
Asterales | Nymphoides peltata | 0.424 | 430 | 0.324 | 585 |
Fagales | Betula pubescens | 0.434 | 430 | 0.339 | 585 |
Fagales | Quercus robur | 0.444 | 430 | 0.349 | 630 |
Pinales | Juniperus communis | 0.427 | 430 | 0.330 | 1130 |
Pinales | Larix decidua | 0.449 | 430 | 0.375 | 1222 |
Poales | Bromus madritensis | 0.434 | 430 | 0.333 | 567 |
Poales | Glyceria notata | 0.444 | 430 | 0.327 | 614 |
Poales | Juncus effusus | 0.406 | 430 | 0.295 | 567 |
Poales | Typha latifolia | 0.424 | 430 | 0.315 | 576 |
- ^a The total length refers to the length of the marker that was enriched for.
The PCR product of each amplification was fragmented using a Covaris M200 Sonicator (Covaris) with microTUBE AFA Fiber Pre-Slit Snap-Cap 6 × 16 mm tubes. Two fragmentation steps were carried out, each at 7°C, with a peak incident power of 75, a duty factor (%) of 25, cycles per burst (cpb) of 215, and an average power of 18.8. The fragmentation durations were 380, 480, and 580 s for rbcL, matK, and matK-Pinales, respectively. The fragmentation efficacy was checked and quantified on an Agilent 2100 Bioanalyzer system (Agilent; Figures S1 and S2). A total of 3.00 × 10^7 DNA copies/μL of each fragmented PCR product were pooled and converted into a double-stranded Illumina library, using the method described in Meyer and Kircher (2010), with the adapters and indexes from NEBNext® Multiplex Oligos for Illumina® (Dual Index Primers Set 1). DNA libraries were PCR indexed using 16 cycles, and quantified four times using the Collibri™ Library Quantification Kit (Invitrogen™), considering 100-fold and 1,000-fold dilutions. All negative controls showed >100-fold lower copy numbers than the sample libraries. Eight identical mixtures were made using 30,000,000 copies of each library. Four of the library mixtures were captured using the myBaits plant bait set designed for this study, following the myBaits version 5.01 manual, with a hybridization temperature of 65°C. All eight mixtures, comprising four captured and four uncaptured samples, were paired-end sequenced on an iSeq 100 Sequencing System (300 cycles, Illumina), which provided a total of 36,186–193,580 reads per mixture.
2.4 Bioinformatic processing and analysis of sequencing data
The raw reads were merged and trimmed using PEAR with a minimal assembly length of 25 nucleotides (Zhang et al., 2014). Duplicated reads were removed with seqkit rmdup (Shen et al., 2016). All reads were mapped to the rbcL and matK reference sequences (excluding primer binding sites) using bowtie2 with the --sensitive setting, and -k set to 5000. Taxonomic assignment was performed with the lowest common ancestor inference tool ngsLCA (Wang et al., 2022), considering all alignments showing at least 95% similarity. All downstream GC and read length distributions were estimated on reads that were assigned to the lowest common ancestor with ngsLCA. To estimate alignment lengths, unique merged reads were mapped against the bait sequences using bowtie2 with the --sensitive-local setting, keeping only the best hit (Langmead & Salzberg, 2012). The alignment length was then calculated over the longest ungapped fraction of each alignment, identified from the mapping CIGAR string with pysam.
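The principle behind lowest common ancestor assignment can be shown with a short Python sketch. It is a conceptual illustration only and does not reproduce how ngsLCA works internally: it assumes that each alignment retained for a read has already been translated into a root-to-tip lineage of equal rank depth, and the function name `lowest_common_ancestor` is hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    lineages -- list of root-to-tip taxon lists, one per reference
                sequence the read aligned to (assumed input format)
    """
    if not lineages:
        return None
    lca = None
    # Walk down the ranks until the lineages disagree.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        lca = taxa_at_rank[0]
    return lca


hits = [
    ["Viridiplantae", "Fagales", "Betulaceae", "Betula pubescens"],
    ["Viridiplantae", "Fagales", "Fagaceae", "Quercus robur"],
]
print(lowest_common_ancestor(hits))  # read is assigned at order level
```

A read matching a single reference is assigned to species, whereas a read matching references from several families of the same order can only be assigned to that order, which is why conserved regions yield coarser assignments.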
Linear regressions were performed with the lm() function from the stats package in R (version 4.3.1), using the post-capture change in relative abundance of reads assigned to the lowest common ancestor as the response variable. The best fitting model was chosen using the stepAIC function from the MASS package (Venables & Ripley, 2002). The predictor variables were calculated from local alignments of shotgun reads to the baits as described above, except that all mappings were retained with the -k flag set to 4000. The GC-content and number of mismatches were calculated over the longest ungapped fraction of the reads. For the GC-content, mismatched bases were ignored since they do not contribute to binding. A median was taken to summarize the GC-content and number of mismatches for each read. The GC values were then averaged over the different taxonomic groups, and a median was used for the number of mismatches. To investigate the relative importance of the predictor variables, the calc.relimp function from the relaimpo package was used (Grömping, 2006). All downstream analyses were performed in R (R Core Team, 2022) and visualized using ggplot2 (Wickham, 2016). All scripts and data are accessible; see the data statement.
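The two-step summary of the predictors (a median per read across its bait alignments, then a mean GC-content and a median mismatch count per taxon) can be sketched as follows. The study performed these analyses in R; this Python sketch only illustrates the aggregation logic, and the input tuple layout and the function name `summarize` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(alignments):
    """Summarize GC-content and mismatches per taxon.

    alignments -- iterable of (taxon, read_id, gc, n_mismatches), one entry
                  per read-to-bait alignment (assumed input format)
    """
    # Step 1: collect all alignments of each read.
    per_read = defaultdict(lambda: ([], []))
    for taxon, read_id, gc, n_mismatches in alignments:
        gcs, mismatches = per_read[(taxon, read_id)]
        gcs.append(gc)
        mismatches.append(n_mismatches)

    # Step 2: one median per read, grouped by taxon.
    per_taxon = defaultdict(lambda: ([], []))
    for (taxon, _), (gcs, mismatches) in per_read.items():
        per_taxon[taxon][0].append(median(gcs))
        per_taxon[taxon][1].append(median(mismatches))

    # Step 3: mean GC and median mismatches per taxon.
    return {taxon: {"GC": mean(gcs), "MMS": median(mismatches)}
            for taxon, (gcs, mismatches) in per_taxon.items()}


rows = [
    ("Betula", "read1", 0.40, 1),
    ("Betula", "read1", 0.50, 3),
    ("Betula", "read2", 0.30, 0),
    ("Larix", "read3", 0.40, 2),
]
print(summarize(rows))
```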
3 RESULTS
3.1 Bait design and validation
In total, 18,663 rbcL and 13,278 matK sequences obtained from BOLD passed initial filtering, together representing 7947 unique plant taxa spread across 41 families (BOLD taxonomic identifiers). A total of 2050 taxa were used for bait design with “gotcha,” which produced a total of 4084 80-mer baits as detailed in Table 2. Of these baits, 3228 (79.0%) covered the matK locus, while 856 (21.0%) covered the rbcL locus. We found that only 1227 of 250,000 simulated reads (0.49%) did not align against any of the 4084 bait sequences. The unmapped reads were significantly shorter than the simulated reads overall (Kolmogorov–Smirnov test, D^- = 0.53428, p-value < 2.2e-16). Overall, ~37.5% of the simulated reads mapped without mismatch against the bait sequences, ~49.8% of the alignments featured one or two nucleotide mismatches, while ~12.1% had three or more mismatches. Of all the reads simulated, ~98.9% aligned over at least 30 nucleotides, and about 86.2% over at least 60 nucleotides. Based on identity score, we estimated that ~99.4% of the simulated reads are within a 9% distance from their respective bait sequences.
Order | BOLD sequences: N Seq (rbcL) | BOLD sequences: N Seq (matK) | BOLD sequences: N genera (rbcL) | BOLD sequences: N genera (matK) | Filtered and fragment trimmed: N filtered Seq (rbcL) | Filtered and fragment trimmed: N filtered Seq (matK) | Filtered and fragment trimmed: N Uniq Seq (rbcL) | Filtered and fragment trimmed: N Uniq Seq (matK) | Ancestral Seq. Rec.: N Seq, uniq taxa (All) | Ancestral Seq. Rec.: N genera (All) | Ancestral Seq. Rec.: Prop Seq used bait design (rbcL) | Ancestral Seq. Rec.: Prop Seq used bait design (matK) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Asterales | 7810 | 6538 | 743 | 885 | 5568 | 3724 | 1037 | 1523 | 537 | 431 | 0.52 | 0.35 |
Fagales | 978 | 1317 | 37 | 38 | 747 | 794 | 139 | 228 | 126 | 26 | 0.91 | 0.55 |
Pinales | 2908 | 2102 | 74 | 74 | 2385 | 900 | 319 | 325 | 251 | 52 | 0.79 | 0.77 |
Poales | 16,452 | 12,146 | 741 | 680 | 9963 | 7860 | 1606 | 2516 | 1136 | 431 | 0.71 | 0.45 |
Total | 28,148 | 22,103 | 1595 | 1677 | 18,663 | 13,278 | 3101 | 4592 | 2050 | 940 | – | – |
3.2 Potential capture biases
The iSeq 100 sequencing run produced a total of 1,321,692 raw reads (128,887 ± 43,902) across the eight DNA libraries sequenced. After paired-end merging, 16.2 ± 0.15% (shotgun) and 22.8 ± 0.6% (capture) reads did not map against any reference sequences. Of these unmapped reads, 85.1 ± 0.9% of the shotgun and 89.3 ± 0.3% of the captured reads mapped using a local alignment, against at least two different reference sequences for more than 85% of the read length. The level of duplication was overall low, with 2.66 ± 0.54% being duplicated reads in normalized shotgun libraries, and 3.51 ± 0.79% in the captured libraries. Over the whole dataset (four libraries combined), the duplication rate of the shotgun libraries was 10.41% versus 13.08% for the captured libraries. Relative duplication rate change was highest in the matK gene with an average of 73.1 ± 22.3% more duplicates in the capture library than in the shotgun library (bins 30–40 to 140–150 nt.), while the relative duplication rate was 24.5 ± 15.9% in the rbcL (bins 30–40 to 140–150 nt.; Figures S4 and S5). The theoretical complexity based on simulations was not exhausted, with only 9.7% of the fragments with a GC-content between 30.0% and 60.0% recovered with the present sequencing effort (see Figure S6).
After capture, the merged and trimmed reads included fewer reads in the 30–70 nucleotide size range than the shotgun libraries (Figure 2). A minor change in read length distribution was also observed between 70 and 100 nucleotides, together with an increase in templates longer than 110 nucleotides (Figure 2a). We found that the number of reads showing a 20–45 nucleotide overlap with a bait sequence decreased by more than half in capture experiments as compared to the shotgun reference library. The relative changes between 55 and 70 nucleotides were minimal (~±14%), while the number of library molecules with a bait annealing over the whole length increased by 38.9%. The proportion of shotgun reads that did not map against any of the bait sequences was ~1.4% and ~11.7% for rbcL and matK, respectively. Post-capture, this proportion was reduced to less than 0.4% for both markers (Figure S7).

We next investigated whether the base composition of the baits influenced capture efficacy, as multiple studies have reported variable on-target enrichment folds according to bait GC-content (Cruz-Dávalos et al., 2017; Suchan, Chauvey, et al., 2022; Suchan, Kusliy, et al., 2022). To test this, we plotted the normalized read counts and the fold change between the captured and uncaptured sequence datasets, considering GC categories in 5% intervals (Figure 2c). Reads with a GC-content between 15% and 30% were depleted post-capture, with relative changes between −65.7% and −31.3%. In contrast, more reads were observed in the 35%–55% GC categories post-capture, while 55%–80% GC targets were under-represented.
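The comparison of normalized read counts per GC category can be sketched in Python as follows. This is an illustrative example under simple assumptions, not the analysis script of the study (which was written in R): reads are binned by GC-content in 5% intervals, counts are normalized to proportions within each dataset, and the relative change is expressed as a percentage of the shotgun proportion. All function names are hypothetical.

```python
from collections import Counter


def gc_bin(seq, width=5):
    """Lower edge (in percent) of the GC bin a sequence falls into."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    # Integer arithmetic avoids floating-point artifacts at bin edges.
    return (100 * gc) // (width * len(seq)) * width


def gc_bin_proportions(reads, width=5):
    """Proportion of reads per GC bin (normalized read counts)."""
    counts = Counter(gc_bin(read, width) for read in reads)
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}


def relative_change(shotgun, capture):
    """Percent change per GC bin, for bins observed in the shotgun data."""
    return {b: 100 * (capture.get(b, 0.0) - p) / p
            for b, p in shotgun.items()}


low, mid = "GCAAAAAAAA", "GCGCAAAAAA"        # 20% and 40% GC toy reads
shotgun = gc_bin_proportions([low, low, mid, mid])
capture = gc_bin_proportions([low, mid, mid, mid])
print(relative_change(shotgun, capture))
```

In the toy data, the 20% GC bin drops from half to a quarter of the reads after capture (−50%), mirroring the kind of depletion reported for low-GC reads.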
3.3 Effect of capture on relative proportion
We found that hybridization capture had an impact on the composition of DNA templates suitable for sequencing. We therefore assessed whether this change in composition could also affect the relative proportions of the different taxa represented in the mock communities. In the shotgun libraries, 55.8 ± 0.2% of the reads were assigned to the rbcL target. However, after capture, the reads assigned to matK increased by 11% and became more abundant (52.7 ± 0.7%; Figure 3b). At the taxonomic level, 89.0 ± 0.1% of the matK shotgun reads were assigned to species, compared to 61.3 ± 0.2% for rbcL. After capture, the proportion of rbcL reads assigned to species remained identical to the shotgun library, while this fraction increased by 4.5 percentage points to 93.5 ± 0.2% for matK. Most of the rbcL reads that were not assigned to species could be assigned to the order level, representing 23.9 ± 0.2% and 25.8 ± 0.4% of shotgun and post-capture reads, respectively.

The relative taxonomic abundances at the lowest possible taxonomic level (for taxa with abundance >0.005) were highly similar between our four independent experimental replicates (Figure S8). We, thus, used the average of the four replicates for downstream analysis. Relative taxonomic abundances showed a moderate, yet significant, correlation between shotgun and post-capture libraries (R = 0.55, p-value = 0.0052; Figure 4a). Investigating the matK and rbcL reads independently, we found a weak, non-significant correlation for the reads assigned to matK (R = 0.36, p-value = 0.28; Figure 4b), but a moderate and significant positive correlation (R = 0.64, p-value = 0.013) was found when the low abundant taxa were included (Figure S9). The relative taxonomic abundances assessed using rbcL reads showed a strong, positive correlation between shotgun and reads post-capture (R = 0.83, p-value <0.01, Figure 4c).

We next examined potential variables influencing apparent taxa distribution patterns in shotgun and captured libraries, including the mean GC-content of the reads assigned to taxa in the shotgun library (GC), the average number of baits overlapping a given read in shotgun libraries (ANB), the median number of mismatches between overlapping reads and baits in shotgun libraries (MMS), the relative taxonomic proportion in shotgun libraries (RPS), and the total length of the captured markers per species (TL; the length for other taxonomic levels was set to 0). Multiple linear regression analysis highlighted that ANB, MMS, and TL were significant predictors of the change in relative abundance post-capture, with TL having the highest relative importance (68.0%), followed by MMS (20.1%) and ANB (11.9%). However, these variables together explained only 50.58% of the observed variation (F = 6.886, 4 and 19 DF, p-value = 0.0013; see Table 3A). When investigating the markers independently, the best fitting model for matK showed that all independent variables were significant (Table 3B), explaining 94.5% of the change in relative abundances (F = 17.46, 5 and 5 DF, p-value = 0.003495). The variables ANB, GC, and TL had a positive effect, while MMS and RPS had a negative effect on the change in relative read counts. The relative importance of TL and GC was the highest, with 40.5% and 29.9%, respectively. The three remaining variables were all nearly equally important, with 11.4% (MMS), 9.4% (ANB), and 8.7% (RPS). The linear regression for the rbcL marker showed that 94.8% (F = 56.1, 4 and 8 DF, p-value < 0.001) of the change could be explained by ANB, with a relative importance of 56.4%, and GC, with a relative importance of 43.6% (Table 3C).
Residuals:
Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
−0.024967 | −0.012274 | 0.002374 | 0.009362 | 0.032493 |
Coefficients | Estimate | SE | t value | Pr(>\|t\|) | Signif. | RI - lmg |
---|---|---|---|---|---|---|
Intercept | 3.40E-03 | 2.12E-02 | 0.161 | 0.874047 | | |
Mean_baits_shotgun (ANB) | 4.1E-04 | 1.86E-04 | 2.223 | 0.038574 | * | 11.9% |
Median_NM_shotgun (MMS) | −1.47E-02 | 5.48E-03 | −2.678 | 0.014872 | * | 20.1% |
Relative_proportion_shotgun (RPS) | −2.52E-01 | 1.66E-01 | −1.513 | 0.146629 | – | |
Capture_length (TL) | 4.3E-05 | 9.73E-06 | 4.464 | 0.000266 | *** | 68.0% |
- Note: Residual standard error: 0.01585 on 19 degrees of freedom. Multiple R-squared: 0.5918, Adjusted R-squared: 0.5058. F-statistic: 6.886 on 4 and 19 DF, p-value: 0.001332. Significance: ***p < 0.001, **0.001 ≤ p < 0.01, and *0.01 ≤ p < 0.05.
Table 3B. matK.

Residuals:

Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
−0.0109222 | −0.0055359 | −0.0015226 | 0.0021567 | 0.0128313 |

Coefficients | Estimate | SE | t value | Pr(>\|t\|) | Signif. | RI (lmg) |
---|---|---|---|---|---|---|
Intercept | −9.64E-02 | 3.51E-02 | −2.75 | 0.04056 | * | |
Mean_baits_shotgun (ANB) | 1.03E-03 | 2.38E-04 | 4.32 | 0.00757 | ** | 9.4% |
Mean_gc_shotgun (GC) | 3.71E-03 | 1.07E-03 | 3.463 | 0.01798 | * | 29.9% |
Median_NM_shotgun (MMS) | −2.40E-02 | 5.50E-03 | −4.365 | 0.00726 | ** | 11.4% |
Relative_proportion_shotgun (RPS) | −7.22E-01 | 1.97E-01 | −3.663 | 0.01455 | * | 8.7% |
Capture_length (TL) | 5.42E-05 | 8.57E-06 | 6.32 | 0.00146 | ** | 40.5% |

Note: Residual standard error: 0.009611 on 5 degrees of freedom. Multiple R-squared: 0.9458, Adjusted R-squared: 0.8917. F-statistic: 17.46 on 5 and 5 DF, p-value: 0.003495. Significance: ***p < 0.001, **0.001 ≤ p < 0.01, and *0.01 ≤ p < 0.05.
Table 3C. rbcL.

Residuals:

Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
−0.0031958 | −0.0014235 | −0.0003324 | 0.0012415 | 0.0060094 |

Coefficients | Estimate | SE | t value | Pr(>\|t\|) | Signif. | RI (lmg) |
---|---|---|---|---|---|---|
Intercept | −0.1724822 | 0.0182924 | −9.429 | 1.31E-05 | *** | |
Mean_baits_shotgun (ANB) | 0.0005478 | 0.0000637 | 8.599 | 2.59E-05 | *** | 56.4% |
Mean_gc_shotgun (GC) | 0.0033783 | 0.0003354 | 10.071 | 8.05E-06 | *** | 43.6% |
Median_NM_shotgun (MMS) | −0.0026946 | 0.0016655 | −1.618 | 0.1443 | – | |
Relative_proportion_shotgun (RPS) | −0.078938 | 0.0412027 | −1.916 | 0.0917 | – | |

Note: Residual standard error: 0.002995 on 8 degrees of freedom. Multiple R-squared: 0.9656, Adjusted R-squared: 0.9484. F-statistic: 56.09 on 4 and 8 DF, p-value: 6.831e-06. Significance: ***p < 0.001, **0.001 ≤ p < 0.01, and *0.01 ≤ p < 0.05.
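The relative-importance values (RI, lmg) reported in Table 3 can be illustrated with a minimal Python sketch. It assumes that "lmg" refers to the Lindeman-Merenda-Gold decomposition, which averages each predictor's sequential gain in R² over all orderings of the predictors; the text does not state how the metric was computed. The function names and the synthetic data are hypothetical and are not taken from the study.

```python
import itertools

import numpy as np


def r_squared(X, y, cols):
    """R^2 of an ordinary least-squares fit of y on an intercept plus X[:, cols]."""
    if len(cols) == 0:
        return 0.0
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - float(residuals @ residuals) / float(total)


def lmg_importance(X, y):
    """Average sequential R^2 gain of each predictor over all predictor orderings."""
    n_predictors = X.shape[1]
    gains = np.zeros(n_predictors)
    orderings = list(itertools.permutations(range(n_predictors)))
    for ordering in orderings:
        included = []
        previous = 0.0
        for j in ordering:
            included.append(j)
            current = r_squared(X, y, included)
            gains[j] += current - previous
            previous = current
    return gains / len(orderings)


# Synthetic stand-in data (NOT the study's measurements): 24 taxa, 4 predictors.
rng = np.random.default_rng(0)
names = ["ANB", "MMS", "RPS", "TL"]
X = rng.normal(size=(24, 4))
y = 0.2 * X[:, 0] - 0.3 * X[:, 1] + 0.8 * X[:, 3] + rng.normal(scale=0.5, size=24)

gains = lmg_importance(X, y)
shares = gains / gains.sum()  # rescaled so that the shares sum to 1
for name, share in zip(names, shares):
    print(f"{name}: {100 * share:.1f}%")
```

The unscaled gains sum to the R² of the full model, which is why they can be read as shares of the explained variation. Enumerating all orderings is only practical for a handful of predictors, as is the case here.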
4 DISCUSSION
Several bioinformatic tools have been developed for target capture bait design, which cover multiple applications. For example, Baitfisher helps find and filter bait sequences in multiple sequence alignments (MSA) to maximize bait design in variable regions (Mayer et al., 2016), while the recently developed HUBDesign delivers a unique bait set from multiple annotated genomes (Dickson et al., 2021). Additionally, Syotti is designed to identify bait sequences equally distant to each other (Alanko et al., 2022) and SupeRbaits provides bait candidate sets for population genetics studies (Jiménez-Mena et al., 2022). Some of these tools share features with the approach presented in this study. For example, HUBDesign makes use of gene trees and reduces candidate gene sequences to a lowest common ancestor based on sequence identity (Dickson et al., 2021). However, to the best of our knowledge, none of the tools presently available is specifically intended for designing oligonucleotide baits capable of characterizing the taxonomic composition of an environmental sample, besides clustering sequences or bait sequences at a certain threshold (Foster et al., 2021; Murchie, Kuch, et al., 2021). Our approach was designed to fill this gap, and is made available through the user-friendly, open-source Python script “gotcha” on GitHub (https://github.com/Kevinnota/gotcha).
We showed that using only sequence data from 2050 taxa (66.1% and 44.6% of all unique rbcL and matK sequences, respectively) in the bait design was enough to map 99.9% of simulated reads coming from all 7949 taxa. The unmapped reads were significantly shorter than the simulated reads overall, which indicates that short reads, likely from diverged taxa not used in the design, were too short to overlap a bait sufficiently. The low number of reads that did not match a bait confirms that our strategy, which builds on ancestral sequence reconstruction to design hybridization baits, efficiently captures not only the molecular diversity in the gene tree but also sequences that were not represented in the tree. Gene trees for bait design, therefore, do not need to be fully comprehensive in terms of taxonomic diversity, as long as they include highly divergent species important for the study area. This may make bait sets designed for one geographic area suitable for other regions (see Figure S3A–G for the gene trees and the selected nodes for bait design).
Testing the designed bait set on mock communities, we observed a reduction in reads not mapping to capture targets from 11.7% to 0.4% for matK. This decrease in off-target reads confirms our probe set's effectiveness in capturing the desired fragments. Although our mock communities were produced by shredding PCR product from the target genes, we expect, based on results from Murchie, Kuch, et al. (2021) and Murchie, Monteath, et al. (2021), who also targeted chloroplast barcode regions, that this observation will hold even in cases where the number of off-target molecules is much higher than in our mock samples. The reads obtained in the mock communities showed a clear enrichment for longer reads (>110 nucleotides) and a depletion of shorter reads (30–70 nucleotides). This phenomenon has been recorded in many studies and is likely a direct effect of binding potential between the bait and the library molecule, with longer templates stabilizing the hybridization (Cruz-Dávalos et al., 2017; Suchan, Chauvey, et al., 2022; Suchan, Kusliy, et al., 2022). We further observed a post-capture depletion of DNA templates with low GC-content (<30%), which is also in line with previous studies reporting enhanced capture efficiency between 35% and 60% GC-content (e.g., in exome capture experiments; Chilamakuri et al., 2014). By comparing the alignment length between the baits and the reads obtained in shotgun and post-capture libraries, we found that, under the stringency conditions assessed in this study, the stability of bait-template annealing is enhanced when the alignment involves at least 50 bases. It is noteworthy that the 3× tiling used for designing baits resulted in 96.35% of simulated reads aligning to baits over at least 50 bases.
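The relationship between tiling density and the 50-base overlap threshold can be explored with a small geometric sketch in Python. It considers only the positional overlap between a read and the baits tiled along a single target and ignores mismatches and hybridization chemistry, so it is not expected to reproduce the 96.35% figure. The step of 27 nucleotides (roughly 80/3) is an assumption used to approximate 3× tiling; the function names and the 1000-nucleotide target are hypothetical.

```python
def tile_baits(target_len, bait_len=80, step=27):
    """Start positions of baits tiled along a target; a final bait sits flush with the end."""
    if target_len < bait_len:
        raise ValueError("target shorter than one bait")
    starts = list(range(0, target_len - bait_len + 1, step))
    last_start = target_len - bait_len
    if starts[-1] != last_start:
        starts.append(last_start)
    return starts


def best_overlap(read_start, read_len, bait_starts, bait_len=80):
    """Largest number of positions a read shares with any single bait."""
    read_end = read_start + read_len
    return max(
        max(0, min(read_end, b + bait_len) - max(read_start, b)) for b in bait_starts
    )


def fraction_reaching(target_len, read_len, min_overlap=50, bait_len=80, step=27):
    """Fraction of read placements whose best bait overlap is at least min_overlap."""
    starts = tile_baits(target_len, bait_len, step)
    placements = range(0, target_len - read_len + 1)
    hits = sum(
        best_overlap(s, read_len, starts, bait_len) >= min_overlap for s in placements
    )
    return hits / len(placements)


for read_len in (40, 50, 70, 110):
    dense = fraction_reaching(1000, read_len, step=27)   # roughly 3x tiling (assumed step)
    sparse = fraction_reaching(1000, read_len, step=80)  # 1x tiling, for contrast
    print(read_len, round(dense, 3), round(sparse, 3))
```

Under these assumptions, reads shorter than the threshold can never reach it, whereas denser tiling increases the share of short-but-sufficient reads that fall largely within a single bait.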
For ecological interpretation of capture results, it is important to assess the changes in taxonomic composition post-capture. We observed little change in the assignment of reads at different ranks between the shotgun and captured datasets (e.g., a similar number of reads was assigned to species before and after capture). Only a slight elevation in order-level assignments was observed with the rbcL marker, indicating a minor enrichment for conserved reads. The high number of reads assigned to species level in this study is due to the restriction of the mapping database to only the taxa in the mock communities. The taxonomic resolution of the markers will depend on the species composition at the sampled location. A potential way to increase the taxonomic resolution is to exclude baits that occur in more than one genus, or to add more markers that have higher taxonomic resolution in the targeted species.
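The suggestion to exclude baits occurring in more than one genus could be sketched in Python as follows. The sketch assumes that each bait has already been matched to the taxa it occurs in and that taxa are given as binomial names, so that the genus is the first word; the function name, bait identifiers, and bait-to-taxon matches are hypothetical and do not come from the bait set described here.

```python
def genus_specific_baits(bait_hits):
    """Return the identifiers of baits whose matching taxa all belong to one genus."""
    kept = []
    for bait_id, taxa in bait_hits.items():
        genera = {name.split()[0] for name in taxa}
        if len(genera) == 1:
            kept.append(bait_id)
    return kept


# Hypothetical bait-to-taxon matches, for illustration only.
example_hits = {
    "bait_1": ["Quercus robur", "Quercus petraea"],   # one genus: kept
    "bait_2": ["Quercus robur", "Fagus sylvatica"],   # two genera: excluded
    "bait_3": ["Pinus sylvestris"],                   # one genus: kept
}
kept = genus_specific_baits(example_hits)
print(kept)
```

Such a filter trades coverage for resolution: removing shared baits would also reduce the number of baits per taxon, which is itself one of the predictors of capture efficiency identified above.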
We further show a moderate, yet significant, correlation between taxon abundances in shotgun and capture reads, indicating that reads obtained after capture broadly follow the expected pattern. However, this pattern is marker specific, since a non-significant correlation was observed for the matK marker. The difference between markers highlights that, although captured with the same bait-set, the two markers experienced read abundance distortions to different degrees. When investigating the variables that could explain the observed patterns, we found that different sets of variables influence each gene. The best linear model tested on both markers together could explain only ~50% of the variation. However, when linear regression models were applied to the markers independently, the tested variables explained nearly 95% of the variation. For matK, all variables tested were significant, but those with the highest relative importance were TL (40.5%) and GC (29.9%). The importance of TL likely also explains the increase in the proportion of matK reads after capture, since this marker had a larger capture region for each taxon (see Table 1). The increase in matK reads might have been even greater if this marker had a GC-content comparable to that of rbcL. Furthermore, MMS, ANB, and RPS together have a relative importance of ~29.6%. For the much less taxonomically complex marker rbcL, ANB and GC were sufficient to explain nearly all of the variation observed. The captured part of the rbcL gene does not vary in size across taxa and is highly conserved. This suggests that, for rbcL, MMS is not affecting the capture process, but rather that ANB is predominantly driving any observed change.
Capture methods typically tolerate fragments showing divergence up to 10%–13% (Mason et al., 2011; Paijmans et al., 2016; Peñalba et al., 2014). Nonetheless, there is a clear indication that taxa with fewer nucleotide differences and a higher number of mapping baits tend to be more efficiently captured. This implies that any attempt to decrease the number of baits will introduce bias into the capture process, which varies depending on the taxon. These findings collectively show that the bait design fundamentally introduces biases in read abundance. While these biases might be minor in the case of rbcL, they can become significant in complex markers such as matK, with low overall GC and high nucleotide diversity. We used our shotgun libraries to obtain all tested variables except TL. Among these variables, only RPS is truly sample dependent, as relative abundance calculations require shotgun sequencing and are therefore unique to each sample. The relative importance of RPS was only minor, at 8.7%, in our mock communities. Yet, it is unclear whether this relative importance will remain low in other mixtures. Testing this requires mock communities with variable taxon abundances and taxon compositions. The remaining variables, such as MMS, ANB, and GC, can be calculated for each taxon from simulated data generated with the read length distributions of the samples' shotgun libraries.
5 CONCLUDING REMARKS
We successfully developed a universal bait set for four plant orders using a small number of baits. Based on simulated reads, these baits are capable of capturing sequence diversity beyond what was originally used for bait design. We further show experimentally that the produced probes successfully captured all the taxa present in the assembled mock communities, over a large GC-content range. The overall distortion of relative read proportions after capture was minimal for rbcL but high for matK. Our results strongly suggest that hybridization capture is notably influenced by a multitude of experimental factors, ranging from bait GC-content to taxonomic imbalance (i.e., different abundances of taxa in a sample).
These confounding variables ultimately influence the results, particularly concerning apparent relative abundances. Although biases are present, the changes in read abundance in our mock communities were predictable based on variables that can be easily calculated. This enables potential “in silico” investigation of bait-sets before synthesis.
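As a rough illustration of such an in silico check, the fitted rbcL model from Table 3C can be applied to candidate values of the predictors. The Python sketch below uses the reported coefficients as they stand; the input values are invented, their units are assumptions (GC in percent, RPS as a proportion), and the function name is hypothetical. A model fitted on one set of mock communities may not transfer to other bait sets or mixtures.

```python
# Coefficients of the rbcL model as reported in Table 3C.
RBCL_MODEL = {
    "intercept": -0.1724822,
    "ANB": 0.0005478,
    "GC": 0.0033783,
    "MMS": -0.0026946,
    "RPS": -0.078938,
}


def predict_change(anb, gc, mms, rps, model=RBCL_MODEL):
    """Predicted post-capture change in relative read abundance for one taxon."""
    return (
        model["intercept"]
        + model["ANB"] * anb
        + model["GC"] * gc
        + model["MMS"] * mms
        + model["RPS"] * rps
    )


# Hypothetical taxa; GC is assumed to be in percent and RPS a proportion.
candidates = {
    "taxon_A": dict(anb=120, gc=44.0, mms=1, rps=0.05),
    "taxon_B": dict(anb=60, gc=42.0, mms=3, rps=0.10),
}
for name, values in candidates.items():
    print(name, round(predict_change(**values), 4))
```

Comparing the predicted changes across taxa before synthesis would indicate which taxa a candidate bait set is likely to over- or under-represent.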
Based on our observations, we recommend the following strategies: (1) consider avoiding markers with low or high GC content, or alternatively exclude baits with low (<30%) or high (>60%) GC content to mitigate GC biases; (2) aim for a balanced bait design, distributing an equal number of baits per taxon with an equal number of mismatches, when estimating relative abundances; (3) limit interpretation of capture data to qualitative assessments (i.e., presence–absence of taxa rather than quantitative interpretations based on abundances). We also advocate including mock community analyses covering a wide variety of taxa to enable correction of major biases introduced during capture.
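Recommendation (1) amounts to a simple filter on bait sequences, sketched below in Python. The 30% and 60% cut-offs follow the recommendation; the function names and the toy sequences are hypothetical, and ambiguous bases are simply ignored when computing GC content, which is an assumption of this sketch.

```python
def gc_percent(seq):
    """GC content of a sequence in percent, ignoring characters other than A, C, G, T."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def filter_baits_by_gc(baits, low=30.0, high=60.0):
    """Keep baits whose GC content lies within [low, high] percent."""
    return [bait for bait in baits if low <= gc_percent(bait) <= high]


# Toy sequences for illustration; real baits in this study are 80-mers.
toy_baits = ["ATATATATAT", "ATGCATGCAT", "GCGCGCGCGC"]
print(filter_baits_by_gc(toy_baits))
```

Because removing baits changes the number of baits per taxon, such a filter is best applied together with recommendation (2), so that bait numbers stay balanced across taxa.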
While our experimental design revealed important factors driving biases influencing taxonomic abundances, further research is required to shed light on other aspects of the methodology. These include assessing the power of target capture approaches to detect rare species, their sensitivity relative to amplicon-based metabarcoding techniques, and the influence of annealing temperatures and incubation times on the capture efficacy and taxonomic profiles. Additionally, testing multiple bait design strategies and capturing pooled libraries are recommended. Finally, it is important to evaluate the possible benefits or potentially stronger capture biases deriving from a second round of capture, a common practice for low-template DNA samples (e.g., ancient DNA).
AUTHOR CONTRIBUTIONS
K.N., L.O., L.P., and C.V. conceived the study. K.N. designed the baits, with input from L.O. The wet-lab experiments were designed by K.N., L.P., C.V., and L.O., and the laboratory work was performed by M.G. The manuscript was drafted by K.N., with contributions from L.O., C.V., and L.P. All authors contributed to data interpretation and to revising the manuscript.
ACKNOWLEDGMENTS
We thank Brian Brunelle from Arbor Biosciences for help with the bait design, and Petra Vainio, Nordic Sales Manager at TATAA Biocenter, for providing advice on the capture design. The baits were funded by the Swedish Phytogeographical Society. This project has received funding from the CNRS, University Paul Sabatier (AnimalFarm IRP) and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreements 681605-PEGASUS and 101071707-Horsepower). The Python implementation of “gotcha” (https://github.com/Kevinnota/gotcha) was developed in collaboration with Giobbe Forni at the University of Bologna. Open Access funding enabled and organized by Projekt DEAL.
CONFLICT OF INTEREST STATEMENT
No conflicts of interest.
Open Research
DATA AVAILABILITY STATEMENT
The raw sequencing data produced in this study are available at ENA under study accession PRJEB76475 and run accessions ERR13252521-ERR13252528. The bait sequences will be available as a Supplementary File. The Python script to design phylogenetically informed probes is available on GitHub (https://github.com/Kevinnota/gotcha). The scripts for parsing the bowtie2-mapped reads, used for analysis and plotting, are available on GitHub (https://github.com/Kevinnota/capture_mock_communities), and all data files required to produce the figures are accessible on figshare (https://doi.org/10.6084/m9.figshare.26044429).