Funding: Funding was provided by the Gordon and Betty Moore Foundation (Andes Amazon Program). The authors declare no competing interests, financially or otherwise.


ABSTRACT

Comprehensive, curated, and current DNA barcode reference databases are essential for both the identification of single specimens and for the interpretation of metabarcoding data. In the case of plants, nuclear (ITS) and plastid (rbcL, matK) markers are commonly used together. Because the plastid regions are segments of protein-coding genes, their alignment and analysis are usually straightforward. By contrast, the assembly and validation of ITS records is considerably more difficult for two reasons: the prevalence of indels and intraindividual sequence variation. This complexity has provoked the development of several workflows to support the curation of reference databases for the internal transcribed spacer (ITS) region for plant barcoding. However, the pipelines used to create these databases lack functionalities which are essential to ensure a solid post-analytical validation. This paper presents a new workflow to address these shortcomings, with the goal of enhancing the reliability and accuracy of plant barcoding studies. We furthermore demonstrate that clustering of reference databases results in a substantial drop in the fraction of queries that gain a correct species-level assignment. By contrast, setting an acceptance threshold for identifications, based on the distance between query and match, leads to a meaningful reduction of error rates in incomplete reference databases.

1 Introduction

DNA barcoding offers an alternative to morphological identification methods (Kress 2017). Instead of relying on morphologically diagnostic traits, plant tissue from any developmental stage can be used to infer a taxonomic identity. Plant metabarcoding builds on this concept and expands the methodology to samples that derive from multiple species. This approach facilitates dietary analysis and the study of ecological interactions using matrices such as stomach contents or fecal matter (Hollingsworth et al. 2011; Bruni et al. 2015; Kartzinel et al. 2015). Other important applications are reconstructing plant species assemblages from ancient sediment, identifying food adulteration, validating herbal medicine, and tracking airborne pollen loads (Bruno et al. 2019; Urumarudappa et al. 2020; Huang et al. 2021; Krinitsina et al. 2023).

It is crucial to be able to compare the results from any barcoding study with a comprehensive, curated, and current DNA barcode reference database. Ideally, the reference database should include multiple sequences of every plant species obtained from well-identified, vouchered plant specimens (Kolter and Gemeinholzer 2021a). At present, reference databases are far from this ideal as more than 80% of flowering plant species either lack coverage entirely or have few records. Among the molecular markers used for plant barcoding, the internal transcribed spacer (ITS) region within the nuclear ribosomal DNA array (nrDNA) has de facto become a core plant barcode (China Plant BOL Group et al. 2011). GenBank (Sayers et al. 2019) provides the most comprehensive, easily accessed source of plant ITS sequences. However, these records lack consistent annotation and are not screened for pseudogenes, and misidentifications are prevalent. RefSeq, a curated database also hosted by the National Center for Biotechnology Information (NCBI), partially addresses these deficits, but it only encompasses fungal ITS (O'Leary et al. 2016). Until now, only Canada (Kuzmina et al. 2017) and the UK (Jones et al. 2021) have published near-complete ITS reference libraries for use in a national context. In summary, a comprehensive, reliable DNA barcoding reference database for plants currently requires downloading and filtering GenBank data.

Multiple pipelines to create DNA barcode reference databases already exist, each with unique attributes. Four general workflows that can be applied to plants and other markers are (1) BCdatabaser, which focuses on user-friendly execution and automatic uploads to Zenodo (Keller et al. 2020); (2) rCRUX, which uses an iterative BLAST approach to identify unannotated marker sequences (Curd et al. 2024); (3) CRABS, which combines in silico PCR with an alignment to identify marker sequences for inclusion in the reference database (Jeunen et al. 2023); and (4) MetaCurator, which uses hidden Markov models to extract the desired marker region from a set of sequences (Richardson et al. 2020).

However, in general, plant-specific curatorial steps are necessary to deal with ITS pseudogenes and a lack of ITS primer specificity for plants, which can result in off-target amplification (Kolter and Gemeinholzer 2021b; Zhang et al. 2022). Although exposure to these complexities can be reduced by wet lab protocols, such as the use of plant-specific primers or adjustments in PCR conditions to avoid pseudogene amplification (Buckler et al. 1997; Cheng et al. 2016), these methods are rarely employed. Plant-specific curatorial steps are implemented in the following three protocols: (1) The static ITS Database V uses secondary structure validation to verify ITS2 sequences (Ankenbrand et al. 2015); (2) PLANits uses the software ITSx to annotate ITS marker sequences (Bengtsson-Palme et al. 2013; Banchi et al. 2020); (3) A database curation script by Quaresma et al. (2023) uses a dynamically executed workflow implementing plant-specific filter steps.

The present script was developed in response to the fact that current solutions lack functionalities essential to post-analytical validation and fail to retain sequence metadata. The features combined here for the first time are (1) automated taxonomic curation, (2) GenBank metadata retention, (3) occurrence information, and (4) the automatic addition of fungal sequences as a sink for contaminated query sequences. We further investigated the impact of database clustering and the exclusion of database matches based on distance thresholds on the success in securing a species-level identification.

2 Methods

The R script developed in this study is available on GitHub together with detailed usage notes (https://github.com/Andreas-Bio/ACVPMBD). R package snapshot information, created by the R package renv, is available to increase reproducibility (Ushey and Wickham 2023). All steps can be run without user intervention (Figure 1). Whenever the script is executed, a pre-set query, which can be modified by the user, determines which GenBank records will be downloaded by the R package rentrez (Winter 2017). Downloaded sequences without a species-level assignment are discarded, while other metadata is retained in tabular format and the sequence is formatted using the R package Biostrings (Pagès and Aboyoun 2017). Initial filters remove sequences whose species names contain non-alphabetical characters (e.g., numbers or symbols), which commonly indicate formatting errors or placeholders. Sequences with more than 2.5% ambiguous nucleotides (1% for 5.8S) or those shorter than a customizable minimum length (default: 100 bp for ITS1 and ITS2) are also discarded.
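The initial filters described above can be sketched in a few lines. The following Python fragment is only an illustration and is not taken from the published R script: the function names are hypothetical, and the rule that a species name consists solely of ASCII letters separated by single spaces is an assumption made for the example.

```python
import re

MAX_AMBIGUOUS = 0.025   # default from the text: 2.5% (1% for 5.8S)
MIN_LENGTH = 100        # default from the text: 100 bp for ITS1 and ITS2

# Assumption for this sketch: a valid species name is two or more words
# made of ASCII letters only, separated by single spaces.
SPECIES_NAME = re.compile(r"^[A-Za-z]+( [A-Za-z]+)+$")


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of characters that are not A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return 1.0
    return sum(base not in "ACGT" for base in seq) / len(seq)


def passes_initial_filters(species: str, seq: str,
                           max_ambiguous: float = MAX_AMBIGUOUS,
                           min_length: int = MIN_LENGTH) -> bool:
    """Apply the name, length, and ambiguity filters to one record."""
    if not SPECIES_NAME.match(species):
        return False                      # e.g. "Quercus sp. 1"
    if len(seq) < min_length:
        return False                      # shorter than the minimum length
    return ambiguous_fraction(seq) <= max_ambiguous
```

A record such as `("Quercus sp. 1", ...)` is rejected by the name rule before its sequence is inspected, which mirrors the order in which the text lists the filters.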

FIGURE 1. Flowchart of the data curation pipeline used to harvest and validate ITS data from GenBank. Overview of the ITS reference database curation workflow, from GenBank retrieval to filtering, taxonomic validation, and final data export. Spike-in sequences and GBIF occurrence data are incorporated before export to enhance coverage and usability.

If present, the 26S, ITS1, 5.8S, ITS2, and 18S regions are annotated, as detected by ITSx (Bengtsson-Palme et al. 2013). Chimeric sequences detected by ITSx are excluded, and sequences that fail to be annotated for either ITS1 or ITS2 are removed. In addition, sequences shorter than 100 bp (default) in ITS1 or ITS2 are excluded, and sequences with 5.8S region lengths outside the range of 150 to 170 bp (default) are filtered out, based on known Tracheophyta-specific thresholds. ITSx needs a small segment of the gene regions adjacent to ITS1/2 to function properly. However, some data uploaded to GenBank only include the core ITS1/2 regions as the flanking areas have been trimmed. To address this, a recovery effort is made by comparing successfully annotated ITS1/2 regions with other sequences. Unannotated sequences with a global (including terminal gaps) similarity of > 85% to the annotated sequences are retained, while the others are discarded. Sequence information and metadata are combined into a tab-delimited database file. Each dataset record is uniquely identified by its GenBank accession number, and it can have up to six associated sequences (ITS1, ITS2, full ITS, 26S, 18S, and 5.8S).
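As a rough illustration of the length rules applied after annotation, the sketch below filters one annotated record. It is not the authors' code. Several points are assumptions made for the example: the record layout (a dictionary of region name to sequence), the decision to drop the whole record when the 5.8S length is out of range, and the decision to drop only the affected spacer when ITS1 or ITS2 is too short. The recovery of unannotated sequences by global similarity is not modeled here.

```python
from typing import Dict, Optional

MIN_SPACER_LEN = 100        # default minimum ITS1/ITS2 length (bp), from the text
RANGE_5_8S = (150, 170)     # default accepted 5.8S length range (bp), from the text


def filter_annotated_record(regions: Dict[str, Optional[str]],
                            chimeric: bool = False) -> Dict[str, str]:
    """Return the regions that survive filtering; an empty dict means 'drop record'."""
    if chimeric:
        return {}                                   # chimeras are excluded

    # Keep only regions that were actually annotated.
    kept = {name: seq for name, seq in regions.items() if seq}

    # Assumption: an out-of-range 5.8S removes the whole record.
    five_eight = kept.get("5.8S")
    if five_eight is not None and not (RANGE_5_8S[0] <= len(five_eight) <= RANGE_5_8S[1]):
        return {}

    # Assumption: a spacer below the minimum length removes only that spacer.
    for spacer in ("ITS1", "ITS2"):
        if spacer in kept and len(kept[spacer]) < MIN_SPACER_LEN:
            del kept[spacer]

    # A record with neither ITS1 nor ITS2 left is removed.
    if "ITS1" not in kept and "ITS2" not in kept:
        return {}
    return kept
```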

The taxonomy assigned to the sequence records extracted from NCBI is compared to the GBIF backbone taxonomy (GBIF Secretariat, 2023). In detail, a query, encompassing taxonomic levels from species (with author's epithet) to kingdom as per NCBI, is matched against the GBIF taxonomy backbone using the R package rgbif (Chamberlain and Boettiger 2017). In cases of discordance, the GBIF taxonomy has priority: the taxonomic assignment by NCBI is replaced but still retained as metadata.

The number of sequences per species is limited by a user-defined upper threshold applied during dereplication. The default of 10 is based on the results of an earlier study (Kolter and Gemeinholzer 2021a). If more than 10 sequences are available, priority is first given to full-length ITS sequences as identified by ITSx. Among these, sequences with distinct GenBank accession prefixes (first three characters) are favored to maximize genetic diversity, as different sequencing projects are typically assigned unique prefix codes.
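The selection order described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published implementation: the record fields (`accession`, `species`, `full_its`) are hypothetical, and ties within a tier are resolved by input order.

```python
from collections import defaultdict
from typing import Dict, List


def rank_by_prefix(records: List[dict]) -> List[dict]:
    """Place records with a not-yet-seen accession prefix before repeated prefixes."""
    seen, first, repeated = set(), [], []
    for rec in records:
        prefix = rec["accession"][:3]      # first three characters, as in the text
        (repeated if prefix in seen else first).append(rec)
        seen.add(prefix)
    return first + repeated


def dereplicate(records: List[dict], max_per_species: int = 10) -> List[dict]:
    """Cap the number of records per species, preferring full-length ITS."""
    by_species: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    kept = []
    for recs in by_species.values():
        if len(recs) <= max_per_species:
            kept.extend(recs)              # nothing to trim
            continue
        full = [r for r in recs if r["full_its"]]
        partial = [r for r in recs if not r["full_its"]]
        # Full-length records first; within each tier, distinct prefixes first.
        ordered = rank_by_prefix(full) + rank_by_prefix(partial)
        kept.extend(ordered[:max_per_species])
    return kept
```

With a cap of two and the records `AB0001`, `AB0002`, `KX1000` (all full-length) and `MN2000` (partial), the sketch keeps `AB0001` and `KX1000`, because the second `AB0` record repeats a prefix already represented.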

Sequences are then examined to remove those reflecting contamination. To aid the recognition of contaminants, a library of outgroup sequences was established. This was accomplished by extracting one sequence from each fungal genus in the UNITE database (version 9.0; Abarenkov et al. 2023) and adding a manually curated subset of sequences for Bryophyta and Algae from GenBank. Using a dynamic distance threshold derived from data in a previous study (Kolter and Gemeinholzer 2021a), sequences matching any outgroup sequence were removed from the ITS databases. To further improve post-analytical identification fidelity, especially in metabarcoding studies where fungal amplification is common, these fungal outgroup sequences were reintroduced as spike-ins after the filtering step. These spike-in sequences serve as taxonomic sinks, capturing off-target matches and preventing misclassification of fungal reads as plant sequences. This approach enhances the robustness of the reference database by accounting for known sources of cross-amplification and non-target taxa.

Multiple distance-based filtering steps were incorporated to recognize probable labeling or identification errors. First, using a leave-one-out cross-validation (LOOCV), the family of the top hit for each sequence was compared with that assigned to the query. When a mismatch was detected, both sequences (query, match) were again queried against the remaining records in the database. If the mismatch persists, the query sequence is removed. This step also leads to the removal of families represented by a single sequence. All following distance-based filtering steps were also applied to the 5.8S region with tightened thresholds due to its conserved nature. A second LOOCV filtering step removes query sequences from the database if the species with the top hit does not match the query species and the median distance of the hits to the query species is greater than a set threshold (default: 10%, 5.8S: 5%). In the third filtering step, any sequence not matching any other sequence within a defined threshold (default: 30%, 5.8S: 10%) is removed from the database. Finally, every region (separately: ITS1/2, full ITS, 5.8S) with more than a set threshold of ambiguous nucleotides (default: 2.5%, 5.8S: 1%) is removed from the database.
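The third filtering step, which removes any sequence without a sufficiently close neighbor, is the simplest of these to illustrate. The sketch below is an assumption-laden toy, not the published code: `p_distance` is a stand-in that compares equal-length strings position by position, whereas the actual workflow relies on alignment-based distances.

```python
from typing import Callable, Dict


def p_distance(a: str, b: str) -> float:
    """Toy distance: fraction of mismatching positions in two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("toy distance needs non-empty, equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def remove_isolated(seqs: Dict[str, str],
                    max_distance: float = 0.30,   # default 30% (10% for 5.8S)
                    distance: Callable[[str, str], float] = p_distance) -> Dict[str, str]:
    """Keep a sequence only if at least one other sequence lies within max_distance."""
    kept = {}
    for acc, seq in seqs.items():
        has_neighbor = any(
            distance(seq, other) <= max_distance
            for other_acc, other in seqs.items()
            if other_acc != acc               # leave the sequence itself out
        )
        if has_neighbor:
            kept[acc] = seq
    return kept
```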

Distributional information for each species is added to the database based on GBIF occurrence data. The countries, along with their respective occurrence count, if above three, are reported in alphabetical order and grouped by the United Nations region definition.
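The occurrence summary can be pictured as a small grouping operation. The following Python sketch is illustrative only: the country-to-region lookup is a hypothetical input, and the output layout (`"Country (count)"` strings keyed by region) is an assumption rather than the format written by the script.

```python
from typing import Dict, List


def summarise_occurrences(counts: Dict[str, int],
                          region_of: Dict[str, str],
                          min_count: int = 3) -> Dict[str, List[str]]:
    """Group countries with more than min_count occurrences by region, alphabetically."""
    grouped: Dict[str, List[str]] = {}
    for country in sorted(counts):                 # alphabetical order
        n = counts[country]
        if n > min_count:                          # "if above three"
            region = region_of.get(country, "Unknown")
            grouped.setdefault(region, []).append(f"{country} ({n})")
    return {region: grouped[region] for region in sorted(grouped)}
```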

The script outputs the whole database in a tab-delimited format, as a reduced database without sequence information, and for any specified ITS region as a fasta file. An ITS sequence length plot, summarized by family, and an interactive taxonomic summary plot are automatically generated (Ondov et al. 2011). In addition to a repository on Zenodo (10.5281/zenodo.10257823), which includes all temporary files, the database has also been uploaded to BOLD (project code: PREF), which makes the data available through the publicly accessible identification engine. The script output presented in this publication was generated on 20 Nov 2023 using default parameters.

3 Cluster threshold impact on identification success

Clusters were generated from three sequence datasets (ITS1, ITS2, and full ITS) derived from the script output of the 20 November 2023 run (see above), using seven predefined similarity thresholds: 1.0, 0.995, 0.99, 0.98, 0.97, 0.96, and 0.95. Viewed from the perspective of sequence matching, a clustering threshold of 1 generates an identical result to an unclustered database if multiple top matches are evaluated. The clusters were given new labels based on the lowest taxonomic lineage shared among members of each cluster.
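Relabeling a cluster by the lowest lineage shared among its members amounts to taking the common prefix of the members' lineages. A minimal sketch, assuming each lineage is a tuple ordered from the highest rank down to species (the representation is an assumption for this example):

```python
from typing import Sequence, Tuple


def lowest_shared_lineage(lineages: Sequence[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Return the longest lineage prefix shared by every member of a cluster."""
    shared = []
    for ranks in zip(*lineages):          # walk rank by rank, top down
        if len(set(ranks)) != 1:          # members disagree at this rank
            break
        shared.append(ranks[0])
    return tuple(shared)
```

A cluster containing Quercus robur and Q. petraea would thus be labeled at genus level, while a cluster whose members all belong to one species keeps its species label.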

For each ITS region, sequences from species represented by at least two accessions (45,302 ITS1; 47,677 ITS2; 45,302 full ITS) were matched against clustered reference databases generated from the same datasets. Each comparison was performed across seven similarity thresholds using VSEARCH (Rognes et al. 2016). To prevent self-matching, the query sequence was removed from the matching cluster before evaluation, and the cluster's taxonomic label was updated based on the remaining sequences. Top hits were defined as the set of one or more reference sequences sharing the highest percent identity to the query, calculated using global alignment with terminal gaps excluded. These matches were evaluated for taxonomic consistency with the query. A match was considered a successful species-level identification if all top-hit sequences were assigned to a single species and their species name was identical to that of the query (e.g., all matches and the query identified as Quercus robur). If the top-hit set included sequences from multiple species but the query species was among them (e.g., matches from Quercus robur, Q. petraea, and the query was Q. robur), or if the matches were assigned only to a higher-level taxon (e.g., genus Quercus) that included the query species, the identification was classified as ambiguous. A match was considered unsuccessful if the query species was absent from the top-hit set (e.g., matches from Q. petraea and Q. cerris but query was Q. robur) or if the query did not fall within the assigned higher-level taxonomic label of the top matches (e.g., matches labeled as family Fagaceae, while the query belonged to a different family) (Figure 2). A total of 4,116,117 evaluations, across all combinations, were combined in an alluvial plot by the R packages ggplot2 and ggalluvial (Brunson 2020; Wickham 2016).
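The three outcome classes defined above can be expressed compactly if each top-hit label is stored as a lineage tuple that may stop at a higher rank. The sketch below is one possible encoding and not the code used in the study; the tuple representation and the function name are assumptions.

```python
from typing import Sequence, Tuple

Lineage = Tuple[str, ...]   # ordered from highest rank down to species


def classify_identification(query: Lineage, top_hits: Sequence[Lineage]) -> str:
    """Classify a set of top hits as 'success', 'ambiguous', or 'unsuccessful'."""
    # A hit is compatible if its label is the query lineage or a higher taxon containing it.
    compatible = [hit for hit in top_hits if query[:len(hit)] == hit]
    if not compatible:
        return "unsuccessful"             # query species absent from the top-hit set

    all_species_level = all(len(hit) == len(query) for hit in top_hits)
    if all_species_level and len(set(top_hits)) == 1:
        return "success"                  # one species only, identical to the query

    return "ambiguous"                    # several species, or only a higher-level label
```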

4 Threshold-based rejection of database matches

We re-analyzed the matches between ITS sequences and the clusters generated earlier with a similarity threshold of 1 (ESV) (Figure 3). In contrast to the earlier evaluation, two query sets were used: one restricted to species with at least two sequences and another comprising all species, including those represented by a single sequence. Because self-hits were removed for both query sets, the second set has a higher incidence of incorrect or ambiguous assignments: removing the only sequence of a species eliminates that species from the reference database, which simulates an incomplete database (Figure 4). For each of seven similarity thresholds (1, 0.995, 0.99, 0.98, 0.97, 0.96, and 0.95), top hits whose similarity to the query (percent nucleotide identity, excluding terminal gaps) fell below the threshold were categorized as rejected. These steps were repeated for each ITS region, each threshold, and both query sets. All other evaluation and visualization steps were identical (see above).
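The rejection step can be illustrated as a relabeling of already evaluated matches. In this hedged sketch, each evaluation is assumed to be a pair of outcome label and top-hit similarity (expressed as a fraction); the data layout and function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

THRESHOLDS = (1.0, 0.995, 0.99, 0.98, 0.97, 0.96, 0.95)   # as listed in the text


def apply_threshold(evaluations: Iterable[Tuple[str, float]],
                    threshold: float) -> Counter:
    """Count outcomes after relabeling matches below the threshold as 'rejected'."""
    return Counter(
        "rejected" if similarity < threshold else outcome
        for outcome, similarity in evaluations
    )


# Usage: tally the outcome categories at every threshold.
example = [("success", 1.0), ("success", 0.992),
           ("unsuccessful", 0.96), ("ambiguous", 0.985)]
tallies = {t: apply_threshold(example, t) for t in THRESHOLDS}
```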

5 Results

The ITS reference database for plants generated by the present script (10.5281/zenodo.10257823) encompasses 271,418 records, each indexed by its GenBank accession number (Table 1). Every record includes 40 columns of metadata and DNA sequences (Supplementary Table 1). Over 80% of ITS2 and 90% of ITS1 sequences were derived from the full ITS (226,407) sequence annotations (Table 1). In addition, the database includes sequences for three additional nrDNA regions (110,039 partial 18S, 234,477 full 5.8S, 86,003 partial 26S), which can be used for primer design or custom data validation. Approximately 70% of all records have voucher information (defined by the sequence owner), extracted from GenBank, linking the sequence to a physical sample (10.5281/zenodo.10257823). Overall, 86% of species possessed at least one data record with voucher information. The database lists 19,945 unique author names and 10,907 unique publications. Almost all records (98%) have at least one GBIF occurrence (summarized as country and count). Comparison of the species names associated with the accessions downloaded from GenBank indicated that approx. 12% (12,861/107,710) are regarded as synonyms by GBIF. NCBI taxonomy queries to the GBIF backbone taxonomy yielded a high match rate, with an average confidence score of > 99.5% for returned matches.

TABLE 1. Database summary counts by filtering steps.

Step	Subset	Sequences	Species	Genera	Families	Orders
GenBank download	all records	437414	115337	11501	455	78
species-level ID & ambiguous nucleotide filter	all records	409034	112545	11339	449	77
ITSx	ITS1	328493	104283	10747	417	75
	ITS2	358985	106454	10927	437	76
	full ITS	316524	101794	10589	413	75
	all records	370954	108567	11049	441	76
rescue sequences	ITS1	336786	106667	10906	417	75
	ITS2	379015	108440	11057	440	77
	full ITS	316524	101794	10589	413	75
	all records	399277	111201	11218	444	77
GBIF taxonomy	ITS1	335698	98200	9914	393	76
	ITS2	377885	99644	10075	415	78
	full ITS	315505	94010	9663	389	76
	all records	398078	102046	10201	419	78
dereplication	ITS1	245596	98200	9914	393	76
	ITS2	264245	99644	10075	415	78
	full ITS	232045	94010	9663	389	76
	all records	277796	102046	10201	419	78
fungal filter	ITS1	245526	98190	9912	393	76
	ITS2	264159	99629	10072	415	78
	full ITS	231996	94002	9661	389	76
	all records	277689	102029	10198	419	78
distance based filter & ambiguous nucleotide filter	ITS1	239815	96320	9650	343	75
	ITS2	258010	97720	9787	360	76
	full ITS	226407	92111	9386	339	74
	all records	271418	100132	9919	363	77

6 Cluster threshold impact on identification success

Figure 3 shows how success in species identification using the three ITS databases is impacted by clustering thresholds. Considering only those species with at least two ITS sequences (191,927 ITS1; 219,396 ITS2; 176,714 full ITS), comparisons against clustered reference databases generated from the same sequence sets revealed a progressive decline in species-level identification success as clustering thresholds were lowered (Figure 3). For example, clustering at 97% identity led to a species assignment for < 25% of the queries, roughly half the success for an unclustered (or clustered at 100% identity) database (Figure 3). Considering all three ITS sequence sets, the percentage of identified sequences dropped at least twice as fast as the error rate for species identification for each clustering step (Figure 3). The full ITS region is more sensitive to clustering: because of its length, a given similarity percentage covers more nucleotides, which disproportionately diminishes its identification gain over ITS1/2 even at the smallest clustering step (Figure 3).

7 Similarity threshold impact on species-level identification success

Figure 4 shows the impact of rejecting reference database matches for varying similarity thresholds using a database with a cluster threshold of 1, representing all exact sequence variants (ESV). The first analysis, which is based on species with at least two ITS sequences, shows that only accepting reference database matches with a similarity higher than 0.99 leads to a higher loss of successful identifications than of erroneous identifications (Figure 4A). The use of lower acceptance thresholds (0.98, 0.97, 0.96, and 0.95) has only a minor impact, changing error and identification rates by < 1% (Figure 4A). Depending on the ITS region, a rejection threshold of 0.98 removes about 5% of all evaluated matches, with the split between removed erroneous and successful identifications being near equal (Figure 4A).

The second analysis employed an expanded dataset that also included species with only one sequence (~54% of total species). Because the removal of self-hits prevents the latter species from receiving a correct match, the error rate is twice as high as in the previous analysis (Figure 4). At all rejection thresholds, except 1, more erroneous identifications are removed than successful identifications (Figure 4B). For full ITS, a threshold of 0.995 rejected 28.5% of all matches: approximately 18% of all matches were rejected erroneous identifications, 7.5% were rejected correct species assignments, and 3% were rejected ambiguous assignments. Overall, this decreases the error rate from ~29.5% to ~11.5%, comparable to the 12.5% obtained in the previous analysis, in which species represented by one sequence were excluded, at a rejection threshold of 0.98 with 5% total rejections (Figure 4).

8 Discussion

8.1 Evaluation and selection of ITS plant reference databases

Multiple workflows have been used to create ITS reference databases for plants, but there is no objective metric to compare their performance. The reference database with the most species will inevitably be the one that has received the least amount of curation. The database(s) with the highest entropy or with the most unique sequences will have employed the most relaxed filtering for pseudogenes (Dubois et al. 2022). The database with the highest sequence count after dereplication will simply be the one which has been downloaded from GenBank most recently. The database with the highest rate of successful identifications has been generated by the most aggressive filtering after cross-validation, using a priori taxonomic knowledge, but this approach potentially removes valid sequence variation within a species. Ideally, the database which best represents the intraspecific diversity of any given species should be employed. However, that is unverifiable due to the low number of sequences per species in GenBank. To ensure the quality of this database, we implemented the following five steps to address the plant-specific challenges in the use of ITS for barcoding:

(1) Use ITSx for Annotation: Although primarily designed to annotate individual segments of the ITS array (18S, ITS1, 5.8S, ITS2, and 26S), it can be used to identify and remove pseudogenes (Buckler et al. 1997; Harpke and Peterson 2008; Bengtsson-Palme et al. 2013). Substitutions in conserved regions can cause annotation failures due to mismatches with the HMMER profiles used by ITSx (Finn et al. 2011). Additionally, user-defined taxonomic limitations in ITSx's HMMER profiles can help to exclude sequences from non-target organisms, such as fungi.

(2) Adopt strict thresholds for 5.8S: The conserved 5.8S region has been used as a diagnostic tool for decades and is important in post-analytical validation (Jobes and Thien 1997). Strict thresholds are applied to both its intraspecific genetic distance and its sequence length to optimize filter performance.

(3) Employ fixed thresholds for dereplication: capping the number of sequences per species, instead of keeping all unique sequences, avoids inflating ITS sequence diversity by treating rare pseudogenes and functional ITS variants equally. Rare ITS variants from ancient hybridization events, which can occur even across multiple genera, particularly complicate barcoding (Mahelka and Kopecký 2010).

(4) Identify and remove incorrectly annotated plant sequences: adding correctly annotated fungi accounts for non-plant amplification due to low plant-specificity in some ITS primers (Hollingsworth et al. 2011; Kolter and Gemeinholzer 2021b).

(5) Relax intraspecific distance-based filtering: this accounts for the fact that ITS sequences from one specimen can occur as divergent variants, for example, forming non-monophyletic species (Xu et al. 2017). Although more stringent filtering potentially detects more sequences that carry an incorrect taxonomic name, the number of sequences per species, on average, is well below 3, making it difficult to find statistically robust means of excluding sequences (Kolter and Gemeinholzer 2021a). It also cannot be assumed that the majority of sequences of any given species are correct due to widespread systematic errors across multiple studies, such as in the genus Equisetum (Ibi et al. 2022).

8.2 Incorporating plant occurrence information

This workflow includes occurrence data from GBIF but does not generate regional subsets of the database. Instead, we opt for a post-analysis verification step defined by extracting occurrence information from metadata and checking the plausibility of barcoding results manually.

Although the exclusion of plant species absent from a particular study area can raise the success of assigning sequences to a plant species, it has limitations. On average, just 55% of the species in plant lists for European nations possess ITS barcode sequences on GenBank, limiting identification success (Quaresma et al. 2023). Moreover, the completeness of national plant lists is unverifiable, and they often exclude introduced species. For instance, GBIF records fail to include commonly cultivated plants (e.g., doi.org/10.15468/dl.8u8d6h). Contaminating sequences (e.g., Gossypium, Citrus) also might be absent from national reference databases, further complicating identification. For these reasons, restricting a barcode reference database, using potentially incomplete national plant lists in combination with incomplete barcode coverage might result in an increased rate of unverifiable false-positive identifications masquerading as a species-level identification increase.

Despite concerns about the reliability of GBIF occurrence data, many criticisms do not apply here (Zizka et al. 2020). Filters based on inaccurate coordinates, duplicate records, incorrect dates, sea/land discrepancies, urban areas, and unknown collection methods are irrelevant for this study, as the data are aggregated at the country level, and the sampling method is not a concern. Including urban areas is necessary for environmental DNA studies. The only problematic records are herbarium records incorrectly attributed to storing institutions instead of their actual occurrence. Disjunct distributions, such as those combining Asia, the UK, and the USA, should be checked for plausibility. Automatic filters are not feasible, as many invasive species are found in EU or North American countries.

In conclusion, we propose a post-analysis alternative to database subsets, enabling researchers to make informed decisions rather than excluding matches based on unverifiable parameters. Unlike static plant lists, this workflow enables automatic occurrence data retrieval during the execution of the script.

8.3 Clustering and distance thresholds

Clustering of ITS sequences has been reported multiple times (Banchi et al. 2020; Wirta et al. 2021; Namin et al. 2022). However, studies comprehensively analyzing the impact of such clustering on a larger scale are missing. Our analysis shows that clustering with a similarity threshold smaller than 1 (ESV) reduces identification success disproportionately to the reduction of erroneous identifications and should hence be avoided (Figure 3).

The enforcement of distance thresholds for identification, an approach which rejects matches between a query sequence and records in the database when a defined similarity value falls below a threshold, has also been employed, but without a plant-specific study validating this approach (Parveen et al. 2016; Lucek et al. 2019; Bänsch et al. 2020). Our study reveals that the adoption of an identification threshold, using a global plant database, can reduce the misclassification in species-level plant identifications under certain conditions. If all anticipated query species are present in the reference database, there is no need to set a threshold, but it should not be stricter than 0.98 if adopted (Figure 4A). In a more realistic scenario, using a 0.995 threshold on a reference database missing half of the query species, we rejected ~28% of matches, achieving an error rate similar to that of a complete database (Figure 4B). Generally, we considered a threshold beneficial if the number of rejected identifications that would have otherwise scored as successful is not higher than the number of rejected identifications which otherwise would have been erroneous.
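The benefit criterion stated above, together with the full ITS figures reported in the Results for a 0.995 threshold on the expanded query set, can be checked with a few lines of arithmetic. The sketch reads the reported percentages as shares of all evaluated matches, which is the reading under which they sum to the 28.5% total; the function name is hypothetical.

```python
def threshold_is_beneficial(rejected_successful: float, rejected_erroneous: float) -> bool:
    """A threshold is beneficial if it does not reject more successes than errors."""
    return rejected_successful <= rejected_erroneous


# Full ITS, threshold 0.995, all species (percent of all matches, from the Results).
rejected = {"erroneous": 18.0, "successful": 7.5, "ambiguous": 3.0}
error_rate_before = 29.5

total_rejected = sum(rejected.values())                      # 28.5
error_rate_after = error_rate_before - rejected["erroneous"]  # 11.5
beneficial = threshold_is_beneficial(rejected["successful"], rejected["erroneous"])
```

Under this reading the threshold counts as beneficial, since 7.5% rejected successes is well below 18% rejected errors.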

Applying a universal threshold across Tracheophyta introduces errors due to variations in intra- and interspecific distances of ITS sequences (Kolter and Gemeinholzer 2021a). In our study, reclassification based on a 0.995 threshold, using an incomplete database, correctly reclassifies affected sequences in 3 out of 4 cases and introduces false negatives in 1 out of 4 cases (7.5% out of 28%, see Results). Using this fixed threshold is a stochastic trade-off which, on average, improves the identification reliability while not introducing additional false positive errors. For diverse samples, such as environmental DNA, this stochastic approach will most likely not have a disproportionately negative impact. However, we hypothesize that targeted studies, limited to a specific set of taxa, should re-evaluate the threshold proposed here.

9 Conclusions

By coupling sequence information with occurrence data and GenBank metadata, the present script allows researchers to validate their identifications using multiple metrics. The unattended script execution provides a dynamic workflow, ensuring up-to-date reference databases. Our analysis showed that clustering of records leads to a substantial reduction in the success of species assignment, whereas distance-based identification thresholds improved identification accuracy in scenarios involving incomplete reference databases. Future improvements, such as more sophisticated filtering techniques, depend on closing gaps in reference databases and increasing the average number of sequences per species. Retaining the flanking sequence information (18S and 26S) from GenBank makes this workflow suitable for future long-read barcoding reference database creation (e.g., Nanopore).

Author Contributions

Andreas Kolter: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing – original draft, writing – review and editing, visualization; Paul Hebert: resources, writing – review and editing, supervision, project administration, funding acquisition.

Conflicts of Interest

The authors declare no conflicts of interest.

Open Research

Data Availability Statement

Scripts and data used to create the results presented in this publication are available from: https://github.com/Andreas-Bio/ACVPMBD. All output files for all filtering steps and the final reference data files can be downloaded at DOI: 10.5281/zenodo.10257823. Benefits Generated: This research provides benefits through the public release of a curated ITS reference database and a workflow that supports more accurate species identification in plant DNA barcoding and metabarcoding.

Supporting Information

References

Abarenkov, K., A. Zirk, T. Piirmann, et al. 2023. “UNITE general FASTA release for Fungi.” https://doi.org/10.15156/BIO/2938067.
Ankenbrand, M. J., A. Keller, M. Wolf, J. Schultz, and F. Förster. 2015. “ITS2 Database V: Twice as Much.” Molecular Biology and Evolution 32: 3030–3032. https://doi.org/10.1093/molbev/msv174.
Banchi, E., C. G. Ametrano, S. Greco, D. Stanković, L. Muggia, and A. Pallavicini. 2020. “PLANiTS: a curated sequence reference dataset for plant ITS DNA metabarcoding.” Database 2020: baz155. https://doi.org/10.1093/database/baz155.
Bänsch, S., T. Tscharntke, R. Wünschiers, et al. 2020. “Using ITS2 metabarcoding and microscopy to analyse shifts in pollen diets of honey bees and bumble bees along a mass-flowering crop gradient.” Molecular Ecology 29: 5003–5018. https://doi.org/10.1111/mec.15675.
Bengtsson-Palme, J., M. Ryberg, M. Hartmann, et al. 2013. “Improved Software Detection and Extraction of ITS1 and ITS2 from Ribosomal ITS Sequences of Fungi and Other Eukaryotes for Analysis of Environmental Sequencing Data.” Methods in Ecology and Evolution 4: 914–919. https://doi.org/10.1111/2041-210X.12073.
Bruni, I., A. Galimberti, L. Caridi, et al. 2015. “A DNA barcoding approach to identify plant species in multiflower honey.” Food Chemistry 170: 308–315. https://doi.org/10.1016/j.foodchem.2014.08.060.
Bruno, A., A. Sandionigi, G. Agostinetto, et al. 2019. “Food Tracking Perspective: DNA Metabarcoding to Identify Plant Composition in Complex and Processed Food Products.” Genes 10: 248. https://doi.org/10.3390/genes10030248.
Brunson, J. 2020. “ggalluvial: Layered Grammar for Alluvial Plots.” Journal of Open Source Software 5: 2017. https://doi.org/10.21105/joss.02017.
Buckler, E. S., A. Ippolito, and T. P. Holtsford. 1997. “The Evolution of Ribosomal DNA Divergent Paralogues and Phylogenetic Implications.” Genetics 145: 821–832. https://doi.org/10.1093/genetics/145.3.821.
Chamberlain, S. A., and C. Boettiger. 2017. “R Python, and Ruby clients for GBIF species occurrence data (preprint).” PeerJ Preprints 5:e3304v1. https://doi.org/10.7287/peerj.preprints.3304v1.
Cheng, T., C. Xu, L. Lei, C. Li, Y. Zhang, and S. Zhou. 2016. “Barcoding the kingdom Plantae: new PCR primers for ITS regions of plants with improved universality and specificity.” Molecular Ecology Resources 16: 138–149. https://doi.org/10.1111/1755-0998.12438.
China Plant BOL Group, D.-Z. Li, L.-M. Gao, et al. 2011. “Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants.” Proceedings of the National Academy of Sciences 108: 19641–19646. https://doi.org/10.1073/pnas.1104551108.
Curd, E. E., L. Gal, R. Gallego, K. Silliman, S. Nielsen, and Z. Gold. 2024. “rCRUX: A Rapid and Versatile Tool for Generating Metabarcoding Reference libraries in R.” Environmental DNA 6: e489. https://doi.org/10.1002/edn3.489.
Dubois, B., F. Debode, L. Hautier, et al. 2022. “A detailed workflow to develop QIIME2-formatted reference databases for taxonomic analysis of DNA metabarcoding data.” BMC Genomic Data 23: 53. https://doi.org/10.1186/s12863-022-01067-5.
Finn, R. D., J. Clements, and S. R. Eddy. 2011. “HMMER web server: interactive sequence similarity searching.” Nucleic Acids Research 39: W29–W37. https://doi.org/10.1093/nar/gkr367.
Harpke, D., and A. Peterson. 2008. “5.8S motifs for the identification of pseudogenic ITS regions.” Botany 86: 300–305. https://doi.org/10.1139/B07-134.
Hollingsworth, P. M., S. W. Graham, and D. P. Little. 2011. “Choosing and Using a Plant DNA Barcode.” PLoS One 6: e19254. https://doi.org/10.1371/journal.pone.0019254.
Huang, S., K. R. Stoof-Leichsenring, S. Liu, et al. 2021. “Plant Sedimentary Ancient DNA From Far East Russia Covering the Last 28,000 Years Reveals Different Assembly Rules in Cold and Warm Climates.” Frontiers in Ecology and Evolution 9: 763747. https://doi.org/10.3389/fevo.2021.763747.
Ibi, A., M. Du, T. Beuerle, D. Melchert, J. Solnier, and C. Chang. 2022. “A Multi-Pronged Technique for Identifying Equisetum palustre and Equisetum arvense—Combining HPTLC, HPLC-ESI-MS/MS and Optimized DNA Barcoding Techniques.” Plants 11: 2562. https://doi.org/10.3390/plants11192562.
Jeunen, G., E. Dowle, J. Edgecombe, U. Von Ammon, N. J. Gemmell, and H. Cross. 2023. “CRABS—A software program to generate curated reference databases for metabarcoding sequencing data.” Molecular Ecology Resources 23: 725–738. https://doi.org/10.1111/1755-0998.13741.
Jobes, D. V., and L. B. Thien. 1997. “A Conserved Motif in the 5.8S Ribosomal RNA (rRNA) Gene is a Useful Diagnostic Marker for Plant Internal Transcribed Spacer (ITS) Sequences.” Plant Molecular Biology Reporter 15: 326–334. https://doi.org/10.1023/A:1007462330699.
Jones, L., A. D. Twyford, C. R. Ford, et al. 2021. “Barcode UK: A complete DNA barcoding resource for the flowering plants and conifers of the United Kingdom.” Molecular Ecology Resources 21: 2050–2062. https://doi.org/10.1111/1755-0998.13388.
Kartzinel, T. R., P. A. Chen, T. C. Coverdale, et al. 2015. “DNA metabarcoding illuminates dietary niche partitioning by African large herbivores.” Proceedings of the National Academy of Sciences 112: 8019–8024. https://doi.org/10.1073/pnas.1503283112.
Keller, A., S. Hohlfeld, A. Kolter, J. Schultz, B. Gemeinholzer, and M. J. Ankenbrand. 2020. “BCdatabaser: on-the-fly reference database creation for (meta-)barcoding.” Bioinformatics 36: 2630–2631. https://doi.org/10.1093/bioinformatics/btz960.
Kolter, A., and B. Gemeinholzer. 2021a. “Plant DNA barcoding necessitates marker-specific efforts to establish more comprehensive reference databases.” Genome 64: 265–298. https://doi.org/10.1139/gen-2019-0198.
Kolter, A., and B. Gemeinholzer. 2021b. “Internal transcribed spacer primer evaluation for vascular plant metabarcoding.” Metabarcoding Metagenomics 5: e68155. https://doi.org/10.3897/mbmg.5.68155.
Kress, W. J. 2017. “Plant DNA barcodes: Applications today and in the future.” Journal of Systematics and Evolution 55: 291–307. https://doi.org/10.1111/jse.12254.
Krinitsina, A. A., D. O. Omelchenko, A. S. Kasianov, et al. 2023. “Aerobiological Monitoring and Metabarcoding of Grass Pollen.” Plants 12: 2351. https://doi.org/10.3390/plants12122351.
Kuzmina, M. L., T. W. A. Braukmann, A. J. Fazekas, et al. 2017. “Using Herbarium-Derived DNAs to Assemble a Large-Scale DNA Barcode Library for the Vascular Plants of Canada.” Applications in Plant Sciences 5: 1700079. https://doi.org/10.3732/apps.1700079.
Lucek, K., A. Galli, S. Gurten, et al. 2019. “Metabarcoding of honey to assess differences in plant-pollinator interactions between urban and non-urban sites.” Apidologie 50: 317–329. https://doi.org/10.1007/s13592-019-00646-3.
Mahelka, V., and D. Kopecký. 2010. “Gene Capture from across the Grass Family in the Allohexaploid Elymus repens (L.) Gould (Poaceae, Triticeae) as Evidenced by ITS, GBSSI, and Molecular Cytogenetics.” Molecular Biology and Evolution 27: 1370–1390. https://doi.org/10.1093/molbev/msq021.
Namin, S. M., M.-J. Kim, M. Son, and C. Jung. 2022. “Honey DNA metabarcoding revealed foraging resource partitioning between Korean native and introduced honey bees (Hymenoptera: Apidae).” Scientific Reports 12: 14394. https://doi.org/10.1038/s41598-022-18465-5.
O'Leary, N. A., M. W. Wright, J. R. Brister, et al. 2016. “Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.” Nucleic Acids Research 44: D733–D745. https://doi.org/10.1093/nar/gkv1189.
Ondov, B. D., N. H. Bergman, and A. M. Phillippy. 2011. “Interactive metagenomic visualization in a Web browser.” BMC Bioinformatics 12: 385. https://doi.org/10.1186/1471-2105-12-385.
Pagès, H., and P. Aboyoun. 2017. “Biostrings.” https://doi.org/10.18129/B9.BIOC.BIOSTRINGS.
Parveen, I., S. Gafner, N. Techen, S. Murch, and I. Khan. 2016. “DNA Barcoding for the Identification of Botanicals in Herbal Medicine and Dietary Supplements: Strengths and Limitations.” Planta Medica 82: 1225–1235. https://doi.org/10.1055/s-0042-111208.
Quaresma, A., C. A. Yadró Garcia, J. Rufino, et al. 2023. “Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding.” Scientific Data 11: 129. https://doi.org/10.1038/s41597-024-02962-5.
Richardson, R. T., D. B. Sponsler, H. McMinn-Sauder, and R. M. Johnson. 2020. “MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers.” Methods in Ecology and Evolution 11: 181–186. https://doi.org/10.1111/2041-210X.13314.
Rognes, T., T. Flouri, B. Nichols, C. Quince, and F. Mahé. 2016. “VSEARCH: a versatile open source tool for metagenomics.” PeerJ 4: e2584. https://doi.org/10.7717/peerj.2584.
Sayers, E. W., M. Cavanaugh, K. Clark, J. Ostell, K. D. Pruitt, and I. Karsch-Mizrachi. 2019. “GenBank.” Nucleic Acids Research 47: D94–D99. https://doi.org/10.1093/nar/gky989.
GBIF Secretariat. 2023. “GBIF Backbone Taxonomy.” Checklist dataset. https://doi.org/10.15468/39omei. Accessed via GBIF.org on 2025-03-11.
Urumarudappa, S. K. J., C. Tungphatthong, P. Prombutara, and S. Sukrong. 2020. “DNA metabarcoding to unravel plant species composition in selected herbal medicines on the National List of Essential Medicines (NLEM) of Thailand.” Scientific Reports 10: 18259. https://doi.org/10.1038/s41598-020-75305-0.
Ushey, K., and H. Wickham. 2023. “renv: Project Environments.”
Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Use R! Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4.
Winter, D. J. 2017. “rentrez: An R package for the NCBI eUtils API.” R Journal 9: 520. https://doi.org/10.32614/RJ-2017-058.
Wirta, H., N. Abrego, K. Miller, T. Roslin, and E. Vesterinen. 2021. “DNA traces the origin of honey by identifying plants, bacteria and fungi.” Scientific Reports 11: 4798. https://doi.org/10.1038/s41598-021-84174-0.
Xu, B., X.-M. Zeng, X.-F. Gao, D.-P. Jin, and L.-B. Zhang. 2017. “ITS non-concerted evolution and rampant hybridization in the legume genus Lespedeza (Fabaceae).” Scientific Reports 7: 40057. https://doi.org/10.1038/srep40057.
Zhang, J., X. Chi, J. Zhong, et al. 2022. “Extensive nrDNA ITS polymorphism in Lycium: Non-concerted evolution and the identification of pseudogenes.” Frontiers in Plant Science 13: 984579. https://doi.org/10.3389/fpls.2022.984579.
Zizka, A., F. Antunes Carvalho, A. Calvente, et al. 2020. “No one-size-fits-all solution to clean GBIF.” PeerJ 8: e9916. https://doi.org/10.7717/peerj.9916.

Volume 7, Issue 3

May–June 2025

e70125

Automating the Curation of DNA Barcode Databases for Vascular Plants
