Volume 21, Issue 3 pp. 703-720
RESOURCE ARTICLE

MAUI-seq: Metabarcoding using amplicons with unique molecular identifiers to improve error correction

Bryden Fields
Department of Biology, University of York, York, UK

Sara Moeskjær
Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark

Ville-Petri Friman
Department of Biology, University of York, York, UK

Stig U. Andersen (Corresponding Author)
Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark

J. Peter W. Young (Corresponding Author)
Department of Biology, University of York, York, UK

Correspondence

J. Peter W. Young, Department of Biology, University of York, York, UK.

Email: [email protected]

Stig U. Andersen, Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark.

Email: [email protected]

First published: 10 November 2020
Bryden Fields and Sara Moeskjær contributed equally to this work.
Funding information

This work was funded by grant no. 4105-00007A from Innovation Fund Denmark (S.U.A.). Initial development of the method was funded by the EU FP7-KBBE project LEGATO (J.P.W.Y).

Abstract

Sequencing and PCR errors are a major challenge when characterizing genetic diversity using high-throughput amplicon sequencing (HTAS). We have developed a multiplexed HTAS method, MAUI-seq, which uses unique molecular identifiers (UMIs) to improve error correction by exploiting variation among sequences associated with a single UMI. Erroneous sequences are recognized because, across the data set, they are over-represented among the minor sequences associated with UMIs. We show that two main advantages of this approach are efficient elimination of chimeric and other erroneous reads, outperforming dada2 and unoise3, and the ability to confidently recognize genuine alleles that are present at low abundance or resemble chimeras. The method provides sensitive and flexible profiling of diversity and is readily adaptable to most HTAS applications, including microbial 16S rRNA profiling and metabarcoding of environmental DNA.

1 INTRODUCTION

The evaluation of DNA diversity in environmental samples has become a pivotal approach in microbial ecology (Birtel et al., 2015) and is increasingly also used to assess the distribution of larger organisms (Deiner et al., 2017). If a core gene can be amplified from environmental DNA with universal primers, the relative abundance of species in the community can be estimated from the proportions of species-specific variants among the amplicons. High-throughput amplicon sequencing (HTAS), often termed metabarcoding, is a cost-effective way to detect multiple species simultaneously within a range of environmental samples (Elbrecht & Leese, 2015; Fonseca, 2018; Gohl et al., 2016; Krehenwinkel et al., 2018; Poisot et al., 2013; Tessler et al., 2017). While shotgun sequencing of the whole community (metagenomics) can provide a richer description of the functions in a community, HTAS remains a more efficient tool for comparing the species diversity of a large number of community samples. Despite the extensive use of HTAS for interspecies ecological diversity studies, few investigations have utilized HTAS for intraspecies analysis (Kinoti et al., 2017; Poirier et al., 2018). As 16S rRNA amplicons are too highly conserved to estimate microbial within-species diversity, other target gene candidates need to be considered in order to sufficiently discern intraspecies sequence variation.

Many studies have evaluated the extent of PCR-based amplification errors and bias for HTAS diversity studies (Elbrecht & Leese, 2015; Gohl et al., 2016; Kebschull & Zador, 2015; Krehenwinkel et al., 2018). Numerous known PCR biases reduce the accuracy of diversity and abundance estimations, with the major concern being the inability to confidently distinguish PCR errors from natural sequence variation in environmental samples, which is a limiting factor for all studies.

Polymerase errors, production of chimeric sequences and the stochasticity of PCR amplification can be major causes of PCR errors (Edgar, 2016a; Edgar et al., 2011; Kebschull & Zador, 2015). Polymerase errors introduce new sequences into the template population during amplification. These sequence errors include not only substitutions but also insertions and deletions. Chimeric sequences, in contrast, are introduced through the recombination of two partially homologous parent templates (Pääbo et al., 1990). This process predominantly occurs by incomplete template extension, where partially extended sequences anneal to other templates in subsequent PCR cycles, but it is also known to occur through template switching (Kebschull & Zador, 2015; Odelberg et al., 1995). Furthermore, due to the stochastic nature of PCR amplification, not all templates will necessarily undergo replication in every PCR cycle, and this can significantly influence sequence representation of low-copy-number templates (Elbrecht & Leese, 2015). The use of proofreading polymerases, optimized DNA template concentration and reduced PCR cycle number have been suggested to reduce these errors (Gohl et al., 2016; Kebschull & Zador, 2015; Oliver et al., 2015).

To account for the introduction of sequence variants in PCR amplification, several sequence-classification approaches have been established to manage diversity estimates. The most common method is the use of operational taxonomic units (OTUs) in microbial diversity studies which analyse target gene sequences and cluster based on an arbitrary fixed similarity threshold: qiime (Bokulich et al., 2018); uparse (Callahan et al., 2016; Edgar, 2013; Fierer et al., 2017; Huse et al., 2010; Lindahl et al., 2013; Poisot et al., 2013). Within species boundaries this technique could dramatically reduce the resolution of naturally occurring sequence variation.

Most recent methods rely on the formation of sequence groups called amplicon sequence variants (ASVs): dada2 (Callahan et al., 2016); unoise3 (Edgar, 2016b; Fierer et al., 2017). This approach allows sequence resolution down to one nucleotide, which is advantageous for determining intraspecies allelic variation, but noise from PCR errors is also more evident. Variation induced by PCR errors often cannot be differentiated from rare natural allelic variation without the use of sequence denoising methods (Kebschull & Zador, 2015). dada2 relies on a quality-aware parametric error model, which is developed on a per-sequencing run basis. This increases the run time for most data sets compared to unoise3, which uses a one-pass technique (Nearing et al., 2018).

An approach that can reduce sequencing noise is to assign a unique molecular identifier (UMI) to every initial DNA template within an HTAS sample, which also enables evaluation of PCR amplification bias (Lundberg et al., 2013). Early applications addressed the detection of rare mutations (Jabara et al., 2011; Kinde et al., 2011). Additionally, the UMI provides a potential route to address polymerase errors in metabarcoding studies. The UMI is provided by a set of random bases in the gene-specific forward inner primer, which introduces a unique DNA sequence into every initial DNA template upstream of the amplicon region during the first round of amplification. Once all original DNA template strands are assigned a unique UMI, an outer forward primer and the gene-specific reverse primer can be used for further amplification. Consequently, all subsequent DNA amplified from the original template will have the same UMI, so the number of reads amplified from the initial template can be calculated. Grouping sequences by shared UMI allows identification of a consensus, which is assumed to be the correct sequence (Kou et al., 2016; Lundberg et al., 2013). UMI-based pipelines have been developed for various specialized purposes, including error correction and clonotyping of cell populations in immunogenomic studies (Egorov et al., 2015; Johansson et al., 2020; Shugay et al., 2014) and deduplication of RNA-sequencing (RNA-seq) reads (Islam et al., 2014). However, UMIs have previously only been used for a small number of single-amplicon investigations of multispecies communities (Faith et al., 2013; Hoshino & Inagaki, 2017; Johnston-Monje et al., 2016; Lundberg et al., 2013). A recent report evaluates their use with long reads from Nanopore (Karst et al., 2019).

Here, we present a method for metabarcoding using amplicons with unique molecular identifiers to improve error correction—MAUI-seq. The innovative approach is that we use variation among sequences associated with a single UMI to identify erroneous sequences, and we show that this improves error correction compared to non-UMI-based analysis using the state-of-the-art software packages dada2 and unoise3.

2 MATERIALS AND METHODS

2.1 Aim, design and setting

MAUI-seq is an HTAS method designed to assess genetic diversity within or across species, using global UMI-based error rates to detect potential PCR artefacts such as chimeras and single-base substitutions. To evaluate MAUI-seq, we compared its performance against two widely used ASV clustering methods, dada2 and unoise3, on DNA mixtures of two Rhizobium leguminosarum symbiovar trifolii (Rlt) strains to assess accuracy on a set of known sequences, and two sets of environmental samples of white clover root nodules to assess the performance on a complex set of sequences.

2.2 Preparation of DNA mixtures

Two Rlt strains (SM3 and SM170C) were chosen based on their recA, rpoB, nodA and nodD sequence divergence, with a minimum difference of 3 bp in the amplicon region required for each gene. Strains were grown on Tryptone Yeast agar (28°C, 48 hr). The culture was resuspended in 750 µl of Bead Solution from the DNeasy PowerLyzer PowerSoil DNA isolation kit (Qiagen) and DNA was extracted following the manufacturer's instructions. DNA sample concentrations were measured using QUBIT (Thermofisher Scientific). DNA samples of the two strains were diluted to the same concentration and mixed in various ratios (Table S1).

2.3 Preparation of environmental samples

For Field-Samples-1 data, white clover (Trifolium repens) root nodules were collected from two locations: Store Heddinge, Denmark (six plots) and Aarhus University Science Park, Aarhus, Denmark (two plots) (Figure S5). The clover varieties sampled were Klondike (Store Heddinge) and wild white clover (Aarhus). One hundred large pink nodules were collected from each of four points on each plot, making a total of 32 samples. Nodules were stored at −20°C until DNA extraction. Nodule samples were thawed at room temperature and crushed using a sterile homogenizer stick. Crushed nodules were mixed with 750 µl Bead Solution from the DNeasy PowerLyzer PowerSoil DNA isolation kit (Qiagen) and DNA was extracted following the manufacturer's instructions. DNA sample concentrations were measured using a Nanodrop 3300 instrument (Thermofisher Scientific).

For Field-Samples-2 data, root nodules were additionally sampled from 13 white clover conventionally managed field trial plots at Store Heddinge, Denmark (Samples 1A–13A; File S2). All plots were sown under the same conditions in 2017. Three to 10 clover plants were sampled from one point in each plot and 100 large nodules were collected. Nodules were stored at −20°C, and DNA was extracted for each sample using the Qiagen DNeasy PowerLyzer PowerSoil DNA isolation kit, as above. Samples were processed independently with Platinum (nonproofreading) and Phusion (proofreading) polymerases to evaluate the method dependency on polymerase choice, as described in the following sections.

2.4 PCR and purification

Primer sequences were designed for two Rlt housekeeping genes, recombinase A (recA) and RNA polymerase B (rpoB), and for two Rlt-specific symbiosis genes, nodA and nodD (File S1: Table S1).

The three primers are a target-gene forward inner primer, a universal forward outer primer and a target-gene reverse primer. The concentration of the inner forward primer was 100-fold lower than the universal forward outer primer and the reverse primer (Figure 1) in order to reduce the competitiveness of this primer compared to the outer primer. The inner primer is essential for the first round of amplification, but its participation is undesirable in later rounds as it would assign a new unique UMI to an existing amplicon. The PCR mixture and thermocycler programme are provided (File S1: Tables S2 and S3). For all PCRs, a no-template control was included as a negative control to ensure no template contamination occurred.

Figure 1. Primer design and method workflow. (a) Primer design using the sense strand of the target DNA template as an example. The amplicon region of interest should be no longer than 500 bp. The target-gene forward inner primer, universal forward outer primer and the target-gene reverse primer are all used in the initial PCR. The Nextera XT indices provide sample barcodes in a separate PCR step. The unique molecular identifier (UMI) region is shown in turquoise on the target-gene forward inner primer. (b) Sample preparation workflow. (c) MAUI-seq data analysis workflow

PCRs were undertaken individually for each primer set using Platinum Taq DNA polymerase (Thermofisher Scientific; File S1: Table S2) and subsequently pooled and purified using AMPure XP Beads (Beckman Coulter) following the manufacturer's instructions (File S1: Table S5). Successful PCR amplification was confirmed by running a 0.5 × TBE 2% agarose gel at 90 V for 2 hr.

For the DNA mixture samples, PCRs were run in triplicate. DNA from single strains was also processed as a control to determine the level of cross-contamination between samples. Some samples were also amplified using Phusion High-Fidelity polymerase (Thermofisher Scientific), with the PCR programme described in File S1: Tables S2 and S4, to evaluate whether use of a proofreading polymerase improved the quality of the results.

2.5 Nextera indexing for multiplexing and MiSeq sequencing

Samples were indexed for multiplexed sequencing libraries with Nextera XT DNA Library Preparation Kit v2 set A (Illumina) using the Phusion High-Fidelity DNA polymerase (Thermofisher Scientific). The PCR mixture and programme are detailed in File S1: Tables S6 and S7. Indices were added in unique combinations as specified in the manufacturer's instructions (Illumina).

The PCR product was purified on a 0.5 × TBE 1.5% agarose gel and extracted with the QIAQuick gel extraction kit (Qiagen; expected band length: ~454 bp). PCR amplicon concentrations were quantified using a GelAnalyzer2010a and normalized to 10 nM (Lazar, 2010). A pooled sample was quantified and checked for quality by using a Bioanalyzer (Agilent) before sequencing using Illumina MiSeq (2 × 300-bp paired end reads) by the University of York Technology Facility. A detailed protocol is available in File S1.

2.6 Read processing and data analysis

The pear assembler was used to merge paired ends (Zhang et al., 2014). Python scripts were used to separate the merged reads by gene (MAUIsortgenes.py) and to calculate allele frequencies both with and without the use of UMIs (MAUIcount.py). The scripts are available in the GitHub repository https://github.com/jpwyoung/MAUI. Sequences were clustered by UMI, and the number of unique UMIs was counted for each distinct sequence, provided that sequence had at least two more reads with that UMI than any other sequence. In cases where two or more sequences were associated with the same UMI, the second most abundant sequence was noted. Sequences whose number of occurrences as the second sequence exceeded 0.7 times their number of occurrences as the main sequence associated with a UMI were filtered out of the results as putative PCR-induced chimeras or other errors. Sequences with primers removed (ignoring UMIs) were also clustered using dada2 (version 1.8; Callahan et al., 2016) and unoise3 (usearch version 11.0.667; Edgar, 2016b) with default settings. An overall read frequency filter of 0.1% was applied to dada2 and unoise3 outputs to match the filtering applied to sequences accepted by MAUI-seq. All output abundance data are available in File S2. Scripts used for dada2 and unoise3 analyses are available in Files S3–S5. Output abundance data were then processed for statistical analysis and figure generation using various R packages (Files S3–S5; R Core Team, 2015; Wickham, 2016). Principal components were calculated with the “prcomp” function from the R core “stats” package, which uses singular value decomposition, to summarize the Rhizobium diversity and abundance within each subplot sample. Differences in allele frequencies between samples were quantified using Bray–Curtis beta-diversity estimation using the “vegdist” function from the “vegan” R package. PERMANOVA tests were performed using the “adonis” function from the “vegan” R package. The empirical Bayes estimator of FST was calculated using the R package “FinePop” as previously described (Kitada et al., 2017).
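The Bray–Curtis dissimilarity used above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `bray_curtis` and the dictionary input format are assumptions made for illustration, and the formula is the standard one (sum of absolute abundance differences divided by the sum of all abundances).

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two {sequence: abundance} dicts.

    Hypothetical helper for illustration; abundances may be counts or
    relative frequencies.
    """
    alleles = set(sample_a) | set(sample_b)
    # Sum of absolute differences in abundance over every allele seen.
    diff = sum(abs(sample_a.get(s, 0) - sample_b.get(s, 0)) for s in alleles)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return diff / total
```

The value is 0 for identical profiles and 1 for samples that share no alleles, which is what makes it convenient for comparing allele frequency profiles between plots.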

2.7 Analysis of published bacterial 16S metabarcode data with UMIs

Read sequences were downloaded for three samples (sample 1 = P92, 2 = P57, and 3 = P70) from the study by Lundberg et al. (2013). The samples were obtained from Arabidopsis thaliana roots grown in Mason Farm soil, using 515F and 806R variable region 4 (V4) primers for bacterial 16S rRNA genes, with the addition of variable-length UMIs to both forward and reverse primers. These samples were chosen because they had been treated to reduce contamination with plant sequences and had a good number of reads and of UMIs, based on supplementary table 2a of Lundberg et al. (2013). The reads were downloaded from the SRA (sample 1 = SAMEA2173969, 2 = SAMEA2173972, 3 = SAMEA2173975) and forward and reverse reads assembled using pear. MAUIcount.py was modified to handle them using the regular expression r'([A-Z]*)(GAGTGCCAGC[AC]GCCGCGGTAA)([A-Z]*)(ATTAGA[AT]ACCC[CGT][AGT]GTAGTCCGT)([A-Z]*)' to separate each sequence into five parts: forward UMI, 515F primer, amplicon sequence, 806R primer and reverse UMI. Sequences that did not match the primers were discarded, and the two UMI parts were concatenated. MAUIcount then analysed the sequences without further modification, using the default values of read_diff = 2, reject_threshold = 0.7, add_limit = 0.001.
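As a rough illustration of this splitting step, the sketch below applies the regular expression quoted in the text to a merged read. The function name `split_read` and the use of `fullmatch` are assumptions for illustration; the modified MAUIcount.py may apply the pattern differently.

```python
import re

# Pattern quoted in the text:
# forward UMI, 515F primer, amplicon, 806R primer, reverse UMI.
READ_PATTERN = re.compile(
    r'([A-Z]*)(GAGTGCCAGC[AC]GCCGCGGTAA)([A-Z]*)'
    r'(ATTAGA[AT]ACCC[CGT][AGT]GTAGTCCGT)([A-Z]*)'
)


def split_read(read):
    """Return (concatenated UMI, amplicon), or None if the primers are absent.

    Hypothetical helper: assumes the whole read must match the pattern.
    """
    match = READ_PATTERN.fullmatch(read)
    if match is None:
        return None  # read is discarded, as described in the text
    fwd_umi, _fwd_primer, amplicon, _rev_primer, rev_umi = match.groups()
    # The two UMI parts are concatenated into a single identifier.
    return fwd_umi + rev_umi, amplicon
```

Reads for which `split_read` returns `None` correspond to the sequences that did not match the primers and were discarded.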

Forward and reverse reads were also processed separately using similar regular expressions to locate and remove the UMIs and primers, discarding those that did not match the primers, and the resulting read pairs were clustered using dada2 (version 1.8, default settings; Callahan et al., 2016). All output abundance data are available in File S2.

3 RESULTS

3.1 Laboratory protocol: UMI labelling and amplicon multiplexing

We developed a procedure (MAUI-seq) to amplify multiple target genes from environmental samples, while assigning a random UMI to each initial copy of a template. We opted for a straightforward protocol using a “one-pot” initiation and amplification system. Forward primers consist of two modules: an inner primer bearing the UMI and designed to amplify the target gene, and a universal outer primer that binds only to a linker on the inner primer (Figure 1a). We used a 12-base UMI that allowed over 16 million distinct sequences (4^12 = 16,777,216), which is adequate to ensure that duplicate use is negligible for samples with a few thousand sequenced UMIs. For studies with greater sequencing depth, a longer UMI can easily be designed. As a test case, we used MAUI-seq to investigate the genetic diversity of the nitrogen-fixing bacterium Rhizobium leguminosarum symbiovar trifolii (Rlt) by characterizing amplicons from the chromosomal core genes rpoB and recA, and the accessory plasmid-borne nodulation genes nodA and nodD. Each gene was amplified separately in a single reaction, using a target-specific inner forward primer (at low concentration) to assign the UMI and a universal outer primer (at high concentration) to amplify the resulting molecules (Figure 1a). The resulting amplicons were pooled and tagged by Nextera to identify the sample, then further pooled for high-throughput paired-end sequencing (Figure 1b). The full MAUI-seq step-by-step laboratory protocol can be found in File S1.
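The claim that duplicate UMI use is negligible can be checked with a short calculation. The sketch below assumes that every UMI base is drawn uniformly and independently from the four nucleotides, which real primer synthesis only approximates; the function names are hypothetical.

```python
def umi_space(length, alphabet_size=4):
    """Number of distinct UMIs of a given length."""
    return alphabet_size ** length


def expected_shared_pairs(n_templates, length=12):
    """Expected number of template pairs that draw the same UMI by chance.

    Assumes uniformly random, independent UMIs (an idealization).
    """
    return n_templates * (n_templates - 1) / (2 * umi_space(length))
```

Under these assumptions a 12-base UMI gives 16,777,216 possible identifiers, and 5,000 labelled templates are expected to contain about 0.74 accidentally shared UMI pairs, that is, fewer than one collision per sample.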

3.2 Analysis protocol: filtering using UMI-based error rates

The resulting paired-end reads were merged and then separated by gene prior to downstream analysis, where UMIs are critical in two ways. First, sequences are clustered by UMI, and the number of unique UMIs is counted for each distinct sequence, selecting the most abundant sequence associated with each UMI (Figure 1c). UMIs are discarded as ambiguous if the most abundant sequence does not have at least two reads more than the next in abundance. The most abundant sequence will usually be the correct one (Figure 2a, Case 1) but, because most UMIs are represented by just a small number of reads, an erroneous sequence can sometimes be sampled more often than the true sequence, so the primary sequence of the UMI becomes this erroneous sequence (Figure 2a, Case 2). Second, we reasoned that it may be possible to eliminate these errors by using the UMIs to provide information on global error rates across all samples. We implemented this in MAUI-seq by noting both the most abundant (primary) and the second most abundant (secondary) sequence if two or more sequences were associated with the same UMI. MAUI-seq then distinguishes between true and erroneous sequences based on the ratio of secondary to primary occurrences of each sequence, eliminating sequences for which this ratio exceeds a threshold (default is 0.7) (Figure 1c and Figure 2b). The 0.7 threshold was chosen empirically, based on the ratios observed for known true and erroneous sequences, but it is a compromise because the incidence of secondary sequences varies across genes and studies. An examination of the results may suggest choosing different thresholds in other studies. Finally, globally rare sequences are discarded (default threshold is 0.1% averaged across samples; a lower threshold could be used if samples were sequenced to a greater depth). Python scripts for separating the genes and for the UMI analysis are available at https://github.com/jpwyoung/MAUI.
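The filtering logic described above can be summarized in a minimal Python sketch. This is not the MAUIcount.py script: the function name `maui_filter` and the input format are hypothetical, the parameter names simply mirror the defaults quoted in the methods, and several simplifications are assumed (a single pooled data set rather than per-sample averaging, a missing second sequence treated as zero reads, and secondary sequences noted only for UMIs that pass the read-difference test).

```python
from collections import Counter, defaultdict


def maui_filter(reads, read_diff=2, reject_threshold=0.7, add_limit=0.001):
    """Sketch of UMI-based filtering; reads is an iterable of (umi, sequence)."""
    # Count how many reads support each sequence within each UMI.
    per_umi = defaultdict(Counter)
    for umi, seq in reads:
        per_umi[umi][seq] += 1

    primary = Counter()    # UMIs in which a sequence is the most abundant
    secondary = Counter()  # UMIs in which a sequence is second most abundant
    for counts in per_umi.values():
        ranked = counts.most_common(2)
        top_seq, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        # Discard the UMI as ambiguous unless the lead is at least read_diff.
        if top_n - second_n < read_diff:
            continue
        primary[top_seq] += 1
        if len(ranked) > 1:
            secondary[ranked[1][0]] += 1

    total = sum(primary.values())
    accepted = {}
    for seq, n_primary in primary.items():
        if n_primary < add_limit * total:
            continue  # globally rare sequence
        if secondary[seq] > reject_threshold * n_primary:
            continue  # seen too often as a secondary sequence: likely an error
        accepted[seq] = n_primary
    return accepted
```

The returned dictionary maps each accepted sequence to the number of UMIs for which it is the primary sequence, which is the count used to estimate relative abundance. The key design point is that the secondary/primary ratio is computed globally, so a chimera that occasionally wins within one UMI is still recognized by how often it loses elsewhere.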

Figure 2. Erroneous read formation and filtering. (a) Schematic showing the formation of different sequences with identical UMIs, and bias introduced when sampling for sequencing. (b) Example data showing the occurrence of real and chimeric rpoB sequences as primary and secondary sequence (log scale). S1 and S2: real sequences derived from two different rhizobium strains (SM170C and SM3). Chi1–4: chimeric sequences

3.3 Validation using purified DNA mixed in known proportions

We first evaluated the accuracy of MAUI-seq by profiling DNA mixtures with known strain DNA ratios. DNA was extracted from two Rlt strains differing by a minimum of 3 bp in each of their recA, rpoB, nodA and nodD amplicon sequences, and the extracted DNA was mixed in different ratios (Table S1). After amplification and sequencing, assembled reads were assigned to their target gene and analysed using MAUI-seq and two programs frequently used for denoising of amplicon sequencing data, dada2 and unoise3 (Callahan et al., 2016; Edgar, 2016b). Because rare sequences have a high error rate, we discarded (for each of the three methods) sequences that fell below a threshold frequency of 0.1% of accepted sequences. The observed and expected strain ratios were highly correlated for all four genes across the three analysis methods, and we found that the performances of the proofreading (Phusion) and nonproofreading (Platinum) polymerases were gene-dependent, which could be due to differences in amplification efficiency for the four templates (Table 1; Figures S1–S4). MAUI-seq detected 98.5%–100% true sequences exactly matching those of the two strains in the mixture, while dada2 ranged from 89.7% to 100%, and unoise3 from 79.8% to 100% (Table 1). The better performance of MAUI-seq was due to more effective elimination of chimeras, which were especially abundant when the PCR was carried out using the Platinum nonproofreading polymerase (Table 1; Figures S1–S4). For the proofreading polymerase, dada2 detected 100% true sequences for all four genes, whereas MAUI-seq detected 99.03% for nodA, failing to eliminate three rare sequences that did not have sufficient secondary counts. This suggests that dada2 can perform as well as or even slightly better than MAUI-seq when a proofreading polymerase is used to amplify DNA from a simple, two-component mix. The prevalence of secondary sequences varied with gene and polymerase: the secondary/primary ratio for accepted sequences was 0.0322 for rpoB using Phusion, but just 0.0002 for nodD using Platinum. When the ratio was very low, there were insufficient secondary counts for MAUI-seq to eliminate erroneous sequences effectively.

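Table 1. Total number of detected sequences in the synthetic mix samples using MAUI-seq, dada2 and unoise3

| Gene | Measure | Platinum MAUI-seq | Platinum dada2 | Platinum unoise3 | Phusion MAUI-seq | Phusion dada2 | Phusion unoise3 | Exp. seq |
|---|---|---|---|---|---|---|---|---|
| rpoB | n seq | 2 | 3 | 4 | 2 | 2 | 2 | 2 |
| | %true | 100 | 96.96 | 93.80 | 100 | 100 | 100 | |
| | Cor.exp/obs | 0.956 | 0.977 | 0.981 | 0.996 | 0.999 | 0.9998 | |
| | chim.freq | 0 | 0.07 | 0.13 | 0 | 0 | 0 | |
| recA | n seq | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | %true | 100 | 100 | 100 | 100 | 100 | 100 | |
| | Cor.exp/obs | 0.984 | 0.991 | 0.989 | 0.948 | 0.952 | 0.947 | |
| | chim.freq | 0 | 0 | 0 | 0 | 0 | 0 | |
| nodA | n seq | 6 | 5 | 4 | 5 | 2 | 4 | 2 |
| | %true | 99.04 | 89.70 | 89.93 | 99.03 | 100 | 90.43 | |
| | Cor.exp/obs | 0.985 | 0.998 | 0.999 | 0.989 | 0.999 | 0.999 | |
| | chim.freq | 0.10 | 0.25 | 0.22 | 0.04 | 0 | 0.16 | |
| nodD | n seq | 7 | 6 | 21 | 3 | 3 | 14 | 3 |
| | %true | 98.49 | 93.93 | 90.10 | 100 | 100 | 79.83 | |
| | Cor.exp/obs | 0.998 | 0.998 | 0.995 | 0.990 | 0.998 | 0.995 | |
| | chim.freq | 0.05 | 0.05 | 0.13 | 0 | 0 | 0.11 | |
| All | %true-overall | 99.76 | 93.73 | 91.93 | 99.74 | 100 | 91.71 | |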

Note

  • The percentage of true sequences is averaged over 23 samples for Platinum (nonproofreading) and 14 samples for Phusion (proofreading).
  • n seq is the total number of sequences occurring across all samples. %true is calculated by dividing the number of counts for the true sequences by the total number of counts accepted by the method. Cor.exp/obs is the Pearson correlation for the observed proportion of SM170C reads versus the expected proportion. chim.freq is the proportion of chimeras compared to total reads at 0.5 expected proportion of sequences. Exp. seq is the expected number of detected sequences. %true-overall is based on summed counts for all four genes.
  • a SM170C has a second copy of nodD (Cavassim et al., 2020).

To illustrate the effects of the various MAUI-seq stages, we take the example of the Platinum data for rpoB (the top panel in Table 1). There were 1,424,336 assembled read-pairs, leading to 56,176 valid UMIs (with at least two more reads for the primary sequence than for any other). These UMIs had 711 distinct primary sequences, but only six of these reached the default 0.001 abundance threshold. Of these six, all except two (the known input sequences) were rejected as erroneous (secondary/primary count > 0.7), so the method worked perfectly in this instance. If the abundance threshold was lowered to 0.0001, an additional 66 sequences were included, but all but two of these were rejected as erroneous. The two that were not rejected, but should have been, did not quite meet the criterion (secondary/primary counts were 4/6 and 5/9), illustrating that the method is less effective for rare sequences with low counts.
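The two borderline cases can be checked directly against the default threshold. The snippet below is only a worked illustration of the arithmetic; `is_rejected` is a hypothetical helper, not a function from the MAUI scripts.

```python
REJECT_THRESHOLD = 0.7  # default secondary/primary cut-off quoted in the text


def is_rejected(secondary, primary, threshold=REJECT_THRESHOLD):
    """True if a sequence occurs too often as a secondary sequence."""
    return secondary > threshold * primary


# Secondary/primary counts of the two erroneous sequences that escaped.
borderline = [(4, 6), (5, 9)]
escaped = [pair for pair in borderline if not is_rejected(*pair)]
```

Both pairs fall just under the cut-off (4/6 is about 0.67 and 5/9 about 0.56), and with counts this small a single extra secondary occurrence, for example 5/6, would have tipped the first sequence over the threshold.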

3.4 Validation using environmental samples

To test the method on more complex samples, we compared Rlt populations in root nodules from two locations in Denmark: a clover trial station in Store Heddinge on Zealand and a lawn at Aarhus University in Jutland (the Field-Samples-1 data set; Figure S5). One hundred nodules were pooled for each sample and each plot was sampled in four replicates, resulting in 8 plots × 4 replicates × 100 nodules amounting to 3,200 nodules in total. Platinum Taq polymerase enzyme (nonproofreading) was used for amplification. Each clover root nodule is usually colonized by a single Rhizobium strain, so a maximum of 100 unique sequences per gene is expected per sample.

For Field-Samples-1, the total number of distinct sequences for MAUI-seq and dada2 were in the same range as the number of distinct alleles observed in a population of 196 natural European Rlt isolates (Cavassim et al., 2020) (Table 2). In contrast, unoise3 produced a substantially higher number of distinct sequences, suggesting that its default filtering might be too lenient for our data (Table 2). The sequences accepted as true by MAUI-seq were nearly all also included in the dada2 and unoise3 outputs (Figure 3). On the other hand, dada2 and unoise3 both accepted a number of sequences that were filtered out by MAUI-seq, and many of these were eliminated by MAUI-seq because a high ratio of secondary to primary occurrences strongly suggested that they represent errors and not real sequences (Figure 3; File S2). To provide independent evidence as to whether sequences were likely to be genuine, we checked whether they matched (or differed by a single nucleotide from) known sequences in either a reference database of 196 natural European Rlt isolates (Cavassim et al., 2020), or the NCBI whole-genome shotgun database (Figure 3). Of the sequences rejected by MAUI-seq, 93% (108/116) did not have exact matches to known sequences. A few sequences that exactly matched known alleles were included by dada2 and unoise3 (seven and five sequences, respectively), but not by MAUI-seq. These sequences were not reported by MAUI-seq because their UMI counts were below the abundance threshold, not because the secondary/primary occurrence filter identified them as erroneous (Figure 3). The count threshold could be lowered to include rarer sequences, if the study required it.

Table 2. Total number of detected sequence clusters in root nodule samples (Field-Samples-1) using MAUI-seq, dada2 and unoise3 clustering and genetic differentiation between populations
Gene Method Detected sequence clusters F ST
Total Reference Exact blast Single nt Other
rpoB MAUI-seq 12 7 3 1 1 0.032
dada2 15 7 3 3 2 0.032
unoise3 30 7 2 7 14 0.012
Total reference 13
recA MAUI-seq 8 6 2 0.110
dada2 13 8 2 3 0.090
unoise3 14 5 2 2 5 0.028
Total reference 17
nodA MAUI-seq 9 8 1 0.369
dada2 18 12 1 1 4 0.191
unoise3 43 13 5 25 0.061
Total reference 14
nodD MAUI-seq 18 11 1 2 4 0.139
dada2 22 11 1 3 7 0.124
unoise3 57 11 1 4 41 0.031
Total reference 16
All genes MAUI-seq 47 32 6 4 5 0.139
dada2 68 38 7 10 13 0.105
unoise3 144 36 5 18 85 0.032
  • a Output sequences were classified into reference (100% identity in at least one of 196 Rhizobium leguminosarum symbiovar trifolii genomes; Cavassim et al., 2020)), exact blast (100% query coverage and 100% identity against the whole-genome shotgun contigs blast database), single nt (1 nt difference from either reference or exact blast match), and other. Total reference is the total number of detected sequences in the 196 Rhizobium leguminosarum symbiovar trifolii genomes (Cavassim et al., 2020).
  • b The population global FST (fixation index) is an estimate of genetic differentiation among populations based on relative allele abundance.
  • c Sequence types in 196 reference genomes (Cavassim et al., 2020).
Amplicon diversity reported by MAUI-seq compared with the dada2 and unoise3 analysis pipelines. Data are for four genes from nodule samples from two geographical locations, Store Heddinge (1–6) and Aarhus (7–8). Letters A–D denote the replicates within each plot (Figure S5). Heatmap of the log10 transformed relative allele abundance of sequence clusters for individual genes. Lines connect identical sequences found by different clustering methods. Evidence that sequences are likely to be genuine is denoted by classifying them as reference (100% identity in at least one of 196 Rhizobium leguminosarum symbiovar trifolii genomes; Cavassim et al., 2020), exact blast (100% query coverage and 100% identity against the whole-genome shotgun contigs blast database), single nt (1 nt difference from either reference or exact blast match) and other. Sequences not reported by MAUI-seq were classified as sec/pri ratio (rejected as erroneous because of a high secondary to primary ratio), low UMI count (not reported because too rare) and not found by MAUI (no accepted UMIs) [Colour figure can be viewed at wileyonlinelibrary.com]

The allele frequency distributions were different at Aarhus and Store Heddinge (Figure 3), and the two sites were clearly separated by the first principal component in a principal component analysis (PCA) for MAUI-seq, dada2 and unoise3 sequences (Figure 4; Figures S6–S8). The amplicon sequencing has sufficient resolution to characterize geospatial variation in allele frequencies. For example, MAUI-seq, dada2 and unoise3 can all clearly identify several highly abundant sequences from one location that are either absent or present at very low frequency in samples from the other location (Figure 3). To quantify the genetic differentiation between the Aarhus and Store Heddinge sites, we calculated fixation indices (FST). Considering all four target genes combined, the MAUI-seq output resulted in the highest FST value followed by dada2 and unoise3 (Table 2, Figure 4; Figures S9–S11). For each individual gene, the MAUI-seq FST estimate was at least as high as those of the other two methods (for rpoB, MAUI-seq and dada2 gave the same value), and the differences were especially pronounced for nodA, which also showed the highest overall level of differentiation (Table 2; Figures S9–S11). The lower genetic differentiation estimated based on dada2 and unoise3 results, compared to those of MAUI-seq, reflects the inclusion of an increased number of erroneous sequences, which are less differentiated between the two sampled sites than the real sequences (Figure 3).
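
The dilution effect described above can be illustrated with a toy calculation. The sketch below assumes a simple Nei-style estimator for two equally weighted populations, FST = (HT − HS)/HT, computed from relative allele abundances. The text does not state which estimator was used, so the function `fst_two_pops`, the allele names and the counts are all hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity (gene diversity): 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def fst_two_pops(pop_a, pop_b):
    """Nei-style FST = (HT - HS) / HT for two equally weighted populations.

    pop_a, pop_b: dicts mapping allele -> count (or relative abundance).
    The choice of estimator is an assumption made for illustration only.
    """
    alleles = sorted(set(pop_a) | set(pop_b))
    tot_a, tot_b = sum(pop_a.values()), sum(pop_b.values())
    fa = [pop_a.get(x, 0) / tot_a for x in alleles]
    fb = [pop_b.get(x, 0) / tot_b for x in alleles]
    h_s = (heterozygosity(fa) + heterozygosity(fb)) / 2  # mean within-site diversity
    pooled = [(x + y) / 2 for x, y in zip(fa, fb)]
    h_t = heterozygosity(pooled)                         # diversity of pooled sites
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Toy data: two genuine alleles with contrasting frequencies at two sites.
site_1 = {"allele_1": 80, "allele_2": 20}
site_2 = {"allele_1": 20, "allele_2": 80}
clean = fst_two_pops(site_1, site_2)

# The same hypothetical erroneous sequence added in equal amounts at both sites.
noisy = fst_two_pops({**site_1, "error": 30}, {**site_2, "error": 30})

print(f"FST without errors:     {clean:.3f}")
print(f"FST with shared errors: {noisy:.3f}")
```

In this toy case the shared erroneous sequence contributes diversity within each site but no difference between sites, so the estimate falls. This is the direction of the effect seen for dada2 and unoise3 in Table 2.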

Genetic differentiation between populations visualized by principal component analysis (PCA) (a–c) and FST (d–f) of Rlt diversity in root nodule samples (eight sites, four replicates). Three analysis pipelines are compared: MAUI-seq (a,d), dada2 (b, e) and unoise3 (c, f). The PCA was based on log10 transformed relative allele abundance. FST analysis was based on relative allele abundance. Data from all four genes (rpoB, recA, nodA, and nodD) were included in the analysis [Colour figure can be viewed at wileyonlinelibrary.com]

Because it was clear from the DNA mixture experiment that the choice of DNA polymerase could significantly affect error rates, we sampled root nodules from 13 additional clover field plots (the Field-Samples-2 data set) and amplified each sample (a pool of 100 root nodules) using Platinum and Phusion polymerases in parallel. For samples amplified using Platinum, MAUI-seq detected fewer sequences than dada2 and unoise3 for the two core genes, rpoB and recA, but the same number of reference sequences were detected (Table 3). dada2 included two chimeric sequences that were filtered out by MAUI-seq due to a high ratio of secondary to primary occurrences (File S2). unoise3 detected twice as many sequences as dada2 and MAUI-seq for the accessory genes, nodA and nodD, but most of the additional sequences had no associated UMIs and were classified as “other” (Table 3; File S2). For samples amplified using Phusion, MAUI-seq and dada2 detected a similar number of sequences (Table 3). All nine unoise3 rpoB sequences that were not accepted by either MAUI-seq or dada2 (File S2) are putative chimeric sequences with two parental sequences of higher abundance. For nodA, MAUI-seq includes four sequences that have a single, synonymous nucleotide difference from a reference sequence, but all have a good ratio of secondary to primary reads, so we hypothesize that these are true sequences. Some reference or exact blast hit sequences were included by dada2 but not by MAUI-seq because their abundance was estimated by dada2 to be above the 0.001 threshold, but MAUI-seq estimated that they were rarer.

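Table 3. The effect of polymerase choice

| Gene | Class (a) | Platinum: MAUI-seq | Platinum: dada2 | Platinum: unoise3 | Phusion: MAUI-seq | Phusion: dada2 | Phusion: unoise3 |
|---|---|---|---|---|---|---|---|
| rpoB | Total | 16 | 24 | 26 | 15 | 15 | 20 |
| rpoB | Reference | 9 | 9 | 7 | 8 | 9 | 7 |
| rpoB | Exact blast | 3 | 3 | 2 | 3 | 3 | 2 |
| rpoB | Single nt | 3 | 7 | 8 | 3 | 2 | 5 |
| rpoB | Other | 1 | 5 | 9 | 1 | 1 | 6 |
| recA | Total | 9 | 10 | 12 | 8 | 9 | 10 |
| recA | Reference | 5 | 5 | 4 | 5 | 5 | 4 |
| recA | Exact blast | 0 | 1 | 1 | 0 | 1 | 1 |
| recA | Single nt | 3 | 3 | 3 | 3 | 2 | 3 |
| recA | Other | 1 | 1 | 4 | 0 | 1 | 2 |
| nodA | Total | 18 | 14 | 35 | 17 | 11 | 34 |
| nodA | Reference | 7 | 10 | 8 | 9 | 9 | 9 |
| nodA | Exact blast | 0 | 1 | 0 | 0 | 0 | 0 |
| nodA | Single nt | 6 | 1 | 4 | 6 | 1 | 4 |
| nodA | Other | 5 | 2 | 22 | 2 | 1 | 21 |
| nodD | Total | 20 | 17 | 46 | 27 | 24 | 71 |
| nodD | Reference | 10 | 12 | 12 | 16 | 16 | 15 |
| nodD | Exact blast | 0 | 0 | 0 | 0 | 0 | 0 |
| nodD | Single nt | 6 | 3 | 6 | 5 | 4 | 6 |
| nodD | Other | 4 | 2 | 28 | 6 | 3 | 50 |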

Note

  • The total number of detected sequence clusters in root nodule samples (Field-Samples-2) amplified using Phusion (proofreading) or Platinum (nonproofreading) polymerases. Sequences were clustered using MAUI-seq, dada2 and unoise3.
  • a Output sequences were classified into reference (100% identity in at least one of 196 Rhizobium leguminosarum symbiovar trifolii genomes; Cavassim et al., 2020), exact blast (100% query coverage and 100% identity against the whole-genome shotgun contigs blast database), single nt (1 nt difference from either reference or exact blast match), and other.

Both MAUI-seq and dada2 identify and remove sequences that appear to be errors (base substitutions or chimeras), but they use completely different evidence. As a result, they do not always make the same decision, as illustrated for a small set of representative data in Table 4 (the rpoB sequences amplified by Phusion). While dada2 examines the sequences and rejects those that are likely to be generated from more abundant sequences in the sample, MAUI-seq does not use the actual sequence but bases decisions on how frequently a sequence occurs as a secondary sequence with the same UMI as another (primary) sequence. Sequences ranked 5 and 6 (Table 4) are both potential chimeras of the more abundant sequences 1–4. Both dada2 and MAUI-seq reject sequence 6 and accept sequence 5. Sequence 6 has a secondary/primary ratio of 103/118, which is above the default threshold of 0.7, so MAUI-seq rejects it as a probable error. On the other hand, the ratio for sequence 5 is 71/229. This is well below the threshold, but it is higher than other sequences with a similar primary count, for example sequence 9 (15/270). A possible explanation is that some of the reads for sequence 5 are generated as chimeras but others are genuine, as it is entirely plausible that new alleles are generated by recombination between existing alleles. To some extent, MAUI-seq compensates for this because it allocates sequence 5 a relatively low count and hence lower ranking (8) than it has in the raw reads or the dada2 analysis. There are two further sequences, 10 and 29, that are rejected by dada2 as potential chimeras but accepted by MAUI-seq (File S2 Field-Samples-2-phusion-rpoB); in both cases they have secondary sequence counts well below the threshold, so MAUI-seq accepts them as genuine. dada2 included an rpoB sequence that does not have any associated UMIs (sequence 41), and appears to be a chimera of two more abundant sequences (sequence 3/4/5 and sequence 11) (Table 4). 
MAUI-seq counts UMIs, not individual reads, and the default setting is to require that the primary sequence has at least two more reads than the next most frequent sequence (if any) that has the same UMI. This enriches for genuine sequences, which are generally more abundant than errors, but it means, of course, that the number of counts is much lower than the number of reads. In fact, for this particular set of data, the number of UMIs is roughly an order of magnitude smaller than either the raw read count or the dada2 count (Table 4), although still sufficient to provide good estimates of the relative abundance of the sequences that make up the bulk of the population. The main reason for the low UMI count is that the number of reads per UMI was suboptimal in these data for the rpoB gene: only 18% of the UMIs had more than one read, and MAUI-seq discards single-read UMIs by default. By contrast, in the equivalent data for the recA gene in the same study (File S2 Field-Samples-2-phusion-recA), 37.5% of UMIs had more than one read, making more effective use of the available sequence reads.
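
The accept/reject decisions discussed above can be checked directly from the UMI counts in Table 4. The short sketch below uses only the default 0.7 threshold quoted in the text. The function name `sec_pri_ratio` and the dictionary layout are illustrative and are not taken from MAUIcount.py.

```python
SEC_PRI_THRESHOLD = 0.7  # default secondary/primary ratio quoted in the text

# (UMI primary count, UMI secondary count), keyed by raw-read rank in Table 4
table4 = {5: (229, 71), 6: (118, 103), 9: (270, 15), 10: (62, 15), 29: (42, 10)}


def sec_pri_ratio(primary, secondary):
    """Ratio of secondary to primary occurrences for one sequence."""
    return secondary / primary


for rank, (primary, secondary) in table4.items():
    ratio = sec_pri_ratio(primary, secondary)
    # A sequence is kept only if it occurs less than 0.7 times as often
    # as a secondary sequence as it does as a primary sequence.
    verdict = "reject" if ratio >= SEC_PRI_THRESHOLD else "accept"
    print(f"sequence {rank}: {secondary}/{primary} = {ratio:.2f} -> {verdict}")
```

Only sequence 6 exceeds the threshold. Sequence 5 passes, but its ratio is several times higher than that of sequence 9, which has a similar primary count.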

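Table 4. A comparison between dada2 and MAUI-seq for a subset of the Field-Samples-2 data summarized in Table 3: the rpoB sequences from samples amplified by Phusion (proofreading) polymerase [Color table can be viewed at wileyonlinelibrary.com]

| Raw reads: Rank | Raw reads: Count | MAUI-seq: Rank | MAUI-seq: UMI primary count | MAUI-seq: UMI secondary count | MAUI-seq: Accepted | dada2: Rank | dada2: Count | dada2: Accepted |
|---|---|---|---|---|---|---|---|---|
| 1 | 99,431 | 1 | 7,459 | 197 | yes | 1 | 54,758 | yes |
| 2 | 86,751 | 2 | 7,067 | 155 | yes | 2 | 48,402 | yes |
| 3 | 70,318 | 3 | 3,668 | 95 | yes | 3 | 44,412 | yes |
| 4 | 47,337 | 4 | 1,898 | 106 | yes | 4 | 28,339 | yes |
| 5 | 13,190 | 8 | 229 | 71 | yes | 5 | 7,854 | yes |
| 6 | 11,786 | 9 | 118 | 103 | no | none | NA | no |
| 7 | 10,490 | 5 | 489 | 19 | yes | 6 | 6,009 | yes |
| 8 | 9,630 | 6 | 362 | 13 | yes | 7 | 5,414 | yes |
| 9 | 4,738 | 7 | 270 | 15 | yes | 8 | 2,757 | yes |
| 10 | 4,290 | 12 | 62 | 15 | yes | none | NA | no |
| 11 | 3,223 | 11 | 90 | 3 | yes | 9 | 2,041 | yes |
| 20 | 1,950 | 10 | 96 | 6 | yes | 10 | 981 | yes |
| 29 | 1,504 | 13 | 42 | 10 | yes | none | NA | no |
| 39 | 1,063 | 14 | 35 | 2 | yes | 12 | 618 | yes |
| 41 | 946 | none | 0 | 0 | | 11 | 721 | yes |
| 43 | 826 | 15 | 34 | 0 | yes | 13 | 434 | yes |
| 51 | 567 | 16 | 22 | 3 | yes | 14 | 341 | yes |
| 63 | 415 | 24 | 7 | 0 | (yes) | 15 | 208 | yes |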

3.5 Analysis of published bacterial 16S metabarcode data with UMIs

To demonstrate that the MAUI-seq method can be applied to studies of bacterial community diversity using universal primers that target 16S rRNA genes, we analysed data from Lundberg et al. (2013), which describes the bacterial community on Arabidopsis thaliana roots growing in soil. We chose three samples that had good coverage and a good number of UMIs, and modified the MAUIcount.py program to handle the variable-length split UMIs (see Methods). With our standard settings, MAUI-seq identified 145 sequences, while dada2 identified 125 sequences (Figure 5a,b; File S2). The overlap between sequences identified by the two methods was 106 sequences, which includes most of the highly abundant sequences. Of the low-abundance MAUI sequences, 13 were identified by dada2, but removed due to the 0.1% cut-off. An additional 19 sequences were sufficiently abundant, but were flagged as probable errors by MAUI-seq because they occurred as secondary sequences more than 0.7 times as often as they occurred as primary sequences (Figure 5c). All 19 of these sequences were close to abundant sequences but had undetermined bases represented by “N,” which were absent from the remaining sequences. It should be emphasized that MAUI-seq does not directly examine sequence quality or the presence of “Ns,” but recognized these sequences by their distribution across UMIs. dada2 does not include any of these 19 sequences, as the default settings do not allow reads containing “Ns,” meaning they are filtered out prior to analysis with dada2. The incidence of “Ns” was much higher in these 16S sequences than in our Rhizobium data, where they were never found in sequences above the abundance threshold, but on the other hand, chimeras were not evident in the 16S data, where all abundant sequences had > 99% identity to 16S sequences from bacterial isolates (or, in a few cases, eukaryotic mitochondria). dada2 removes seven accepted MAUI-seq sequences as “bimeras” (Figure 5b). These sequences appear to be genuine as they all have an exact blast hit from a cultured bacterial sample and a low secondary to primary UMI count ratio (Figure 5c).

16S amplicon diversity reported by MAUI-seq compared with the dada2 analysis pipeline. (a, b) Heatmap of the log10 transformed relative allele abundance of sequence clusters. Lines connect identical sequences found by different clustering methods. Evidence that sequences are likely to be genuine is denoted by classifying them as exact blast (100% query coverage and 100% identity against the whole-genome shotgun contigs blast database), single nt (1 nt difference from either reference or exact blast match) and other. Sequences not reported by MAUI were classified as low UMI count (not reported because too rare). Sequences not identified by dada2 after “bimera” filtering were classified as dada2 bimera. (c) Secondary/primary UMI count ratio for 16S sequences identified by MAUI-seq above the 0.001 threshold. Sequences with a sec/pri ratio of >0.7 were deemed to be erroneous [Colour figure can be viewed at wileyonlinelibrary.com]

4 DISCUSSION

We propose a new HTAS method (MAUI-seq) designed to assess genetic diversity within or across species. It uses global UMI-based error rates to detect potential PCR artefacts such as chimeras and single-base substitutions more robustly than the widely used ASV clustering methods, dada2 and unoise3. The approach is applicable to any study of amplicon diversity, including community diversity estimates based on 16S rRNA and other metabarcoding surveys using environmental DNA.

4.1 Using UMIs to filter out chimeras and other errors

In the MAUI-seq approach, UMIs are used to reduce errors in two distinct ways. Because all reads with the same UMI should, in principle, be derived from the same initial template copy, any variation among them reflects errors. In some implementations, a consensus sequence is calculated (Kou et al., 2016), but we adopt the simpler approach of accepting the most abundant sequence, which will usually give the same result. Requiring more than one identical read before accepting a UMI creates an important quality filter that greatly reduces the number of rare (and usually erroneous) sequences, but as more reads are required, an increasing number of the original reads are discarded and the number of accepted counts declines. To strike a balance between quantity and quality, we chose to count a sequence provided it had at least two more reads than the next most frequent sequence with the same UMI, but this threshold could be adjusted if, for example, a markedly larger number of reads were available.
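
The per-UMI rule described above can be written as a minimal sketch. It is not the MAUIcount.py implementation: the function `call_primary` and its `min_margin` parameter are hypothetical names, and the sketch assumes that no secondary sequence is recorded for a UMI that is not counted.

```python
from collections import Counter


def call_primary(reads, min_margin=2):
    """Return (primary, secondary) for the reads that share one UMI.

    The UMI is counted only if its most frequent sequence has at least
    `min_margin` more reads than the runner-up; otherwise (None, None)
    is returned. `secondary` is the runner-up sequence, or None when
    all reads of the UMI are identical.
    """
    if not reads:
        return None, None
    ranked = Counter(reads).most_common()
    top_seq, top_n = ranked[0]
    second_seq, second_n = ranked[1] if len(ranked) > 1 else (None, 0)
    if top_n - second_n < min_margin:
        return None, None  # UMI discarded: no clear majority sequence
    return top_seq, second_seq


print(call_primary(["ACGT"]))                          # single read
print(call_primary(["ACGT", "ACGT"]))                  # two identical reads
print(call_primary(["ACGT", "ACGT", "ACGA"]))          # margin of one
print(call_primary(["ACGT", "ACGT", "ACGT", "ACGA"]))  # margin of two
```

With a margin of two, a single-read UMI is never counted, two identical reads are enough for a UMI to be counted, and at least four reads are needed before a counted UMI can also carry a secondary sequence.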

While the most abundant sequence associated with a UMI will usually be the correct one, sometimes an erroneous sequence will predominate among the small number of reads actually sequenced, leading to these sequences being included among the recorded counts. These errors can be detected, however, by aggregating information across the whole set of samples. When a UMI is associated with more than one sequence, the secondary sequences are most often erroneous, so sequences that are relatively more abundant as secondary sequences than as the primary sequences associated with UMIs are likely to be erroneous. We recorded the number of times each sequence was found as the second sequence associated with a UMI, and found empirically that a suitable threshold for accepting sequences as genuine was that they occurred less than 0.7 times as often as secondary sequences as they occurred as primary sequences. This threshold can, however, be adjusted to reflect the error distribution observed in a particular study. We found that this approach was very effective in identifying known errors, particularly chimeras, which were generally the most abundant errors. Chimeras were rejected more effectively by MAUI-seq than by the two established ASV clustering methods, dada2 and unoise3. Both of these latter two programs rely on de novo rejection of sequences that could be constructed as recombinants of other sequences that are more abundant in the sample (Edgar, 2016a). This method risks rejecting sequences that appear to be recombinant but are genuine alleles, which may not be uncommon, particularly in intraspecific samples. Our approach, by contrast, uses information on the observed error rates in the data (detected using UMIs) to decide whether a sequence is likely to be genuine, regardless of its actual sequence and relationship to other sequences. 
Sequences that could be generated as chimeras, or that differ by a single nucleotide from a more abundant sequence, may be accepted as genuine if they are more abundant than expected from their rate of occurrence as minor sequences associated with UMIs. In our study, this approach eliminated many known errors and substantially improved our confidence in the remaining data, providing a powerful additional reason for using UMIs in metabarcoding studies of all kinds. Because the detection of errors does not depend on the type of error, we would expect it to be equally effective for other types of error, such as homopolymers in pyrosequencing data, but we have no suitable data to confirm this. While we found that a simple empirical threshold was effective, we noticed that the proportion of secondary sequences varied markedly across studies and genes, suggesting that an adjustable threshold might give further improvement. A positive control using a synthetic mix of sequences expected to generate chimeras would be a useful standard for assessing the appropriate threshold for a particular experimental run. A future development might be to use the abundance of minor sequences associated with UMIs to generate a statistical model of error processes that would provide a firmer theoretical basis for the classification of sequences and could access information provided by the UMIs that are currently discarded because they have too few reads.
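
The aggregation step described above can be sketched as follows. The sketch assumes that each counted UMI has already been reduced to a (primary, secondary) pair, and that relative abundance is the share of all primary UMI counts. The function `classify_sequences`, its parameter names and the order of the two checks are illustrative assumptions rather than the MAUIcount.py code. The thresholds are the defaults quoted in the text (0.7 and 0.001).

```python
from collections import Counter


def classify_sequences(umi_calls, max_sec_pri=0.7, min_abundance=0.001):
    """Classify sequences from per-UMI (primary, secondary) calls.

    umi_calls: iterable of (primary, secondary) pairs, one per counted UMI;
    secondary is None when all reads of that UMI agreed.
    Returns {sequence: "accepted" | "sec/pri ratio" | "low UMI count"}.
    """
    primary, secondary = Counter(), Counter()
    for prim, sec in umi_calls:
        primary[prim] += 1
        if sec is not None:
            secondary[sec] += 1

    total = sum(primary.values())
    verdicts = {}
    for seq, n_primary in primary.items():
        if n_primary / total < min_abundance:
            verdicts[seq] = "low UMI count"      # too rare to report
        elif secondary[seq] / n_primary >= max_sec_pri:
            verdicts[seq] = "sec/pri ratio"      # probable error
        else:
            verdicts[seq] = "accepted"
    return verdicts


# Toy data: "chim" is the primary sequence of 100 UMIs but is also found
# as the secondary sequence of 90 UMIs that belong to A or B.
calls = (
    [("A", None)] * 5000 + [("B", None)] * 4000
    + [("A", "chim")] * 50 + [("B", "chim")] * 40
    + [("chim", "A")] * 100 + [("rare", None)] * 5
)
for seq, verdict in sorted(classify_sequences(calls).items()):
    print(seq, verdict)
```

The decision never inspects the sequences themselves, so in this toy example "chim" is rejected purely because of how it is distributed across UMIs.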

Furthermore, while we compared MAUI-seq to the default settings of dada2 and unoise3, several adjustable parameters are included in the dada2 pipeline which could further fine-tune chimera detection for specific data sets (Callahan et al., 2016). For example, we utilized the default “consensus” method for dada2, whereby sequences are identified as bimeric on a sample-by-sample basis and a consensus decision is subsequently made based on how frequently the sequence is flagged as chimeric in all other samples. We could alternatively have chosen the “pooled” method, in which all sample sequences are pooled for bimera identification. Additionally, the algorithmic parameters of dada2 can be user-adjusted further to modify the stringency of chimeric sequence detection, such as by altering the required fraction of samples which must identify a sequence as bimeric, or by changing the accepted abundance of parent sequences of potential chimeras. On the other hand, unoise3 has no user-adjustable parameters within the chimera identification section of the pipeline, as the unoise3 pipeline uses the chime2 de novo algorithm as a built-in chimera detection method (Edgar, 2016a).

The MAUI-seq approach requires very modest computational resources. Once forward and reverse reads have been assembled and the four genes separated, the MAUIcount script carried out dereplication, error detection and counting for the 32 samples shown in Figure 3 in about 2 min on a desktop computer (details in Table S2). From the same point, unoise3 required 286 min. dada2 works with unassembled reads, but the comparable error-detection and dereplication stage took 52 min.

4.2 Using UMIs to reduce amplification bias

One motivation for the use of UMIs is to obtain more accurate relative abundance data by eliminating possible sequence-specific bias in the PCR amplification, which may be introduced by variation in polymerase and primer affinity for some DNA templates. Indeed, we observed that the Platinum polymerase preferentially amplified the SM170C rpoB allele, whereas the Phusion enzyme did not have this bias (Table 1; Figure S1a–c). Allele variant bias was also shown for other target genes, although the ranking of the two enzymes was not always the same (Table 1; Figures S1–S4). However, in our study, the use of UMIs did not correct the allele bias. This suggests that the bias was present in the initial round of copying using the target-specific primer, rather than in the subsequent amplification rounds. For our case study, at least, the choice of polymerase was much more important for accurate relative abundance data than the use of UMIs. The main advantage of UMIs was, rather, the ability to remove most sequencing errors, as discussed in the preceding section.

4.3 Optimization of the protocol

As with any metabarcoding project, the first important step is to design the primers carefully to amplify the entire target community with minimum bias, and we used a large database of known gene sequences to achieve this. Another consideration that is shared with other approaches is the choice of polymerase for PCR. For the samples studied here, with abundant template DNA, the proofreading enzyme was clearly superior in performance, although more costly. On the other hand, this enzyme may provide less robust amplification when the template is weak, as we have observed in another project aimed at rhizobial DNA in soil (Boivin et al., 2020). The use of UMIs introduces other design considerations. We used 12 random nucleotides (with some constraints), giving over four million potential UMI sequences, which was sufficient for the scale of our studies, but it would be simple to increase the UMI length if greater sequencing depth was planned. In any metabarcoding study, the choice of sequencing depth is, to some degree, made blindly because the diversity of templates is not known in advance, but UMI-based approaches need greater depth because it is UMIs that are counted, not reads, and the aim is to have several reads per UMI. There are many factors that affect the average number of reads per UMI, but our study is encouraging in that, without separate optimization, all of our target genes in all of our samples gave usable data, even though the number of reads per UMI was suboptimal in most cases. Given a fixed sequencing effort, reads per UMI could, if necessary, be increased by reducing the concentration of the forward UMI-bearing primer and/or of the sample DNA so that fewer distinct UMIs were initiated. With our parameters, at least two reads are needed before a UMI is counted, and a sufficient fraction of the UMIs need at least four reads so that some will have a secondary sequence as well as the primary sequence (with at least two reads more than the secondary).

4.4 Wider applicability of the MAUI-seq approach

We have shown that the MAUI-seq analysis can readily be extended to published data for bacterial community characterization using 16S rRNA gene metabarcoding, provided that these have UMIs (Figure 5). While these samples had a comparable number of reads and of sequences per UMI to our data, they differed in some respects. The average quality of the reads was lower, and the errors that were detected were unresolved bases. Chimeras were not seen, perhaps because the amplicons were much more diverse in sequence, so there were few stretches of identity that would promote template switching. This demonstrates that the rarer reads associated with a UMI convey an exploitable “error” signal, regardless of whether these errors are chimeras or base misreads. For this data set, MAUI-seq identified more rare sequences than dada2, and reliably removed erroneous sequences using the UMI “error” signal, whereas dada2 relied on hard-coded parameters that exclude all reads containing “Ns,” thereby reducing the number of input reads.

While the benefits of MAUI-seq include increased accuracy and sensitivity for detection of true low-abundance variants and reduced false-positive variant verdicts in both inter- and intraspecies communities, the application of UMIs for error correction does introduce some extra cost. This is largely due to the requirement for greater sequencing depth because relative abundance is determined from UMI counts, which are much lower than read counts. Given the advantages of the UMI approach, the increased cost becomes more acceptable as sequencing costs continue to decline. A barcoded primer is required, but we were able to obtain satisfactory results without increasing the complexity of the laboratory manipulations beyond that needed for standard HTAS.

4.5 Future directions for MAUI-seq

HTAS is a valuable and widely used approach for the study of microbial community diversity, but handling erroneous sequences introduced by the amplification and sequencing procedures has always been challenging. The use of UMIs allows MAUI-seq to greatly reduce the incidence of errors through two mechanisms. First, the requirement that a UMI is associated with at least two identical reads eliminates many rare sequences that are predominantly erroneous. Second, sequences that are frequently generated as errors can be identified and removed because they occur unexpectedly often as minor components associated with UMIs that are assigned to more abundant sequences. These mechanisms are independent of any reference database and can recognize and retain genuine alleles that differ by a single nucleotide or match a potential chimera. This makes MAUI-seq particularly suited to studies of intraspecific variation, where the range of sequence divergence may be limited and not fully known in advance. However, the efficient elimination of erroneous sequences is also important in community studies such as those based on widely used 16S primers, and MAUI-seq is readily adaptable to this field, as we have demonstrated by analysis of published data. It should also be adaptable to environmental DNA studies, as it has been used to characterize rhizobial diversity based on nodD sequences amplified directly from soil (Boivin et al., 2020). This entailed the amplification of very low amounts of target DNA, and a nonproofreading DNA polymerase had to be used to obtain sufficient sensitivity, but we have shown in this study that MAUI-seq is proficient at dealing with the resulting higher error rate. The analysis pipeline is very fast because no sequence alignment or database searching is involved; only the accepted final sequences need to be characterized by comparison to a reference database.

Most HTAS studies report the relative proportions of the taxa in a community, but it would sometimes be valuable to estimate the absolute abundance of the microbes in the environmental sample. UMIs can potentially provide such information, if the initial template copying is carefully controlled so that the total number of distinct UMIs reflects the number of templates (Hoshino & Inagaki, 2017; Kivioja et al., 2012). While this would necessitate some additional steps at the start of the experimental protocol, it should still be possible to analyse the resulting sequences using the error-removal approaches provided by MAUI-seq. Alternatively, absolute abundance can be estimated by adding a spike of a known quantity of a recognizable target sequence to the sample before processing (Edgar, 2017; Kebschull & Zador, 2015; Palmer et al., 2018).
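
The spike-in approach amounts to a simple proportional scaling, sketched below. The sketch assumes that the spike and the target are amplified and sequenced with equal efficiency. The function `absolute_copies` and the numbers are hypothetical and are not drawn from any of the cited studies.

```python
def absolute_copies(target_count, spike_count, spike_copies_added):
    """Estimate template copies of a target from a spike-in of known size.

    Assumes counts scale linearly with template copies and that the spike
    behaves like the target during amplification and sequencing.
    """
    if spike_count <= 0:
        raise ValueError("spike-in was not detected; cannot scale counts")
    return target_count * spike_copies_added / spike_count


# Hypothetical example: 1,000 spike copies added, 200 spike counts observed,
# and 1,500 counts observed for the target sequence.
print(absolute_copies(target_count=1500, spike_count=200, spike_copies_added=1000))
```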

The addition of a UMI shortens the maximum length of target sequence that can be read, and the counting of UMIs rather than reads requires a higher depth of sequencing, but these limitations are becoming less important as improvements in sequencing technology increase both read length, enabling long-read amplicon sequencing (Karst et al., 2019; Kumar et al., 2018), and the number of reads. As implemented in MAUI-seq, UMIs are very effective in reducing the errors inherent in HTAS, and have the potential to improve the quality of any amplicon-based study of diversity. There are several parameters (minimum difference between primary and secondary reads of a UMI, ratio of secondary to primary reads of a sequence, minimum relative abundance) that are user-specified and can be adjusted to suit each study. In principle, it should be possible to optimize these using a statistical model of mutational errors, like that implemented in dada2 (Callahan et al., 2016), and of chimera formation, which is not modelled in detail by dada2. The UMIs provide an additional source of information to parameterize the model, linking sequences that have a common origin. Such a model would be complex, however, and parameterizing and testing it would need a data set that was optimized for the purpose. At the same time, it would also be interesting to explore the use of UMIs at both ends of the amplicon, which would provide an additional means to identify and eliminate chimeras (Burke & Darling, 2016).

5 CONCLUSIONS

Some potential advantages of incorporating UMIs in amplicon diversity studies have been explored previously, but here we propose a new way to use the extra information that they provide. Error processes lead to more than one sequence being associated with the same UMI, and this can be used to identify erroneous sequences regardless of their relative abundance or their relationship to other sequences in the sample. The method is experimentally and computationally straightforward, and we demonstrate its effectiveness using known strain mixtures and real environmental samples. It allows decontamination of amplicon sequence data by flagging chimeras and other errors, and can readily be adapted to any target gene of interest in microbiome studies.

ACKNOWLEDGEMENTS

We thank David Sherlock for his experimental expertise in developing this method, the University of York Technology Facility for sequencing, Simon Kelly for dada2 expertise, Asger Bachmann, Terry Mun, M. Izabel A. Cavassim and Marni Tausen for preliminary data analysis and script development, and DLF for access to their clover field trials.

    CONFLICT OF INTERESTS

    The authors declare that they have no competing interests.

    AUTHOR CONTRIBUTIONS

    Conceptualization: J.P.W.Y.; methodology: J.P.W.Y., S.U.A.; software: B.F., S.M., J.P.W.Y.; validation: B.F., S.M., J.P.W.Y., S.U.A.; formal analysis: B.F., S.M., J.P.W.Y.; investigation: B.F., S.M.; resources: J.P.W.Y., S.U.A., V.-P.F.; data curation: B.F., S.M., J.P.W.Y.; writing—original draft: B.F., S.M.; writing—review and editing: B.F., S.M., J.P.W.Y., S.U.A., V.-P.F.; visualization: B.F., S.M.; supervision: J.P.W.Y., S.U.A., V.-P.F.; project administration: J.P.W.Y., S.U.A.; funding acquisition: J.P.W.Y., S.U.A.

    DATA AVAILABILITY STATEMENT

    Raw Illumina reads are available in the SRA repositories with accession numbers SRP221010: Synthetic mix (Fields et al., 2020a) and Field-Samples-1 (Fields et al., 2020b); and SRP238323: Field-Samples-2 (Fields et al., 2020c). MAUI-seq scripts are available in the GitHub repository https://github.com/jpwyoung/MAUI. A detailed protocol for sampling, sample preparation, and read processing is available in File S1. Scripts used for dada2, unoise3, and figure generation are available in Files S3–S5. Detailed output sequences for all three methods are available in File S2.
