Volume 25, Issue 5 e13816
RESOURCE ARTICLE

Toward absolute abundance for conservation applications: Estimating the number of contributors via microhaplotype genotyping of mixed-DNA samples

Yue Shi

Corresponding Author

College of Fisheries and Ocean Sciences, University of Alaska Fairbanks, Juneau, Alaska, USA

Wisconsin Cooperative Fishery Research Unit, College of Natural Resources, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin, USA

Correspondence

Yue Shi, College of Fisheries and Ocean Sciences, University of Alaska Fairbanks, Juneau, AK, USA.

Email: [email protected]

Cory M. Dick

California Cooperative Fish and Wildlife Research Unit, Department of Fisheries Biology, Humboldt State University, Arcata, California, USA

Kirby Karpan

National Oceanic and Atmospheric Administration, National Marine Fisheries Service, Alaska Fisheries Science Center, Auke Bay Laboratories, Juneau, Alaska, USA

Diana Baetscher

National Oceanic and Atmospheric Administration, National Marine Fisheries Service, Alaska Fisheries Science Center, Auke Bay Laboratories, Juneau, Alaska, USA

Mark J. Henderson

U.S. Geological Survey, California Cooperative Fish and Wildlife Research Unit, Department of Fisheries Biology, Humboldt State University, Arcata, California, USA

Suresh A. Sethi

U.S. Geological Survey, New York Cooperative Fish and Wildlife Research Unit, Cornell University, Ithaca, New York, USA

Megan V. McPhee

College of Fisheries and Ocean Sciences, University of Alaska Fairbanks, Juneau, Alaska, USA

Wesley A. Larson

National Oceanic and Atmospheric Administration, National Marine Fisheries Service, Alaska Fisheries Science Center, Auke Bay Laboratories, Juneau, Alaska, USA

U.S. Geological Survey, Wisconsin Cooperative Fishery Research Unit, College of Natural Resources, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin, USA

First published: 31 May 2023
Handling Editor: Yibo Hu

Abstract

Molecular methods including metabarcoding and quantitative polymerase chain reaction have shown promise for estimating species abundance by quantifying the concentration of genetic material in field samples. However, the relationship between specimen abundance and detectable concentrations of genetic material is often variable in practice. DNA mixture analysis represents an alternative approach to quantify specimen abundance based on the presence of unique alleles in a sample. The DNA mixture approach provides novel opportunities to inform ecology and conservation by estimating the absolute abundance of target taxa through molecular methods; yet, the challenges associated with genotyping many highly variable markers in mixed-DNA samples have prevented its widespread use. To advance molecular approaches for abundance estimation, we explored the utility of microhaplotypes for DNA mixture analysis by applying a 125-marker panel to 1179 Chinook salmon (Oncorhynchus tshawytscha) smolts from the Sacramento-San Joaquin Delta, California, USA. We assessed the accuracy of DNA mixture analysis through a combination of mock mixtures containing DNA from up to 20 smolts and a trophic ecological application enumerating smolts in predator diets. Mock DNA mixtures of up to 10 smolts could reliably be resolved using microhaplotypes, and increasing the panel size would likely facilitate the identification of more individuals. However, while analysis of predator gastrointestinal tract contents indicated DNA mixture analysis could discern the presence of multiple prey items, poor and variable DNA quality prevented accurate genotyping and abundance estimation. Our results indicate that DNA mixture analysis can perform well with high-quality DNA, but methodological improvements in genotyping degraded DNA are necessary before this approach can be used on marginal-quality samples.

1 INTRODUCTION

Molecular tools can provide important insights into species abundance, which is critical for many ecological and conservation applications, including population dynamics (Bravington et al., 2016; Roy et al., 2014), assessing dietary profiles (e.g. Shi et al., 2021), investigating community composition (Gehri et al., 2021) and biomonitoring (Darling & Blum, 2007). One popular ecological application is molecular diet analysis of faecal samples or stomach content samples to estimate the composition of prey species consumed by predators (e.g. King et al., 2008). The two primary methods used in molecular diet analysis are quantitative polymerase chain reaction (qPCR) and metabarcoding (Deiner et al., 2017; Harper et al., 2018; Pompanon et al., 2012). While both methods are useful for prey species detections, qPCR assays generate quantitative estimates of specific target species, whereas metabarcoding can provide a relative abundance of species within certain taxa (e.g. vertebrates, fish and zooplankton).

While the majority of molecular diet studies have focussed on prey species presence/absence, there has been substantial interest in extending these tools to estimate species abundance (Rourke et al., 2022). Past studies have demonstrated that both qPCR and metabarcoding can provide information on the amount of input DNA in a sample that can theoretically be used to estimate species abundance (Hänfling et al., 2016; Shelton et al., 2019). While there is often a positive correlation between qPCR or metabarcoding results and abundance in laboratory settings, the correlation is often variable and can be weak in natural environments (Fonseca, 2018; Kelly et al., 2019; Yates et al., 2019). qPCR can directly estimate the amount of target DNA present in a sample and therefore is not influenced by the mixture of species in a sample (Nathan et al., 2014). Metabarcoding, on the other hand, produces relative read abundance (RRA) for each species and is effective for determining the major taxa in a mixed species assemblage. A weak quantitative relationship between RRA and input DNA amount is often attributed to technical bias across species in the processes of sampling, library preparation and sequencing (Harrison et al., 2021; Lamb et al., 2019). Nevertheless, RRA may still provide valuable information in the absence of other data on community composition (Deagle et al., 2018). Additionally, the application of calibration methods such as incorporating internal standards or modelling underlying PCR processes can help mitigate species-specific amplification bias associated with the metabarcoding approach (Harrison et al., 2021; Shelton et al., 2023; Thomas et al., 2015).

Despite advances in quantitative metabarcoding, input DNA amount may not accurately reflect organismal abundance due to variation in animal size, shedding rates, degradation rates and myriad other environmental factors (Barnes et al., 2014; Carreon-Martinez et al., 2011; Levi et al., 2019; Stoeckle et al., 2017). An alternative approach is to estimate absolute abundance in a sample by leveraging within-species genetic variation to quantify the number of unique genetic contributors in a mixed-DNA sample (Curran et al., 1999; Weir et al., 1997). A major advantage of this DNA mixtures approach over qPCR and metabarcoding is that abundance estimates are decoupled from DNA quantity. In other words, as long as sufficient DNA from each individual is present in the sample and as long as individuals can be distinguished genetically, the absolute abundance estimate is insensitive to differences in the amount of DNA contributed by individual specimens (Sethi et al., 2019). Theoretically, factors that influence input DNA quantities such as organism size, sloughing rate and digestion rate do not influence absolute abundance estimates to the same degree as with qPCR and metabarcoding. The ability to count the number of organisms present in a mixed-DNA sample opens up promising opportunities for count-based ecological inferences, including, but not limited to, estimating the number of individuals on an invasive species front from eDNA, estimating the absolute abundance of a low-population species of conservation concern from eDNA or estimating the number of individuals of an endangered prey species consumed by invasive predators using diet samples (Sethi et al., 2019).

The DNA mixtures approach, which was first utilized for criminal forensics, relies on identifying unique genetic variation at the individual level to infer the number of contributors to a mixed-DNA sample (Bieber et al., 2016; Haned et al., 2011; Weir et al., 1997). An early ecological application of DNA mixture analysis involved genotyping five microsatellite loci and using a heuristic ‘allele counting’ approach to estimate the number of larval yellow perch (Perca flavescens) consumed by predators in river plumes (Carreon-Martinez et al., 2014). The ‘allele counting’ approach uses the number of different alleles observed in a DNA mixture sample at a given locus as an index of the number of contributors while taking into account the ploidy of the target taxon. For example, a DNA mixture sample from a diploid target species that shows three alleles at a locus indicates that at least two contributors must have been present to produce the mixture genotype. While allele counting is conceptually simple, this approach is not reliable beyond 3-individual mixtures (Dembinski et al., 2018). In comparison, a maximum likelihood-based approach can make explicit use of observed alleles present in a sample and their associated population allele frequencies to substantially improve the accuracy of estimates (Haned, 2011; Haned et al., 2011; Perez et al., 2011). The maximum likelihood-based approach compares the likelihoods of different candidate numbers of contributors to a DNA mixture sample and selects the number with the highest likelihood as the estimate.
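
The allele-counting heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited studies; the function name `min_contributors` and its inputs are hypothetical. It returns only a lower bound on the number of contributors, which is one reason the heuristic tends to break down for larger mixtures.

```python
import math


def min_contributors(alleles_per_locus, ploidy=2):
    """Lower bound on the number of contributors to a DNA mixture.

    alleles_per_locus: number of distinct alleles observed at each locus.
    Each contributor carries at most `ploidy` distinct alleles per locus,
    so the locus with the most alleles sets the minimum.
    """
    if not alleles_per_locus:
        raise ValueError("need at least one locus")
    return max(math.ceil(n / ploidy) for n in alleles_per_locus)


# Three distinct alleles at one diploid locus imply at least two contributors.
print(min_contributors([3]))        # -> 2
# Across several loci, the most diverse locus determines the bound.
print(min_contributors([2, 5, 4]))  # -> 3
```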

Sethi et al. (2019) investigated the potential of the likelihood approach for ecological and conservation applications. Their findings suggest that simulated data from single-nucleotide polymorphisms (SNPs) and microsatellite panels allowed for accurate estimation of the number of contributors in mixture samples with up to 10 individuals. As a proof-of-concept demonstration, the authors further applied this approach to both mock mixtures (i.e. a mixture of DNA from different individuals) and stomach samples from predatory fishes, genotyped using a small panel of 14 microsatellite markers. However, their results revealed a general tendency to underestimate the true number of genetic contributors in a mixture. In a recent study, Andres et al. (2021) applied the likelihood approach to eDNA samples with a larger panel of 28 microsatellites and showed a similar downward bias as the number of individuals in the mixtures approached 10. Their results also highlighted the challenges of accurately identifying rare alleles without introducing false-positive alleles when using microsatellites due to PCR stutter and allelic dropout. While SNPs are much easier to genotype, their biallelic nature makes it difficult to identify mixtures of many individuals. To address this issue, Andres et al. (2021) suggested using microhaplotype markers, which leverage the inherent phase information in short-read DNA sequence data to derive multi-allelic microhaplotypes from multiple, proximate SNPs on the same read (Baetscher et al., 2018; Kidd et al., 2014; McKinney et al., 2017). Importantly, low-frequency alleles of microhaplotype markers can be genotyped accurately, which is crucial for accurate DNA mixture analysis and overcomes the issues associated with microsatellites.

Here, we build on the previous work by Andres et al. (2021) and Sethi et al. (2019) and apply likelihood-based DNA mixture analysis using a 125-locus microhaplotype panel. Our study was motivated by the need to inform the conservation of imperilled salmon populations in the Sacramento-San Joaquin Delta, California, USA (hereafter referred to as the Delta). Habitat changes and the introduction of non-native fish species have fundamentally altered the Delta ecosystem, and many native fishes including Chinook salmon (Oncorhynchus tshawytscha) have experienced significant declines for decades (Carlson & Satterthwaite, 2011; Munsch et al., 2022). The impact of predation by non-native fish species on Chinook salmon populations has been difficult to quantify using conventional visual and molecular assessments of diet. Our study aims to address these limitations by experimentally assessing the feasibility of the DNA mixtures approach for counting Chinook salmon smolts in predator diets.

In this study, we first obtained haplotype frequencies by genotyping 1179 Chinook fin-clip samples using a panel of 125 microhaplotype loci. Secondly, to test the microhaplotype panel on actual amalgamations of DNA in a controlled setting, we estimated the number of contributors (hereafter referred to as NOC) in mock mixtures containing DNA extracts from 2–20 Chinook individuals. Lastly, to apply the optimized estimator in a more realistic setting, we explored the utility of DNA mixture analysis for diet analysis in a large controlled-feeding experiment by estimating the number of Chinook individuals found in the gastrointestinal (GI) tracts of two non-native predators in the Delta, largemouth bass (Micropterus salmoides, LMB) and channel catfish (Ictalurus punctatus, CCF). Our results illustrate the utility of microhaplotype panels for DNA mixture analysis but also illuminate some challenges associated with applying this approach to degraded DNA samples, which are typically what researchers encounter in diet samples.

2 MATERIALS AND METHODS

2.1 Curating a catalogue of haplotypes and estimating their frequencies

Accurate haplotype frequency estimation is crucial for precise estimation of NOC using the DNA mixture approach (Andres et al., 2021; Sethi et al., 2019). A large number of Chinook smolts from Mokelumne River Hatchery in California, USA (N = 1179) were genotyped using Genotyping-in-Thousands by Sequencing (GT-seq; Campbell et al., 2015) and a panel of 125 microhaplotype markers previously developed for Chinook salmon in the Klamath and Sacramento river basins, California, USA (Thompson et al., 2020). Because cross-amplification of predator DNA might interfere with NOC estimation of the prey, we also genotyped fin-clip samples of LMB and CCF specimens collected for this study to examine the level of cross-amplification between predators and Chinook salmon. DNA was extracted from dried fin-clip samples using either a commercial kit (Qiagen DNeasy Blood and Tissue Kits) or an in-house solution (10% Chelex 100 solution containing 1% of Triton-X 100 and 1% Tween 20). One negative control was included on each 96-well extraction plate. GT-seq was conducted following the methods of Campbell et al. (2015) with modifications detailed in Bootsma et al. (2020) except that we used the original postnormalization double-sided SPRI bead size-selection protocol of 0.5× to 1.2× (Campbell et al., 2015). Libraries were sequenced on an Illumina MiSeq benchtop platform with 150-bp paired-end sequencing chemistry. An initial GT-seq test run on 377 Chinook samples was conducted to evaluate the 125 microhaplotype markers, and we removed from the panel any loci showing over- or under-amplification.

Demultiplexed reads (forward reads only) were processed with trimmomatic v0.39 (Bolger et al., 2014) to remove adapter sequence using the following parameters: ILLUMINACLIP:2:30:10 SLIDINGWINDOW:4:15 MINLEN:50 and the adaptor sequences fasta file provided by trimmomatic, TruSeq-3-PE-2.fa. After trimming, forward reads were mapped to the reference file of consensus sequences of the 125 microhaplotype markers using bwa-mem v 0.7.17 with default settings (Li, 2013). On-target rate was calculated for each sample as the proportion of reads that aligned to amplicons in the microhaplotype panel. To assemble microhaplotypes and obtain read depths for each individual, we used the R package MICROHAPLOT (https://github.com/ngthomas/microhaplot). MICROHAPLOT uses the reference VCF file to obtain SNP positions for each locus and assemble SNPs into microhaplotypes and then extracts microhaplotypes from SAM files (Baetscher et al., 2018).
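
As an illustration of the on-target rate described above, the sketch below counts primary alignments in SAM-formatted text and reports the proportion that mapped to a panel locus. It is a hedged example rather than the pipeline used in the study: the function name, the locus names and the choice to skip secondary and supplementary alignments are assumptions. The FLAG bits used (0x4 unmapped, 0x100 secondary, 0x800 supplementary) follow the SAM format specification.

```python
def on_target_rate(sam_lines, panel_loci):
    """Proportion of reads whose primary alignment hits a panel locus.

    sam_lines: iterable of SAM text lines (header lines start with '@').
    panel_loci: set of reference names that belong to the marker panel.
    """
    total = 0
    on_target = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not (flag & 0x4) and fields[2] in panel_loci:
            on_target += 1
    return on_target / total if total else 0.0


# Toy records with hypothetical locus names: two of three reads are on target.
sam = [
    "@SQ\tSN:locus_A\tLN:150",
    "r1\t0\tlocus_A\t1\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tlocus_B\t5\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t2048\tlocus_A\t90\t60\t20M\t*\t0\t0\t*\t*",
]
print(round(on_target_rate(sam, {"locus_A", "locus_B"}), 3))
```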

To obtain a reliable catalogue of microhaplotypes and their frequencies from Chinook salmon tissue samples, we conducted the following filtering steps modified from Baetscher et al. (2018): (1) we removed incomplete haplotypes, that is, haplotypes containing N or X; (2) we removed haplotypes with fewer than 20 reads at a locus and a read depth ratio of <0.2 within an individual (read depth ratio is defined as the ratio between the read depth of a haplotype at a locus and the read depth of the haplotype with the highest read depth at that locus); (3) we removed monomorphic loci, that is, loci with only one haplotype present across all Chinook samples; and (4) we removed loci with more than two haplotypes in any individual. Genotypes were called from the remaining loci. An individual was called as a heterozygote if two haplotypes remained and a homozygote if only one haplotype remained. Finally, we used an iterative filtering approach to remove samples genotyped in <80% of loci and loci genotyped in <70% of samples. Haplotype frequency was calculated as the number of copies of a haplotype at a given locus, divided by the total number of copies present at that locus in the dataset. Given the large number of specimens sampled, the resulting haplotype frequencies are believed to be representative of population-level frequencies in the Delta.
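
The genotype-calling and frequency steps above can be sketched as follows for a single locus. This is an illustrative reading of the filters, not the authors' code: the function names and toy read depths are hypothetical, and the sketch assumes a haplotype is kept only if it passes both the minimum depth (20 reads) and the read depth ratio (0.2). Where the study removed whole loci with more than two haplotypes in any individual, this simplified version just returns no call for that individual.

```python
from collections import Counter


def call_genotype(read_depths, min_depth=20, min_ratio=0.2):
    """Call a diploid genotype at one locus for one individual.

    read_depths: dict mapping haplotype string -> read depth.
    Returns a tuple of two haplotype copies, or None if no call is possible.
    """
    # Step 1: drop incomplete haplotypes (those containing N or X).
    complete = {h: d for h, d in read_depths.items()
                if "N" not in h and "X" not in h}
    if not complete:
        return None
    top = max(complete.values())
    if top == 0:
        return None
    # Step 2: keep haplotypes passing both depth and read depth ratio filters.
    kept = sorted(h for h, d in complete.items()
                  if d >= min_depth and d / top >= min_ratio)
    if len(kept) == 1:
        return (kept[0], kept[0])   # homozygote
    if len(kept) == 2:
        return (kept[0], kept[1])   # heterozygote
    return None                     # zero or more than two haplotypes


def haplotype_frequencies(genotypes):
    """Copies of each haplotype divided by all copies at the locus."""
    counts = Counter(h for g in genotypes if g is not None for h in g)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


# Toy read depths for four individuals at one locus (made-up values).
individuals = [
    {"ACG": 60, "ATG": 45, "ACN": 30},  # heterozygote; ACN is incomplete
    {"ACG": 100, "ATG": 10},            # ATG fails the depth filter
    {"ATG": 80, "ACG": 12},             # ACG fails the depth filter
    {"ACG": 50},                        # homozygote
]
genotypes = [call_genotype(ind) for ind in individuals]
print(haplotype_frequencies(genotypes))
```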

To check cross-amplification of the microhaplotype panel in the predator fish species, we extracted DNA from LMB (N = 190) and CCF (N = 94) fin-clip samples and genotyped these samples using GT-seq as described above. We used the same filtering criteria on predator samples, that is, we removed haplotypes with fewer than 20 reads at a locus and a read depth ratio of less than 0.2 within an individual and assessed the read coverage at the loci/haplotypes shared with Chinook salmon. To check for contamination, we conducted the same analysis on negative control samples.

2.2 Estimating the number of contributors (NOC) in mock DNA mixtures

We constructed 285 mock DNA mixtures containing DNA from 2 to 20 Chinook smolts from Mokelumne River Hatchery in CA (Table 1) to assess the ability of the optimized microhaplotype panel (from above) to accurately estimate NOC across variable numbers of contributors. These mock DNA mixtures were made by pooling 2 μL of extracted DNA per individual and were prepared in three replicates (Table 1). No two pools contained the same set of individuals. Genotyping was conducted using GT-seq as described above, and we implemented the likelihood-based model described in Andres et al. (2021) and Sethi et al. (2019) to estimate NOC in mock DNA mixtures.
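
To make the likelihood idea concrete, the sketch below implements a classical presence/absence formulation: for n diploid contributors, the probability that 2n random allele draws yield exactly the observed set of haplotypes at a locus is obtained by inclusion-exclusion over subsets of that set, and log-likelihoods are summed across loci. This is a simplified illustration and not necessarily the exact model of Sethi et al. (2019) or Andres et al. (2021). It assumes independent loci, random sampling of alleles from the catalogue frequencies, and no allelic dropout or false alleles. All names and the toy frequencies are hypothetical.

```python
import math
from itertools import combinations


def locus_likelihood(observed, freqs, n, ploidy=2):
    """P(exactly the `observed` haplotype set | n contributors) at one locus."""
    observed = list(observed)
    k = len(observed)
    draws = ploidy * n
    if k == 0 or k > draws:
        return 0.0  # cannot observe more distinct alleles than allele copies
    total = 0.0
    for size in range(1, k + 1):
        sign = (-1) ** (k - size)
        for subset in combinations(observed, size):
            total += sign * sum(freqs[h] for h in subset) ** draws
    return max(total, 0.0)  # guard against tiny negative rounding error


def estimate_noc(mixture, catalogue, max_n=20):
    """Return the candidate number of contributors with the highest likelihood.

    mixture: dict locus -> set of haplotypes observed in the mixture.
    catalogue: dict locus -> dict haplotype -> population frequency.
    """
    best_n, best_ll = None, -math.inf
    for n in range(1, max_n + 1):
        ll = 0.0
        for locus, observed in mixture.items():
            lik = locus_likelihood(observed, catalogue[locus], n)
            ll += math.log(lik) if lik > 0 else -math.inf
        if ll > best_ll:
            best_n, best_ll = n, ll
    return best_n


# One toy locus with four equally frequent haplotypes, three of them observed.
catalogue = {"locus1": {"h1": 0.25, "h2": 0.25, "h3": 0.25, "h4": 0.25}}
mixture = {"locus1": {"h1", "h2", "h3"}}
print(locus_likelihood(mixture["locus1"], catalogue["locus1"], 2))  # -> 0.140625
print(estimate_noc(mixture, catalogue))                             # -> 2
```

With a single locus the estimate is necessarily coarse; in practice the likelihood is informative because it combines many loci, including those with low-frequency haplotypes.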

TABLE 1. Number of DNA mixtures constructed for each scenario (number of contributors) in each replicate (3 replicates in total) in the mock DNA mixture experiment.
# Individuals   Replicate 1   Replicate 2   Replicate 3
2               8             8             8
3               8             8             8
5               16            16            16
7               16            16            16
9               16            16            16
10              16            16            16
15              8             8             8
20              7             7             7

Distinguishing true alleles from technical artefacts is relatively simple for single-source diploid individuals, but this problem becomes more difficult in DNA mixture samples, which contain multiple individuals and thus many alleles that appear at low frequencies (Andres et al., 2021). The parameter that needs to be tuned to ensure accurate detection of alleles present in DNA mixtures is the read depth ratio. For individual tissue samples, we used a read depth ratio of 0.2 as suggested by Baetscher et al. (2018). However, since DNA mixture samples contain multiple haplotypes per sample with varying read depths, we opted for a reduced read depth ratio of 0.02 for DNA mixture samples. Through experimentation with different read depth ratio cutoffs, including 0.002, 0.02 and 0.2, we determined that a ratio of 0.02 achieves a desirable balance of false-positive and false-negative rates (see Section 3). Bias for mock DNA mixture samples was calculated as [estimated NOC] − [true NOC].

Genetic samples collected for some ecological applications, such as molecular diet analyses, tend to have lower DNA quantity and quality than tissue samples, which often results in locus dropout. After applying quality filtering on the extracted haplotypes, we identified a total of 74 loci (see Results). To assess how variation in locus dropout rates affects the NOC estimates, we constructed reduced marker panel sizes by randomly subsampling 10%–90% of the 74 loci in the final panel in increments of 10%. This allowed us to simulate scenarios where locus dropout occurred and examine how the reduced number of genotyped loci affected NOC estimates. At each level of locus dropout, we also assessed how NOC estimates were affected by different read depth ratio cutoffs (0.002, 0.02, 0.2). We conducted 50 trials for each combination of locus dropout rate and read depth ratio.
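
The locus-dropout simulation described above amounts to repeatedly drawing random subsets of the 74 retained loci. A minimal sketch of how such reduced panels could be generated is shown below; the locus names, the function name and the fixed random seed are assumptions for illustration, and NOC would then be re-estimated on each reduced panel.

```python
import random


def subsample_panels(loci, percents, n_trials=50, seed=0):
    """Draw random reduced panels at each retention percentage.

    Returns a dict mapping percent retained -> list of n_trials locus lists.
    """
    rng = random.Random(seed)
    panels = {}
    for pct in percents:
        k = max(1, round(pct * len(loci) / 100))  # number of loci retained
        panels[pct] = [rng.sample(loci, k) for _ in range(n_trials)]
    return panels


loci = [f"locus_{i:02d}" for i in range(1, 75)]   # 74 hypothetical locus names
panels = subsample_panels(loci, range(10, 100, 10))

# Panel sizes at each retention level, e.g. 20% of 74 loci rounds to 15 loci.
print({pct: len(trials[0]) for pct, trials in panels.items()})
```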

2.3 Estimating NOC in the feeding trial

We conducted a large feeding trial to understand how temperature and predator species influence digestion rates using diet analysis results obtained from visual identification, qPCR, metabarcoding and the DNA mixtures approach. Additional details on the feeding trial and results from the visual, metabarcoding and qPCR analyses are available in Dick et al. (2023). Briefly, two non-native predators in the Delta, LMB and CCF, were acclimated for 2 weeks at 15.5 or 18.5°C prior to the initiation of the feeding trial. After the acclimation period, individual predators were force-fed three fall-run Chinook salmon smolts from Mokelumne River Hatchery (average 6.4 g per smolt). At regular intervals postingestion, a subset of 5–10 predators from each species by temperature treatment were euthanized. GI tracts were removed and preserved in 100% lab-grade ethanol. Predator GI tract sampling began 6 h postingestion (t = 6), continued at t = 12 h and every 12 h thereafter until t = 96 h, and ended with a final sample at t = 120 h (5 days), resulting in a total of 10 time points (Table 2).

TABLE 2. Number of gastrointestinal (GI) tract content samples collected from each predatory fish species at two temperatures (15.5 and 18.5°C) and 10 time points postingestion in the feeding trial experiment.
Species               Channel catfish     Largemouth bass
Temperature (°C)      15.5      18.5      15.5      18.5
# Smolt               3         3         3         3
Hours postingestion
6                     6         5         9         9
12                    5         5         9         9
24                    5         5         9         9
36                    5         5         8         9
48                    6         6         8         10
60                    5         5         9         9
72                    6         5         8         10
84                    5         5         9         9
96                    6         5         8         8
120                   4         5         5         9

Stomach contents and stool were collected from the preserved GI tract samples. We then combined stool and small pieces of each visible diet item into a 1.5 mL tube, and excess ethanol was removed by centrifugation and pipetting followed by evaporation. DNA was extracted using a commercial kit (Macherey-Nagel Nucleospin 96 DNA Stool kit) with three modifications: (1) We replaced bead-induced lysis with enzymatic lysis; (2) we used a per-sample volume of 25 μL of proteinase-k and 850 μL of lysis buffer ST1; and (3) we incubated overnight at 56°C. See Dick et al. (2023) for detailed dissection methods. One negative control was included on each 96-well extraction plate. GT-seq genotyping and DNA mixture analysis were conducted in the same way as described above.

In total, we dissected 277 GI tract content samples, including 173 samples from LMB and 104 samples from CCF (Table 2). All GI tract samples were genotyped using the microhaplotype panel to determine the ability of the DNA mixtures method to accurately recover NOC as prey items were digested. We chose a read depth ratio of 0.02 for genotyping the stomach samples, which was informed by the results of the mock DNA mixture subsampling experiment and the relatively small number of Chinook smolts fed to each predator. We fit an exponential decay model, $y = a(1 - r)^{x}$, with x = hours postingestion and y = mean NOC estimate, to examine the loss rate of the number of detected Chinook salmon smolts in predator GI tract content samples over time across species and temperature. We estimated the initial amount (parameter ‘a’) and rate of decay (parameter ‘r’) in each decay model.
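
As a rough illustration of fitting such a decay curve, the sketch below assumes the model has the form y = a(1 − r)^x and estimates a and r by ordinary least squares on log-transformed values, since ln y = ln a + x ln(1 − r). The fitting procedure used in the study is not stated here, so this log-linear approach is an assumption (a nonlinear least-squares fit would be an equally plausible choice), and the data in the example are synthetic, not results from the feeding trial.

```python
import math


def fit_decay(hours, mean_noc):
    """Fit y = a * (1 - r) ** x by linear regression of ln(y) on x.

    Returns (a, r). Points with y <= 0 are skipped because ln(y) is undefined.
    """
    pts = [(x, math.log(y)) for x, y in zip(hours, mean_noc) if y > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive observations")
    mean_x = sum(x for x, _ in pts) / n
    mean_ly = sum(ly for _, ly in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in pts)
    slope = sxy / sxx                       # estimates ln(1 - r)
    intercept = mean_ly - slope * mean_x    # estimates ln(a)
    return math.exp(intercept), 1.0 - math.exp(slope)


# Synthetic example: values generated from a = 3 and r = 0.02 at the trial's
# ten sampling times, so the fit should recover those two parameters.
hours = [6, 12, 24, 36, 48, 60, 72, 84, 96, 120]
synthetic = [3 * (1 - 0.02) ** x for x in hours]
a_hat, r_hat = fit_decay(hours, synthetic)
print(round(a_hat, 3), round(r_hat, 3))
```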

3 RESULTS

3.1 Curating a catalogue of haplotypes and estimating their frequencies

Initial GT-seq testing using 377 Chinook fin-clip samples (a test set from 1179 Chinook smolt samples) showed that 11 out of 125 microhaplotype loci had either over or under amplification based on the total number of on-target reads across individuals (Figure S1). After removing these 11 loci, the final panel consisted of 114 microhaplotype loci (Table S1). Primer sequence information of the 114 loci can be found in data S5 in Thompson et al. (2020). GT-seq data using the final panel of 114 microhaplotype loci on 1179 Chinook fin-clips yielded an average of 8517 forward reads per sample (range = 1–26,336 reads) and an average of 6509 on-target reads per sample (range = 0–16,656). The median on-target rate was 80.21% (range = 0%–98.5%), with only 11 samples having an on-target rate less than 40% (Figure 1).

FIGURE 1. Comparison of on-target rates among Delta Chinook salmon smolt tissue samples (N = 1179), mock DNA mixture samples made up of Delta Chinook salmon smolts (N = 283) and GI tract content samples from the feeding trial (N = 277). On-target rate was calculated as the total number of on-target reads divided by the total number of reads. Median on-target rates are indicated with red vertical lines.

We implemented a rigorous quality filtering procedure to guarantee sample quality for the mock mixture analyses. Many samples and loci were eliminated in the final filtering step, which entailed removing samples genotyped in less than 80% of the remaining loci and removing loci genotyped in less than 70% of samples (Table S2). After quality filtering, a total of 74 loci and 565 samples remained. Details of the number of loci and samples remaining after each filtering step can be found in Table S2. These 74 loci contained 252 unique haplotypes with a median of 3 haplotypes and a range of 2–7 haplotypes per locus (Figure S2). The curated catalogue of 252 haplotypes had a wide range of haplotype frequencies, ranging from 0.001 to 0.997 with a median of 0.190 (Figure 2a). The majority of the 74 loci contained low-frequency haplotypes. Specifically, 54 loci (73%) had haplotypes with a frequency of less than 0.1, and 46 loci (62%) had haplotypes with a frequency less than 0.05.

FIGURE 2. (a) Broad haplotype frequency distribution of the curated catalogue of 252 unique haplotypes across 74 loci and 565 Delta Chinook salmon smolt samples after stringent filtering. (b) Bias in the estimated number of contributors using genotypes of the above-curated catalogue from various DNA mock mixture samples made up of Delta Chinook salmon smolts. Light grey points are individual mock DNA mixtures, and red points and lines are mean bias ±1 SD. A read depth ratio of 0.02 was used, below which haplotypes were removed.

We applied the same haplotype filtering to 190 LMB and 94 CCF fin-clip samples. Four LMB/CCF samples were outliers in terms of the total number of on-target reads (1166–11,544 reads; Figure S3). After filtering, these four samples still retained a non-negligible number of on-target reads (51–3084 reads), whereas the rest of the LMB/CCF samples had zero or close to zero on-target reads after filtering. These four outlier samples shared 80 haplotypes across 58 loci with Chinook. Our results suggest that these four samples were likely contaminated with Chinook salmon DNA, and thus, we conclude that, overall, there was no evidence of cross-amplification of our microhaplotype panel between the two predator fish species (LMB and CCF) and Chinook salmon. In addition, there was no systematic contamination in our dataset as the number of on-target reads after filtering was zero across all negative control samples.

3.2 Estimating NOC in mock DNA mixtures

The 285 mock DNA mixture samples yielded an average of 9001 forward reads per sample (range = 1–13,038 reads) and an average of 6807 on-target reads per sample (range = 0–9603). Three samples were dropped due to failed library prep and sequencing run. For the remaining 282 samples, the median on-target rate was 76.24% (range = 64.52%–82.51%; Figure 1). All 74 microhaplotype loci were successfully genotyped in all 282 mock DNA mixture samples (Figure 3a). Using the curated catalogue of haplotypes across the 74 microhaplotype loci described above, along with their frequencies and a read depth ratio of 0.02 (the ratio of 0.02 achieves a desirable balance of false-positive and false-negative rates; see below), NOC estimates generally fell within ±2 from the true NOC in mock DNA mixtures of up to 10 individuals with a mean bias of 0.2 ± 1.1 (Figure 2b). However, apparent negative bias emerged when true NOC was greater than 10, with a mean bias = −2 ± 2.1 for NOC = 15 and mean bias = −5.7 ± 1.6 for NOC = 20 (Figure 2b).

FIGURE 3. Effects of total on-target reads on the number of loci successfully genotyped in (a) mock DNA mixture samples (N = 282) and (b) GI tract samples from the feeding trial (N = 271). A total of 89 GI tract samples (grey points) with fewer than 20% of 74 loci genotyped (15 loci) were removed from the downstream analyses.

Our resampling analysis of the mock mixture indicated that variance increased with increased locus dropout rates and bias increased with larger depth ratio cutoffs (Figure 4). At the read depth ratio of 0.02, higher locus dropout rate resulted in larger variance in NOC estimates with the largest variances observed when genotyping coverage was 20% of loci or less (≤15 loci retained) across all NOC scenarios tested, although the effect was minimal when true NOC was 2 or 3 (Figure S4). In addition, with higher locus dropout rate, the estimate bias moved in the positive direction, especially when true NOC was less than 15. Interestingly, the locus dropout rate only had a marginal effect on the mean estimate bias, which was within ±2 from true NOC up to 15 individuals, suggesting moderate robustness in DNA mixture analysis to locus dropout type errors (Figure S4). Patterns of variance and mean estimate bias at the lowest read depth ratio (0.002) were similar to what was observed at the ratio of 0.02 (Figure 4), indicating such a low threshold of 0.002 was likely below the read ratio of all haplotypes within samples. By contrast, the highest read ratio of 0.2 likely exceeded the read ratio of all but the dominant haplotypes (i.e. haplotypes with highest read depth) within samples, resulting in negative bias in NOC estimates (Figure 4; Figure S4).

FIGURE 4. Bias in the estimated number of individuals contributing to mock DNA mixtures made up from Delta Chinook salmon smolt samples (range: 2–20 individuals per mixture) with varying simulated genotyping rates (10%–90% of 74 loci) and three read depth ratios (0.002, 0.02, 0.2), below which haplotype sequence reads were removed. Lower genotyping rates corresponded to higher locus dropout rates.

3.3 Estimating NOC in the feeding trial

A total of 277 GI tract content samples (173 LMB samples and 104 CCF samples) were genotyped following the feeding trial, yielding an average of 11,453 forward reads per sample (range = 0–111,528 reads) and an average of 2230 on-target reads per sample (range = 0–46,580). Compared with mock DNA mixture samples, these GI tract samples demonstrated a wide range of on-target rates across samples (0%–87.82%) with a median of 14.66% (Figure 1). The wide range of on-target rates was associated with time postingestion (Figure 5a). Specifically, the on-target rate decreased over time in both species and dropped markedly after 72 h at 15.5°C and after 48 h at 18.5°C (Figure 5a). Six samples were removed due to extremely low on-target reads (≤2 reads). For the remaining 271 samples, the number of successfully genotyped loci increased with the number of on-target reads (Figure 3b). When the total on-target reads reached at least 429 reads (N = 96), at least 90% of 74 loci (67 loci) were genotyped (Figure 3b). We further removed 89 samples with fewer than 20% of 74 loci genotyped (15 loci) because too few genotyped loci led to large variance in NOC estimates based on the subsampling experiment (Figure 4). Notably, these removed samples included all or most samples at 84–120 h postingestion in CCF (Table S3).

Changes in on-target rate (a) over time postingestion (up to 120 h) in the GI tract content samples (N = 277) of largemouth bass (LMB) and channel catfish (CCF) at two different feeding trial water temperatures (15.5 and 18.5°C). We removed six samples due to their extremely low on-target reads (≤2 reads) and an additional 89 samples due to fewer than 15 loci genotyped in these samples. We estimated the number of contributors (NOC) in each remaining sample (N = 182; b). In (b), light grey points are individual samples (N = 182), and red points and red lines are mean estimate ±1 SD. A read depth ratio of 0.02 was used, below which haplotypes were removed.
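The two-step sample filtering described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the record fields, the function names and the example samples are hypothetical, and the locus threshold follows the figure caption (samples with fewer than 15 genotyped loci are removed).

```python
def on_target_rate(on_target_reads, total_reads):
    """Fraction of reads that map to panel loci (0.0 if no reads)."""
    return on_target_reads / total_reads if total_reads else 0.0


def qc_filter(samples, max_failed_reads=2, min_loci=15):
    """Split samples into kept, too-few-reads and too-few-loci groups."""
    kept, low_reads, low_loci = [], [], []
    for s in samples:
        if s["on_target_reads"] <= max_failed_reads:
            low_reads.append(s)      # step 1: extremely low on-target reads
        elif s["loci_genotyped"] < min_loci:
            low_loci.append(s)       # step 2: too few loci genotyped
        else:
            kept.append(s)
    return kept, low_reads, low_loci


# Hypothetical samples; the values are illustrative only.
samples = [
    {"id": "s1", "total_reads": 12000, "on_target_reads": 6000, "loci_genotyped": 70},
    {"id": "s2", "total_reads": 9000, "on_target_reads": 2, "loci_genotyped": 0},
    {"id": "s3", "total_reads": 15000, "on_target_reads": 120, "loci_genotyped": 9},
]
kept, low_reads, low_loci = qc_filter(samples)
print([s["id"] for s in kept], on_target_rate(6000, 12000))

# Bookkeeping from the text: 277 samples, 6 and then 89 removed.
print(277 - 6 - 89)
```

The final line reproduces the sample accounting in the text: removing 6 and then 89 of the 277 samples leaves 182 samples for NOC estimation.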

In general, we observed a downward trend in estimated NOC over time in both CCF and LMB (Figure 5b), though the pattern was not as obvious as on-target rate (Figure 5a). The mean estimates of NOC were larger than one for up to 48–72 h though with high variance (Table S3). Both LMB and CCF showed exponential decay patterns in NOC through time, presumably as Chinook salmon DNA was digested or evacuated from predator guts (Figure S5).
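One simple way to summarize such a decay pattern is a log-linear least-squares fit of NOC(t) = a · exp(−k · t). The sketch below is an assumption about how this could be done, not a description of the analysis behind Figure S5; the function name and the data points are hypothetical.

```python
from math import exp, log


def fit_exponential_decay(hours, noc):
    """Least-squares fit of log(noc) = log(a) - k * t; returns (a, k)."""
    pairs = [(t, log(y)) for t, y in zip(hours, noc) if y > 0]
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_ly = sum(ly for _, ly in pairs) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pairs)
    sxy = sum((t - mean_t) * (ly - mean_ly) for t, ly in pairs)
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_t
    return exp(intercept), -slope


# Hypothetical, noise-free mean NOC values generated from a known curve.
hours = [6, 12, 24, 48, 72]
noc = [3.0 * exp(-0.02 * t) for t in hours]
a_hat, k_hat = fit_exponential_decay(hours, noc)
print(round(a_hat, 3), round(k_hat, 4))
```

Because zero estimates cannot be log-transformed, the sketch drops them. A nonlinear fit on the untransformed scale would be an alternative for real data with many zeros.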

4 DISCUSSION

Our study provides strong evidence that the likelihood-based DNA mixture analysis paired with a sufficiently variable microhaplotype panel can be used to accurately quantify the number of contributors to mixed-DNA samples containing up to ten individuals. However, we faced substantial methodological challenges associated with highly degraded DNA when applying this method to GI tract samples from piscivorous fish predators in a feeding trial. Our results reveal that the DNA mixtures approach holds promise but also has potential pitfalls. Below we discuss the methodological advances achieved in this study, some important considerations and limitations of the study, and how to potentially address them in the future.

4.1 The DNA mixture analysis paired with a microhaplotype panel: A promising approach for future studies

Our study clearly illustrates the benefits of employing microhaplotype markers for DNA mixture analysis. A recent study by Andres et al. (2021) used microsatellite markers genotyped with high-throughput sequencing and faced significant difficulty calling low-frequency alleles. They recommended that future studies test microhaplotype markers to specifically address this issue with low-frequency alleles. Here, we confirmed that microhaplotype markers are well-suited for DNA mixture analysis, due in large part to their ability to reliably genotype low-frequency alleles, which is critical for achieving accurate estimates of NOC.

It is important to note that the panel we used was developed for genetic stock identification of West Coast Chinook salmon, including California Coastal Evolutionarily Significant Unit (ESU), Southern Oregon and Northern California ESU and the Upper Klamath-Trinity Rivers ESU (Thompson et al., 2020) and was not designed to maximize the number of haplotypes at each locus, which would be the goal for optimizing the DNA mixtures analysis. By contrast, panels developed for parentage analysis or other applications often enrich for loci with a high number of alleles, and loci containing over 10 alleles/haplotypes are common (Baetscher et al., 2018). Our results demonstrate that an existing microhaplotype panel (containing a maximum of seven haplotypes per locus) not designed for NOC estimation can still be effective for DNA mixture analysis. Fortunately, designing new panels specifically for DNA mixture applications is not overly onerous, and the workflow for constructing these panels has been thoroughly described in previous papers (Baetscher et al., 2018; Bootsma et al., 2020).

Designing larger panels containing a high number of loci with more haplotypes would likely facilitate accurate NOC estimates for mixtures containing more than the 10 individuals that we could reliably resolve with our current panel. Previous investigations into DNA mixtures suggest that the maximum number of individuals that can be resolved is a function of the number of low-frequency alleles present in a dataset and the ability to accurately identify them. Andres et al. (2021) demonstrated that a microsatellite panel containing 28 loci and 253 total alleles could accurately estimate NOC in samples of up to 58 individuals in silico, but in practice, this panel was limited to resolving mixtures of ~10 individuals due to issues with differentiating true rare alleles from artefacts. Identification of rare alleles was more straightforward with our microhaplotype panel, but we were still potentially limited by the number of loci and the number of total alleles. Future studies could simulate varying panel sizes and assess the potential of microhaplotype panels to resolve even larger mixtures.
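The dependence on rare alleles can be illustrated with a simplified likelihood sketch. It is written in the spirit of allele-presence likelihood estimators and is not the implementation used in this study: it assumes diploid contributors whose 2n allele copies are independent draws from known population frequencies, with no genotyping error or artefact alleles, and all function names and frequencies are hypothetical.

```python
from itertools import combinations
from math import ceil, log


def prob_exact_allele_set(freqs, observed, n_contrib):
    """P(2n allele copies show exactly the observed alleles), by inclusion-exclusion."""
    draws = 2 * n_contrib
    observed = sorted(observed)
    total = 0.0
    for size in range(1, len(observed) + 1):
        sign = (-1) ** (len(observed) - size)
        for subset in combinations(observed, size):
            total += sign * sum(freqs[a] for a in subset) ** draws
    return max(total, 0.0)  # guard against tiny negative rounding error


def estimate_noc(loci, max_noc):
    """Return the contributor count in 1..max_noc with the highest log-likelihood."""
    # k distinct alleles at a locus need at least ceil(k / 2) diploid contributors.
    n_min = max(1, max(ceil(len(obs) / 2) for _, obs in loci))
    best_n, best_ll = None, float("-inf")
    for n in range(n_min, max_noc + 1):
        ll = 0.0
        for freqs, obs in loci:
            p = prob_exact_allele_set(freqs, obs, n)
            ll += log(p) if p > 0 else float("-inf")
        if ll > best_ll:
            best_n, best_ll = n, ll
    return best_n


# Hypothetical locus where the rare allele C (frequency 0.1) is not observed.
freqs3 = {"A": 0.6, "B": 0.3, "C": 0.1}
print(estimate_noc([(freqs3, {"A", "B"})] * 5, max_noc=10))

# Hypothetical locus where every allele is observed: no upper information.
freqs4 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
print(estimate_noc([(freqs4, {"A", "B", "C", "D"})], max_noc=10))
```

In the second example every allele at the locus is present, so the likelihood keeps rising with the number of contributors and the estimate runs to the search ceiling. Only unobserved alleles, typically rare ones, can bound the estimate from above, which is why panels with more low-frequency haplotypes should resolve larger mixtures.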

Our resampling analysis of known mixtures suggested that the read ratio cutoff should be set as low as possible to facilitate the identification of rare alleles without mischaracterizing true alleles as artefacts. Setting this value is a balance between biasing estimates upwards because artefact alleles are retained and biasing estimates downwards because true alleles are not detected, as discussed in Andres et al. (2021). Increasing sequencing coverage could allow better detection of true alleles and facilitate the use of smaller read ratio cutoffs, but the utility of this approach should be tested on known mixtures due to diminishing returns associated with increasing sequencing coverage of finite PCR products (Rochette et al., 2022). One potential solution to this issue is to perform multiple PCR replicates for each sample and combine the products to reduce the stochastic effects of PCR (Miller et al., 2002), which could cause certain alleles to amplify more readily.

Interestingly, our analysis of various levels of locus dropout rates revealed an unexpected relationship between the number of loci genotyped and the direction of bias in NOC estimates. As fewer loci were genotyped, we observed a more positive bias, and NOC was overestimated, especially when the true NOC was fewer than 15. In simulated data, this trend occurs in the opposite direction (Sethi et al., 2019), indicating that the positive bias that we observed may be due to artefacts. Specifically, we hypothesize that the upward bias due to artefact alleles is reduced when additional loci are genotyped. The locus dropout subsampling results indicate that accurate mean estimates of NOC can be obtained with relatively few loci, but as the number of loci genotyped decreases, the variance in NOC estimates increases, and it becomes more important to ensure that rare alleles are called correctly and distinguished from artefacts. Our empirical data from the feeding trial suggest that, for degraded DNA samples, the percentage of loci that can be successfully genotyped is positively correlated with the number of on-target reads, meaning that if few loci are genotyped, and the sequencing coverage for each locus is low, the potential for inaccurately identifying rare alleles increases. This is characteristic of poor-quality input DNA, such as that obtained from diets and some environmental samples. We therefore urge caution when estimating NOC using genotype data when a large number of loci failed to genotype.

One aspect of NOC estimation that we anticipated could be a problem was the cross-amplification of microhaplotype loci in predator species. Cross-amplification could inflate the number of alleles at a given locus and upwardly bias NOC estimates. Therefore, a best practice is to verify that no cross-amplification between species occurs. Here, we found no evidence that loci included in our panel amplified in LMB and CCF, suggesting that cross-amplification is likely to be minimal in distantly related taxa. However, cross-amplification could become a problem if more closely related species are analysed, such as in systems where multiple congeners are found. Certain microsatellite loci have been shown to amplify in a large number of salmonid species (Scribner et al., 1996; Williamson et al., 2002), and microhaplotype loci developed for kelp rockfish (Sebastes atrovirens) amplify in many other Sebastes species (Baetscher et al., 2023). Fortunately, when loci cross-amplify, alleles are often species-specific and can be dealt with in downstream analyses. If alleles overlap among species, loci containing these alleles should be removed prior to analysis. While it is important to address cross-amplification in DNA mixture studies where multiple species contribute to DNA samples, our study suggests that this issue should be relatively easy to resolve in most instances.

4.2 The utility of the DNA mixtures approach is hindered by low-quality DNA: some potential solutions and future research directions

Our mock DNA mixtures demonstrated the feasibility of accurately estimating NOC from mixed-DNA samples when DNA quality is high. However, resolving NOC in more degraded samples from the feeding trial proved difficult. Mean estimates of NOC were 2–3 (true NOC = 3) for up to 48–72 h postingestion. However, the variance in estimates, even in the early part of the trial, was generally high. These results indicate that the DNA mixtures approach we used can identify whether more than one individual was consumed by a predator. However, the accuracy of individual NOC estimates is likely to be low, limiting the practical resolution of the current approach. One potential way to increase accuracy could be to conduct multiple DNA extractions and/or PCR replicates and use the mean of the replicates as the NOC estimate (Alberdi et al., 2019; Mata et al., 2019). Our subsampling experiment also showed that the mean NOC estimate among replicates tended to be accurate regardless of the number of loci genotyped. However, this does not address the fundamental problem of reduced performance of the microhaplotype panel on degraded samples.

The percentage of on-target reads was already ~30% lower 6 h postingestion than in tissue samples (~80% on-target in tissue samples vs ~50% on-target 6 h into feeding trial). This value continued to decline over time, reaching ~10% at 72 h and functionally zero after that. Interestingly, the trend in the proportion of on-target reads across the feeding trial was very similar to the number of mtDNA metabarcoding reads across the same timespan (Dick et al., 2023). Our data strongly suggest that DNA degradation due to digestive processes is leading to lower proportions of on-target reads, which prevents accurate microhaplotype genotyping. One major advantage of the DNA mixtures approach, in theory, is that it should be robust to variation in DNA quantity. However, our data indicate that the poor performance of the nuclear microhaplotype panel in degraded samples largely negated the advantage of the DNA mixtures approach compared with methods amplifying the more abundant mtDNA such as qPCR and metabarcoding.

Some potential ways to improve the above problem in the future include (1) additional replication such as extraction and PCR replicates, which was discussed above, (2) redesigning the microhaplotype panel by shortening the microhaplotype product size as a potential solution to improve genotyping of degraded DNA samples, (3) additional sequencing coverage, which could potentially improve genotyping accuracy even when the percentage of on-target reads is low and (4) laboratory protocols that enhance the performance of microhaplotype panels with degraded samples. When DNA samples are degraded or fragmented, it can be challenging to obtain high-quality data from microhaplotypes. In such cases, shortening the product size of microhaplotypes can be a viable solution. Our current microhaplotype panel has product sizes of 90–143 bp. However, further shortening the microhaplotype product size may come with a cost of reduced genotyping resolution, limiting the ability to distinguish individuals with similar genetic profiles. Therefore, researchers are advised to carefully consider the potential trade-offs when designing panels. Increasing sequencing coverage could also improve results, especially in terms of confidently identifying rare alleles. Previous research has shown that increasing sequence depth increases the number of taxa recovered for eDNA samples (Shirazi et al., 2021), which is similar in concept to identifying rare alleles. However, while increased depth could improve results, it is likely that this is not a problem that researchers can ‘sequence their way out of’ given the extremely poor performance of highly degraded samples. Instead, we suggest that future studies focus efforts on improving laboratory protocols for extracting and amplifying degraded DNA and incorporate best practices (Deagle et al., 2006; Rohland et al., 2018).

It is important to note that the library prep protocol we used was designed for high-throughput analysis of thousands of fish tissue samples for genetic stock identification and was not optimized for degraded DNA. We suggest that future studies conduct an additional bead cleanup and quantify and normalize after PCR1 rather than use normalization plates after PCR2, which are designed to reduce high concentrations of DNA to a uniform concentration but are not effective if DNA concentrations are already below the expected input threshold (250 ng/well). Additionally, we suggest that future studies use quantification results to pool samples of similar quantity, and therefore likely similar quality, to reduce high variation in read counts across samples and amplification dominance. Finally, we suggest conducting iterative rounds of library preparation and sequencing to obtain usable data from as many samples as possible. In our experience, samples perform better in smaller batches, and this time-consuming iterative approach of analysing small batches of poor-quality samples may be the most feasible way to improve results, barring new methods for sequencing degraded samples. Our suggestions focus on the analysis of degraded but high-quantity DNA from diet studies, but may also be useful for aquatic eDNA studies, where DNA quantity is potentially more of an issue than quality (Harrison et al., 2019). Quantifying the performance of different amplicon sequencing approaches with highly degraded and low-quantity DNA using controlled dilution and DNA shearing experiments would help advance the application of the DNA mixtures method for both eDNA and molecular diet studies.

5 CONCLUSION

As discussed at length in Andres et al. (2021) and Sethi et al. (2019), the DNA mixtures method could be leveraged to address a multitude of important topics related to conservation, management and ecology of wild populations. However, developing methods to reliably estimate NOC in mixed-DNA samples with variable qualities and quantities has been difficult. Our study demonstrated that accurate NOC estimates for samples containing up to 10 individuals can be obtained using a panel of ~100 microhaplotype loci genotyped with GT-seq chemistry, and that this approach is more effective for accurately identifying rare alleles compared with microsatellites. However, analysis of highly degraded samples from a feeding trial produced relatively poor results due to a low percentage of on-target reads. We suggest that future studies focus on improving DNA extraction protocols for GT-seq analysis with highly degraded and low-quality samples. Substantial methodological improvements have made it feasible to implement the DNA mixtures method for nonmodel organisms in ecological studies. However, technical barriers still exist. We expect that future studies will successfully address these barriers, facilitating the widespread use of the DNA mixtures method to address important questions in conservation and ecology.

AUTHOR CONTRIBUTIONS

Y.S., C.M.D., M.J.H. and W.A.L. designed the study. C.M.D. conducted the feeding trials. K.K. and D.B. carried out the molecular laboratory work. Y.S. conducted the data analyses. Y.S. and W.A.L. drafted the manuscript. M.V.M. supervised the project. All authors edited the manuscript and gave final approval for publication.

ACKNOWLEDGEMENTS

This study was supported by a grant from California Metropolitan Water District (project number L04719). All animal experiments were conducted under Humboldt State University IACUC # 2021F5A. We thank Katie D'Amelio from NOAA Alaska Fisheries Science Center for assistance with laboratory work. We thank Fred Feyrer and Justin Clause from the U.S. Geological Survey for the collection of wild fish and Nann Fangue, Dennis Cochran and Sarah Baird from UC Davis for assistance with animal husbandry. We thank Anthony Clemento from NOAA Southwest Fisheries Science Center for providing primer pool aliquots and reference files of the microhaplotype panel and helping with MICROHAPLOT. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the US Government.

CONFLICT OF INTEREST STATEMENT

The authors declare that they have no conflicts of interest.

DATA AVAILABILITY STATEMENT

Demultiplexed GT-seq data used in this study are archived in the NCBI Sequence Read Archive with a BioProject ID, PRJNA917209. Sample meta information along with their sequence accession numbers can be found in Table S4. Other input data files and all bioinformatic scripts supporting this article are available on the GitHub repository (https://github.com/melodysyue/DNAmixture). Questions pertaining to data generated for this project should be directed toward the corresponding author.
