Volume 21, Issue 3 pp. 677-689
Resource Article

Biased assessment of ongoing admixture using STRUCTURE in the absence of reference samples

Sara Ravagni

Conservation and Evolutionary Genetics Group, Doñana Biological Station (EBD-CSIC), Seville, Spain

Ines Sanchez-Donoso

Conservation and Evolutionary Genetics Group, Doñana Biological Station (EBD-CSIC), Seville, Spain

Carles Vilà

Corresponding Author

Conservation and Evolutionary Genetics Group, Doñana Biological Station (EBD-CSIC), Seville, Spain

Correspondence

Carles Vilà, Conservation and Evolutionary Genetics Group, Doñana Biological Station (EBD-CSIC), Avd Americo Vespucio 26, 41092 Seville, Spain.

Email: [email protected]

First published: 21 October 2020
Citations: 6

Abstract

Detection of hybridization and introgression is important in ecological research as well as in conservation and evolutionary biology. STRUCTURE is one of the most popular software programs for studying introgression and allows estimation of the proportion of each individual's genome that belongs to each ancestral population, even when no reference sample from the ancestral nonadmixed populations has previously been identified. In spite of its frequent use, some studies have indicated that ancestry estimates may not always be reliable. We simulated population data under different conditions with regard to the genetic differentiation between ancestral populations, number of loci considered, number of alleles per marker and hybridization rate, and analysed the data with STRUCTURE. When reference samples were not included, the comparison of the known degree of admixture for each simulated individual and the value estimated with STRUCTURE revealed a strong underestimation of the level of introgression, with many admixed individuals classified as nonadmixed. This derives from an inaccurate estimation of the ancestral allele frequencies. When samples from the nonadmixed ancestral population were included as reference in the analyses, the bias in the estimations was reduced. The most accurate estimates were obtained when potentially admixed samples were few in relation to reference samples. Thus, whenever possible, a very large proportion of nonadmixed reference samples should be included in admixture assessments and different approaches should be combined. Misestimating the amount of introgression can impair our understanding of the evolutionary history of species and misguide conservation efforts.

1 INTRODUCTION

Hybridization is a major concern in conservation biology (Rhymer & Simberloff, 1996) but also a source of evolutionary innovation (Abbott et al., 2013; Arnold, 2015) and adaptive introgression (the introduction of genes from one evolutionary lineage into the gene pool of another) has been shown to play an important role in the evolution and diversification of different clades (Burgarella et al., 2019; Hamilton & Miller, 2016; Oziolor et al., 2019; Suarez-Gonzalez et al., 2018). Thus, the identification of admixed individuals and the assessment of the levels of introgression are fundamental steps to study the effect of hybridization on populations. However, the detection of these admixed individuals can be problematic when the parental species show limited phenotypic differentiation (Randi, 2008). The ineffectiveness of morphological criteria in differentiating cryptic hybrids or admixed individuals (Oliveira et al., 2008) promoted the use of highly polymorphic genetic markers, such as microsatellites or SNPs. The development of software programs based on the analysis of panels of genetic markers and the implementation of Bayesian clustering models has facilitated the detection of hybrids and the assessment of introgression of genes of one species into the gene pool of another.

One of the most widely used programs for these analyses is STRUCTURE (Pritchard et al., 2000) which allows identification of the origin of the genome of each individual. After choosing a K value, i.e., the number of populations, STRUCTURE subdivides the sample into K different clusters - trying to minimize departures from Hardy-Weinberg equilibrium and linkage disequilibrium - and estimates for each individual the proportion of the genome that could originate from each cluster (Barilani et al., 2007). In this way, STRUCTURE is very efficient at separating populations and identifying admixed individuals. A Scopus search (on 26 June 2019) for articles citing the original paper of Pritchard et al. (2000) and containing the words ‘hybrid*’ or ‘introgression’ in the title, abstract or keywords returned a total of 3,446 articles, highlighting the popularity of the software to detect hybridization and population admixture.

However, the power of STRUCTURE to accurately estimate the amount of introgression has been questioned, as it could underestimate the number of admixed individuals (Randi, 2008; Sanchez-Donoso et al., 2014; Sanz et al., 2009). The main limitation seems to be related to the detection of old admixture as opposed to F1 hybrids and first-generation backcrosses (Oliveira et al., 2008). It has been suggested that estimates of the amount of introgression could be improved by increasing the number of loci (Pritchard et al., 2000; Randi, 2008; Vähä & Primmer, 2006) and using both linked and unlinked markers (Lecis et al., 2006), as this could enable a better assignment of admixed individuals to separate genotypic classes and the identification of past hybridization events. The use of a high number of loci derived from genome-wide data increases resolution in the detection of varying levels of introgression by identifying regions of the genome of different origin (Gómez-Sánchez et al., 2018). Unfortunately, this is not always feasible due to the lack of a reference genome, or due to a trade-off between costs and the number of specimens to analyse. In order to characterize population variability, large numbers of samples may need to be studied, making genomic approaches unaffordable and, despite technical advances, studies with a reduced number of loci continue to be frequent in day-to-day research on hybridization and admixture (for example, see Alacs et al., 2010; Arias et al., 2019; De Barba et al., 2017; Sujii et al., 2019). In addition, large numbers of unmapped markers are not necessarily an advantage for the study of ongoing hybridization because they could yield redundant information if they are in complete linkage.

Another factor affecting the performance of STRUCTURE is the sampling scheme, i.e. whether or not samples of each parental population or species are similar in size, and the phylogenetic relationships between the two species (Neophytou, 2014; Puechmaille, 2016). A previous study using empirical data of populations with known ancestry showed that STRUCTURE outperforms other common approaches in the identification of admixed individuals (Bohling et al., 2013). However, this study emphasizes the importance of including in the analyses a portion of individuals that can be a priori diagnosed as nonadmixed for the two parental classes and this may not always be feasible. In many studies genetic analyses are carried out to identify potentially admixed individuals without previously defining reference samples (for example, see Godinho et al., 2011; Muñoz-Fuentes et al., 2007; Oliveira et al., 2008; Ortego et al., 2017; Trigo et al., 2013). This can be particularly important in cases with high levels of hybridization and persistent introgression throughout the distribution range, or in cases where introgression takes place in geographically structured populations for which reference samples from another population may not be appropriate (for example, see Glover et al., 2017; Lavretsky et al., 2019; Sullivan et al., 2016).

In this study we used simulations to determine the accuracy of STRUCTURE in the estimation of the individual level of introgression when nonadmixed reference individuals were not available. We assessed the importance of the number and polymorphism of the loci used, the divergence between the ancestral populations and different rates of hybridization between them. We also studied how adding appropriate reference samples impacts the reliability of the estimates. Our goal was to identify ways to improve the accuracy of the estimates of the degree of introgression.

2 MATERIALS AND METHODS

We carried out simulations of asymmetric gene flow from one population to another to reduce complexity and to facilitate analyses because the allele frequencies for one of the populations would not change over time. However, this is not an uncommon situation. Some examples of this kind of gene flow are the hybridization between domestic and vulnerable wild species (Godinho et al., 2011), admixture between a rare species and an abundant one (An et al., 2017), restocking of game species with farmed animals of alien origin (Sanchez-Donoso et al., 2014), or directional gene flow as a result of the biology of the species hybridizing (Muñoz-Fuentes et al., 2007).

2.1 Simulation of ancestral populations

We used the software easypop v. 2.0.1 (Balloux, 2001) to simulate pairs of diploid random mating populations, genotyped for 200 unlinked loci assuming different levels of polymorphism, with (a maximum of) either two, five or 10 alleles per locus. Mutation rate for markers with five or 10 alleles was set at 10⁻³ and for markers with two alleles at 10⁻⁸, as commonly assumed for microsatellites and SNPs (Drake et al., 1998; Ellegren, 2004; Payseur & Nachman, 2000). We assumed a K-allele mutation model (KAM) for markers with two alleles, in which each allele had the same probability of mutating to the other allelic state. For markers with up to five or 10 alleles we used a mixed model based on the single-step mutation model (SSM), with a proportion of 0.3 of KAM events (Ellegren, 2004). We generated between 700 and 5,000 individuals per generation and ran the simulations for at least 1,000 generations to assume a long time of separation between the populations and approach mutation-drift equilibrium and stable differentiation. We generated 100 pairs of populations for each kind of marker and with genetic differentiation, measured as FST, around 0.05, 0.1 and 0.2. Thus, we simulated 900 pairs of populations genotyped for 200 loci each (3 kinds of markers × 3 levels of differentiation × 100 replicates = 900 pairs of populations). Parameters used for the runs are reported in Table S1. For details of the simulations see Supporting Information, Figures S1 and S2.

2.2 Simulation of allele introgression

We wrote a script in Python 2.7 to simulate different hybridization rates and to subsample a random subset of individuals from each population to be analysed with STRUCTURE (see Figure 1 for a schematic representation of the simulations). Population A was where admixture took place due to some individuals arriving from population B. The goal of the analyses was to assess whether it was possible to correctly estimate the degree of introgression in individuals sampled from population A using STRUCTURE. Hybridization rate, that is, the proportion of breeders originating from population B that contributed to the offspring of population A every generation, was around 1% (0.01) or 5% (0.05). Although high, such rates of hybridization have been described for diverse taxa (for example, see Lavretsky et al., 2019; Muñoz-Fuentes et al., 2007; Nussberger et al., 2014). A key factor here is that we assume recurrent hybridization every generation and that admixed individuals do not have reduced fecundity, so that they are able to freely interbreed with other individuals in population A. To initiate the simulations (at generation 0), the program randomly selected 1,000 individuals from populations A and B from one of the 100 initial pairs of populations generated with Easypop for a given combination of FST and type of marker.

Figure 1. Pipeline to simulate introgression and subsampling for STRUCTURE analyses. First, 100 pairs of populations for each set of cases (pairwise FST = 0.05, 0.1 or 0.2, typed at 200 marker loci with two, five or 10 alleles) were simulated with Easypop. For 10 generations in population A, we randomly selected pairs of genotypes from the previous generation in population A or from population B according to the hybridization rate, and generated genotypes of 1,000 offspring. Finally, we randomly subsampled 100 individuals from each population A and B (with genotypes for 10, 30 or 100 randomly chosen loci) and genotypes were analysed in STRUCTURE. From the output of this program, we extracted estimates of ancestry and compared them to the real values derived from the proportion of ancestry from population A in the simulations (see text)

For 10 generations, the program generated the same number of individuals (1,000) by selecting two parents from the previous generation. Each parent was randomly selected from population A with a probability of (1 − m), where m was the hybridization rate (0.01 or 0.05); otherwise, the parent was randomly chosen from population B (gene flow was unidirectional from B to A). The genotype of the offspring was obtained by randomly selecting one of the two alleles from each parent at each locus with equal probability; loci were independent. For each individual, we calculated the proportion of the genome belonging to population A (qreal): for individuals originating from population B, this proportion was qreal = 0.0; for individuals from population A at generation 0, this proportion was qreal = 1.0; for subsequent generations, the proportion was the average of the values of the two parents. To confirm that the script was working as expected, we calculated the level of introgression expected after 10 generations in the different scenarios as in Verdu and Rosenberg (2011). Indeed, the expected value corresponded to the average level of introgression per individual observed in population A.
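The generation loop described above can be sketched in a few lines. This is a minimal illustration written for Python 3, not the authors' script (which is available at the repository cited below): the function names, the toy founder populations and the allele labels are hypothetical, and the sketch assumes that both parents are drawn independently, so the same individual may be drawn twice. Under the averaging rule for qreal, the mean qreal in population A is expected to shrink by a factor (1 − m) each generation, so the sketch compares the simulated mean against (1 − m)^g; treating this as the relevant expectation is an assumption made here.

```python
import random


def make_toy_population(n_individuals, n_loci, alleles, rng):
    """Hypothetical stand-in for an Easypop population: random diploid genotypes."""
    return [[(rng.choice(alleles), rng.choice(alleles)) for _ in range(n_loci)]
            for _ in range(n_individuals)]


def simulate_introgression(pop_a, pop_b, m, generations, n_offspring, rng):
    """Return a list of (genotype, q_real) for population A after admixture.

    A genotype is a list with one (allele, allele) pair per locus.
    Population B is treated as constant; gene flow goes only from B to A.
    """
    current = [(genotype, 1.0) for genotype in pop_a]   # generation 0 in A
    migrants = [(genotype, 0.0) for genotype in pop_b]  # individuals from B
    for _ in range(generations):
        offspring = []
        for _ in range(n_offspring):
            parents = []
            for _ in range(2):
                # With probability m the parent comes from B, otherwise from A.
                source = migrants if rng.random() < m else current
                parents.append(rng.choice(source))
            (geno1, q1), (geno2, q2) = parents
            # One randomly chosen allele per parent at each independent locus.
            child = [(rng.choice(locus1), rng.choice(locus2))
                     for locus1, locus2 in zip(geno1, geno2)]
            offspring.append((child, (q1 + q2) / 2.0))
        current = offspring
    return current


rng = random.Random(42)
pop_a = make_toy_population(1000, 5, ("a1", "a2"), rng)
pop_b = make_toy_population(1000, 5, ("b1", "b2"), rng)
m, generations = 0.05, 10
final = simulate_introgression(pop_a, pop_b, m, generations, 1000, rng)
mean_q = sum(q for _, q in final) / len(final)
print("simulated mean q_real:", round(mean_q, 3))
print("expected (1 - m)^g:   ", round((1.0 - m) ** generations, 3))
```

The two printed values should be close but not identical, because parents are sampled at random in every generation.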

At the tenth generation, a random subsample of 100 individual genotypes was taken both from population B and from the introgressed population A to be analysed with STRUCTURE. In these subsamples, the number of loci was reduced to 10, 30 or 100 loci chosen at random. The data were used to generate an input file for STRUCTURE. As a result, 5,400 runs of STRUCTURE were carried out with 200 genotypes each (100 from the admixed population A and 100 from population B): three different values of FST (0.05, 0.1 and 0.2) × three kinds of markers (two, five or 10 alleles) × three numbers of loci (10, 30 and 100) × two hybridization rates (1% and 5%) × 100 replicates (100 pairs of populations simulated in Easypop for each set of conditions with regard to FST and type and number of loci).
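As a check on the design described above, the factorial grid and the subsampling step can be sketched as follows. The constant and function names are hypothetical, and drawing one set of loci per data set and applying it to both populations is an assumption about how the subsampling was organised.

```python
import random
from itertools import product

FST_LEVELS = (0.05, 0.1, 0.2)
ALLELES_PER_LOCUS = (2, 5, 10)
LOCI_COUNTS = (10, 30, 100)
HYBRIDIZATION_RATES = (0.01, 0.05)
N_REPLICATES = 100


def design_grid():
    """One tuple per STRUCTURE run: (FST, alleles, loci, rate, replicate)."""
    return list(product(FST_LEVELS, ALLELES_PER_LOCUS, LOCI_COUNTS,
                        HYBRIDIZATION_RATES, range(N_REPLICATES)))


def choose_loci(total_loci, n_loci, rng):
    """Pick n_loci distinct locus indices at random."""
    return sorted(rng.sample(range(total_loci), n_loci))


def subsample_individuals(genotypes, n_individuals, loci, rng):
    """Pick individuals at random and keep only the chosen loci."""
    chosen = rng.sample(genotypes, n_individuals)
    return [[individual[i] for i in loci] for individual in chosen]


print(len(design_grid()))  # 3 x 3 x 3 x 2 x 100 = 5400
```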

The script for these simulations is available at https://github.com/sararvg/introgression_structure. This script was slightly modified for subsequent analyses.

2.3 Analysis of simulated data sets

The simulated data were analysed in structure v. 2.3.4 under the admixture model, as each individual may have ancestry in both initial populations, with correlated allele frequencies. We also carried out about 15% of the STRUCTURE runs under the independent allele frequency model but results were practically identical (data not shown) and we decided to focus on the first model. Analyses were run without population or location information, i.e., with the options USEPOPINFO and LOCPRIOR set to 0, in order to allow assignments based only on genetic information. INFERALPHA was set to 1 to let STRUCTURE infer α (the relative admixture levels between populations) from the data. K was set to 2 to try to separate the two initial populations. After visually confirming that this was enough for convergence, runs were carried out using 30,000 burnin steps followed by 100,000 iterations of MCMC, with only one replicate for each data set. We prepared a script in Python to extract the estimated proportion of the individual's genome corresponding to population A (qSTRUCTURE) from the STRUCTURE output.
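The run settings listed above can be collected in one place, for example to write them into STRUCTURE parameter files. The sketch below is illustrative only. USEPOPINFO, LOCPRIOR and INFERALPHA are named in the text; MAXPOPS, BURNIN, NUMREPS, NOADMIX and FREQSCORR are the parameter names assumed here to correspond to K, the burn-in length, the number of MCMC iterations, the admixture model and the correlated allele frequency model, and should be checked against the STRUCTURE documentation. Real parameter files also need entries describing the input data, which are omitted.

```python
# Settings taken from the text; several key names are assumptions (see lead-in).
SETTINGS_FROM_TEXT = {
    "MAXPOPS": 2,        # K = 2
    "BURNIN": 30000,     # burn-in steps
    "NUMREPS": 100000,   # MCMC iterations after burn-in
    "NOADMIX": 0,        # 0 = admixture model
    "FREQSCORR": 1,      # 1 = correlated allele frequencies
    "USEPOPINFO": 0,     # no population information
    "LOCPRIOR": 0,       # no location information
    "INFERALPHA": 1,     # infer alpha from the data
}


def render_params(settings):
    """Render settings as '#define KEY VALUE' lines."""
    return "\n".join("#define {} {}".format(key, value)
                     for key, value in settings.items())


print(render_params(SETTINGS_FROM_TEXT))
```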

We graphically compared qreal and qSTRUCTURE with the package ggplot2 (Wickham, 2009) using r v. 4.0.2 (R Core Team, 2020) in rstudio v. 1.3.959 (RStudio Team, 2020). All statistical analyses were carried out using the same versions of r and rstudio. We tested if qSTRUCTURE estimates were significantly higher than qreal values by performing a Wilcoxon signed rank test after excluding individuals from population B, with the function wilcox.test() from the coin package (v.1.3.1, Hothorn et al., 2008). We tested for a linear relationship between qreal and qSTRUCTURE for individuals sampled in population A through linear regressions with the function lm(). We visually examined the normal distribution of the residuals of the regressions and only reported results for the cases in which this requirement was fulfilled. In these cases, we used generalized linear models to test if the absolute difference between qreal and qSTRUCTURE depended on FST and on the number of alleles, both included in the model as explanatory variables. We ran these models under a beta distribution with the function betareg() from the betareg package (v. 3.1.2, Zeileis et al., 2012). As the response variable included 0 and 1, we applied the transformation suggested by Smithson and Verkuilen (2006): y′ = [y(n − 1) + 0.5]/n, where n is the sample size. We visually checked the models for homoscedasticity and normality of the residuals.
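The Smithson and Verkuilen (2006) transformation, y′ = [y(n − 1) + 0.5]/n, compresses values from the closed interval [0, 1] into the open interval (0, 1) so that a beta distribution can be fitted. The analyses in the source were run in R; the short Python sketch below only illustrates the arithmetic, and the function name is hypothetical.

```python
def squeeze_to_open_interval(values):
    """Apply y' = (y * (n - 1) + 0.5) / n, where n is the sample size."""
    n = len(values)
    return [(y * (n - 1) + 0.5) / n for y in values]


# With n = 3, the values 0 and 1 move to 1/6 and 5/6; 0.5 stays at 0.5.
print(squeeze_to_open_interval([0.0, 0.5, 1.0]))
```

The larger the sample, the smaller the shift, so with thousands of observations only the exact zeros and ones are moved noticeably away from the boundaries.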

Potential over- or underestimation of the proportion of genome belonging to population A could be related to an inaccurate estimation of the ancestral allele frequencies for populations A and B. To visualize changes in the allele frequency estimates, we carried out 100 additional runs of STRUCTURE sampling individuals at generations 0, three, six and 10 of the simulations (this was done for a single case, i.e., FST = 0.1, 30 loci with 10 alleles and hybridization rate of 5%). From the output, we extracted allele frequencies estimated for all loci for the ancestral populations A and B. The matrices corresponding to both populations were compared to the true allele frequencies calculated from the populations generated by Easypop, before admixture started. For the comparison we used a distance calculated as the sum of the squared differences between each pair of allele frequencies (frequency for one allele at one locus in the ancestral population minus the frequency estimated by STRUCTURE), divided by the number of loci.
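The distance described above can be written as a short function. This sketch assumes that the estimated frequencies have already been matched to the correct ancestral population and that alleles are listed in the same order in both inputs; the function name and the nested-list data layout are hypothetical.

```python
def frequency_distance(true_freqs, estimated_freqs):
    """Sum of squared allele frequency differences divided by the number of loci.

    Both arguments are lists with one entry per locus; each entry is a list
    of allele frequencies in the same allele order.
    """
    total = 0.0
    for true_locus, estimated_locus in zip(true_freqs, estimated_freqs):
        for p_true, p_estimated in zip(true_locus, estimated_locus):
            total += (p_true - p_estimated) ** 2
    return total / len(true_freqs)


true_freqs = [[0.5, 0.5], [0.2, 0.8]]
estimated = [[0.6, 0.4], [0.2, 0.8]]
print(frequency_distance(true_freqs, estimated))  # approximately 0.01
```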

To evaluate if ancestry estimates obtained with STRUCTURE improved with the inclusion of reference samples, we carried out additional simulations for four cases (FST = 0.1, 30 loci with five and 10 alleles, and both hybridization rates) but now the sample for population A included 10% or 30% of individuals from the ancestral population (generation 0). We ran the analyses both without and with information about the locality (USEPOPINFO option; PopFlag was set to 1 only for individuals belonging to the ancestral population A to use them as reference). For the same cases we also tested the effect of activating the POPALPHAS option to infer α for each population separately, which the STRUCTURE manual suggests for cases of strongly asymmetric admixture. We tested with generalized linear models how the differences between qreal and qSTRUCTURE for the samples of the admixed population A were affected by the proportion of samples used as a reference (0%, 10% or 30%), number of alleles (five or 10) and the use of the USEPOPINFO and POPALPHAS options.

We also assessed the reliability of estimates obtained with STRUCTURE when almost all samples included in the run were used as reference and the introgression was assessed in just a few target individuals. We selected 100 individuals representing the full range of qreal values (evenly distributed from 0 to 1). Twenty groups of five of these individuals were randomly selected without replacement and analysed together with 100 samples from A before admixture and 100 samples from B (for FST = 0.1) using 30 or 100 markers of five or 10 alleles. This process was repeated 10 times. The values of qreal and qSTRUCTURE were then compared for the target individuals.
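The partition of the 100 target individuals into 20 groups of five, drawn without replacement, can be sketched as below. How the 100 individuals spanning the range of qreal were chosen is not shown, and the function name is hypothetical.

```python
import random


def make_target_groups(individuals, group_size, rng):
    """Shuffle the individuals and cut the list into consecutive groups."""
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size]
            for i in range(0, len(shuffled), group_size)]


groups = make_target_groups(range(100), 5, random.Random(1))
print(len(groups), len(groups[0]))  # 20 groups of 5
```

Each group would then be analysed together with the 100 reference samples from A and the 100 samples from B, and the whole procedure repeated with a new shuffle.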

To confirm the importance of using appropriate reference samples, we analysed data from a natural population. We used a data set of wild common quails (Coturnix coturnix) and game farm quails from Sanchez-Donoso et al. (2014), genotyped at nine autosomal microsatellite loci. A previous study had shown that game farm quails used for restocking were a genetically diagnosable mix of common and Japanese quails (C. japonica; Sanchez-Donoso et al., 2012). We randomly selected genotypes from 100 wild quails from NE Spain obtained from 2007–2010 and analysed them in STRUCTURE together with 52 quails from game farms to assess the impact of the restocking on the natural populations. Afterwards, 10 samples collected in the same area in 1996–1997, before most releases of farm quails for hunting, were added as reference common quails free of introgression. The analyses were conducted using the USEPOPINFO and POPALPHAS options. We tested if STRUCTURE suggested a lower degree of admixture in the wild population in the absence of reference samples by performing a Wilcoxon signed rank test after excluding farm and reference individuals.

Finally, in order to assess if the biases detected when analysing the data with STRUCTURE were common to other programs also used to assess introgression, we also compared qreal to estimates of q obtained with admixture v. 1.3.0 (Alexander & Lange, 2011), ohana v.1.0 (Cheng et al., 2017) and snmf v. 2.0 (Frichot et al., 2014). We used default parameters and K was set to 2. For Ohana, the maximum number of steps was set to 130,000 to simulate the iterations used in STRUCTURE and for sNMF alpha was set to 0.5. Since some of these programs are designed for analyses of markers with two alleles (SNPs), the comparisons were restricted to the cases of FST = 0.1, 100 loci with two alleles and hybridization rates of 1% and 5%.

3 RESULTS

We graphically compared the proportion of the genome coming from population A as estimated by STRUCTURE (qSTRUCTURE) with the real values (qreal, Figure 2). Each plot represents 100 runs of STRUCTURE with 200 genotypes, resulting in 20,000 pairs of values of qSTRUCTURE and qreal. Ideally, if the estimates of q by STRUCTURE precisely corresponded to the real values, all points should fall on the diagonal of the diagrams. As expected, increasing the number of loci and the number of alleles per marker improved precision (reduction in the variance) in the estimates of qSTRUCTURE and, to a lesser degree, it also improved accuracy (similarity between qSTRUCTURE and qreal values). At the same time, comparing cases with the same number of loci and alleles per marker, qSTRUCTURE values showed lower variance as the divergence level between the hybridizing populations increased (e.g., Figures S3a versus S4a). However, the comparison of qSTRUCTURE and qreal revealed systematic biases.

Figure 2. Individual proportion of genome belonging to population A estimated with STRUCTURE (qSTRUCTURE) compared to the real proportion (qreal) calculated during the simulations. Simulations of admixture were conducted for two populations differentiated with FST = 0.1. Each panel represents 20,000 pairs of values, 10,000 pairs from individuals from population A and 10,000 from individuals from population B (100 runs with 100 individuals from each of the two populations). If the estimates of STRUCTURE precisely corresponded to real values, points should lie on the diagonal. (a) hybridization rate of 1% per generation; (b) hybridization rate of 5%

Considering a hybridization rate of 1% (Figure 2a, S3a and S4a), for markers with two alleles, qSTRUCTURE estimates tended to be quite independent from qreal, forming scattered clouds of points (meaning that STRUCTURE provided a poor assessment of the ancestry), except when the number of loci was 100 and the genetic differentiation was strong (Figure S4a). When the number of alleles per locus was 5, estimations using 10 or 30 loci were not reliable either and had a large variance. The consistency of the estimates increased notably with the use of 100 loci when FST was 0.1 or 0.2 (Figure 2a and S4a). In the cases of markers with 10 alleles, in general, estimations started to improve from 30 loci for all values of FST, with reduced variance in qSTRUCTURE values. However, even in the cases with the lowest variance, qSTRUCTURE tended to be larger than qreal (Figure 2a).

With a hybridization rate of 5% (Figure 2b, S3b and S4b), the entire population A was admixed and no pure individuals were left, i.e., no individuals whose genomes derived solely from the ancestral population; the population had turned into a hybrid swarm. Although qreal values were always smaller than 0.85, qSTRUCTURE values tended to be higher, identifying many admixed individuals as nonadmixed (qSTRUCTURE close to 1) and therefore underestimating the magnitude of the introgression in the population (Figure 2b, S3b and S4b).

The Wilcoxon signed rank test confirmed that qSTRUCTURE values were significantly higher than the corresponding qreal (p < 10⁻¹⁶) for individuals from population A, except in five cases in which qSTRUCTURE was practically uninformative (Figure 2a: FST = 0.1, hybridization rate of 1%, 10 markers with two alleles; Figure S3a: FST = 0.05, hybridization rate of 1%, 10 markers with two and five alleles, as well as 30 markers with two alleles; Figure S3b: FST = 0.05, hybridization rate of 5%, 10 markers with two alleles). This implies that there was a tendency to overestimate the proportion of the genome from the ancestral population in practically all cases (qSTRUCTURE > qreal). A significant linear relationship between qreal and qSTRUCTURE was found in nine cases, when 100 loci with five alleles (only for a high hybridization rate of 5%) or 10 alleles were used (Figure S5), and in all nine cases the regression line was above the diagonal (intercept significantly higher than 0, p < 10⁻¹⁶). The fact that in the other cases no relationship could be found between qSTRUCTURE and qreal highlights the limited power of analyses with a reduced number of loci and alleles.

We used generalized linear models to assess both the effect of the degree of differentiation between the ancestral populations (FST) and the number of alleles per marker on the absolute difference between qSTRUCTURE and qreal (the bias in the inference of q) for the two hybridization rates and considering 100 loci. Both variables proved to have a highly significant effect (p < 10⁻¹⁶). The increase in genetic differentiation showed the stronger effect in reducing the difference between the two q values (Table 1), therefore improving STRUCTURE estimates, but having markers with more alleles also helped.

Table 1. Generalized linear models testing the effect of FST and number of alleles on the absolute difference between qreal and qSTRUCTURE for samples from the admixed population A

Response variable      Hybridization rate   Explanatory variable   Estimate   z         p
|qSTRUCTURE − qreal|   1%                   (intercept)            −2.541     −300.71   <2e−16
|qSTRUCTURE − qreal|   1%                   FST                    −1.007     −26.93    <2e−16
|qSTRUCTURE − qreal|   1%                   Number of alleles      −0.014     −14.72    <2e−16
|qSTRUCTURE − qreal|   5%                   (intercept)            −0.838     −109.26   <2e−16
|qSTRUCTURE − qreal|   5%                   FST                    −0.943     −27.98    <2e−16
|qSTRUCTURE − qreal|   5%                   Number of alleles      −0.0129    −15.4     <2e−16

Note

  • The two factors have a significant effect. We used 100 markers with five or 10 alleles and the two rates of hybridization.
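To get a feel for the size of the effects in Table 1, the coefficients can be turned into predicted mean absolute differences. The sketch below rests on assumptions that the source does not state: that the models used a logit link for the mean (the default in betareg), that FST entered the model on its natural scale (0.05 to 0.2) and that the number of alleles entered as a count. The function names are hypothetical, and the output is an illustration of the arithmetic rather than a reported result.

```python
import math

# Coefficients copied from Table 1, keyed by hybridization rate.
TABLE_1 = {
    0.01: {"intercept": -2.541, "fst": -1.007, "alleles": -0.014},
    0.05: {"intercept": -0.838, "fst": -0.943, "alleles": -0.0129},
}


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_abs_bias(rate, fst, n_alleles):
    """Predicted mean |qSTRUCTURE - qreal| under the assumed logit link."""
    coef = TABLE_1[rate]
    eta = coef["intercept"] + coef["fst"] * fst + coef["alleles"] * n_alleles
    return inv_logit(eta)


for rate in (0.01, 0.05):
    for fst in (0.05, 0.1, 0.2):
        for n_alleles in (5, 10):
            print(rate, fst, n_alleles,
                  round(predicted_abs_bias(rate, fst, n_alleles), 3))
```

Under these assumptions the predicted difference is several times larger at a hybridization rate of 5% than at 1% for the same FST and number of alleles, which agrees with the pattern described in the text.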

The degree of overestimation of q by STRUCTURE can be better appreciated in the distribution of qSTRUCTURE − qreal for population A (Figure 3, S6 and S7). While the overestimation was limited when the hybridization rate was 1%, it dramatically increased when the rate was 5%. We compared the distribution of qreal and qSTRUCTURE for an example case (FST = 0.1, 30 markers with 10 alleles) for both hybridization rates (Figure 4). Although the two hybridization rates resulted in very different populations with regard to the level of introgression (see qreal in Figure 4), the results obtained with STRUCTURE were almost identical (see qSTRUCTURE), suggestive of a relatively low introgression, and showing that STRUCTURE was unable to differentiate the two cases.

Figure 3. Density plots of qSTRUCTURE − qreal for individuals from population A. Simulations were carried out for a differentiation of FST = 0.1 between the ancestral populations. (a) hybridization rate of 1%; (b) 5%. Data should ideally be distributed around the value 0 (marked with a vertical line). However, STRUCTURE analyses tended to overestimate the proportion of genome belonging to population A (qSTRUCTURE − qreal > 0), and the overestimation was greater at the higher hybridization rate
Figure 4. Density plots for qreal (a) and qSTRUCTURE (b) resulting from hybridization rates of 1% and 5%. Simulations were carried out for a degree of differentiation between the ancestral populations (FST) of 0.1 and 30 loci with 10 alleles. The values around q = 0 corresponded to individuals from population B, while the rest reflect the admixed population A. Although qreal indicated different biological situations for the two hybridization rates (a), with very different numbers of nonadmixed A individuals (q close to 1), qSTRUCTURE values suggested that the two situations resulted in similar introgression (b)

The allele frequencies estimated by STRUCTURE for population A progressively diverged from the ancestral ones as the number of generations of introgression increased (Figure 5). At generation 0, before any admixture, the estimates of allele frequencies for populations A and B were similar to the frequencies calculated from the ancestral populations. The distances in this case were not 0 because the estimates obtained by STRUCTURE were based on subsamples of the populations, resulting in sampling errors. As introgression increased the number of alleles from population B in population A in subsequent generations, STRUCTURE estimates of the allele frequencies in population A diverged more strongly from the ancestral ones, while estimates of the allele frequencies for population B remained unaffected. As estimations of q are associated with the inferred allele frequencies, the inaccuracy in the ancestral allele frequency estimates could lead to the observed overestimations in qSTRUCTURE for individuals originating from population A.
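The direction of this drift in allele frequencies follows from the simulation design. If a fraction m of the parents comes from population B in every generation, the expected frequency of an allele in population A moves towards its frequency in B by that fraction each generation (ignoring genetic drift and sampling error). The sketch below illustrates this expectation only. It does not reproduce how STRUCTURE estimates frequencies, and the function name and example frequencies are hypothetical.

```python
def expected_frequency_in_a(p_a, p_b, m, generations):
    """Expected allele frequency in admixed population A.

    Each generation: p_next = (1 - m) * p_current + m * p_b.
    """
    p = p_a
    for _ in range(generations):
        p = (1.0 - m) * p + m * p_b
    return p


# Hypothetical allele at frequency 0.8 in ancestral A and 0.2 in B.
for generation in (0, 3, 6, 10):
    print(generation, round(expected_frequency_in_a(0.8, 0.2, 0.05, generation), 3))
```

If the frequencies inferred for the cluster corresponding to A track the current admixed population rather than the ancestral one, an individual with typical admixed ancestry would resemble that cluster closely, which is consistent with q being overestimated.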

Figure 5. Distance between estimates of ancestral allele frequencies obtained with STRUCTURE and their real values for populations A and B as introgression of alleles from B to A advanced. A total of 100 STRUCTURE estimates of ancestral allele frequencies were obtained at generations 0, three, six and 10. In population B, estimates corresponded closely to the real values in all generations. For the admixed population A, divergence from the ancestral allele frequencies increased with introgression, showing that STRUCTURE did not correctly estimate ancestral allele frequencies

A possible solution to improve accuracy in the estimates of the allele frequencies could be the inclusion of reference individuals from the ancestral population A. Generalized linear models showed that (with a hybridization rate of 5%) the accuracy of qSTRUCTURE estimates improved by increasing the proportion of reference individuals and the number of alleles per marker, as well as activating the POPALPHAS option and marking reference individuals with USEPOPINFO (Table 2; Figure 6 and S8). However, qreal and qSTRUCTURE were still very different (Figure 6). When the hybridization rate was lower (1%) the effect of including reference samples was not obvious, and the generalized linear model could not be fitted because the model assumptions were not fulfilled.

Table 2. Generalized linear models explaining the differences between qreal and qSTRUCTURE for samples from the admixed population A

Response variable: |qSTRUCTURE − qreal|; hybridization rate: 5%

Explanatory variable      Estimate   z         p
(intercept)               −0.863     −140.89   <2e−16
Reference                 −3.432     −205.56   <2e−16
Number of alleles         −0.014     −19.32    <2e−16
USEPOPINFO (activated)    −0.133     −33.92    <2e−16
POPALPHAS (activated)     −0.499     −140.03   <2e−16

Note

  • The explanatory variables were the proportion of samples used as a reference (0%, 10% or 30%), the number of alleles of the markers (five or 10) and the use of the USEPOPINFO and POPALPHAS options in STRUCTURE, for simulations using 30 loci and a hybridization rate of 5%. The strongest effect is associated with the proportion of individuals used as reference (an increase in this proportion leads to a decrease in the difference between the estimates).
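
To make the relative size of these effects easier to read, the following sketch computes the linear predictor of the model in Table 2 for different settings. It rests on several assumptions that are not stated in the table: that the reference variable is coded as a proportion (0, 0.1 or 0.3), that the number of alleles enters as a count, and that the link function is monotonically increasing, so that a lower linear predictor corresponds to a smaller expected |qSTRUCTURE − qreal|. No link function is applied, so the values are not predicted differences.

```python
# Coefficients copied from Table 2 (hybridization rate of 5%).
COEF = {
    "intercept": -0.863,
    "reference": -3.432,
    "n_alleles": -0.014,
    "usepopinfo": -0.133,
    "popalphas": -0.499,
}


def linear_predictor(reference_prop, n_alleles, usepopinfo, popalphas):
    """Linear predictor on the link scale (coding of the variables is assumed)."""
    return (
        COEF["intercept"]
        + COEF["reference"] * reference_prop
        + COEF["n_alleles"] * n_alleles
        + COEF["usepopinfo"] * int(usepopinfo)
        + COEF["popalphas"] * int(popalphas)
    )


# Change in the linear predictor over the range explored for each variable.
contributions = {
    "reference 0% -> 30%": COEF["reference"] * 0.3,
    "alleles 5 -> 10": COEF["n_alleles"] * (10 - 5),
    "USEPOPINFO on": COEF["usepopinfo"],
    "POPALPHAS on": COEF["popalphas"],
}

baseline = linear_predictor(0.0, 5, False, False)
best_case = linear_predictor(0.3, 10, True, True)

for name, value in contributions.items():
    print(f"{name}: {value:+.3f}")
print(f"baseline: {baseline:.3f}, all improvements: {best_case:.3f}")
```

Under these assumptions, moving from 0% to 30% reference samples changes the linear predictor about twice as much as activating POPALPHAS, in line with the note that the proportion of reference individuals has the strongest effect.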
Comparison of qSTRUCTURE and qreal when 30% of the individuals from the target population are sampled from the ancestral population and are used as reference. Individuals used as reference and those belonging to population B were excluded from the plots. Simulations were carried out with FST = 0.1 and 30 loci, varying the number of alleles per marker and the hybridization rate. The accuracy of qSTRUCTURE estimates improved notably with the inclusion of reference individuals (compare to Figure 2b) and when activating the POPALPHAS option, especially for the cases with higher hybridization rate, where no pure individuals remained in the admixed population [Colour figure can be viewed at wileyonlinelibrary.com]

Despite the improvement in the estimates with the inclusion of reference individuals, the biases remained apparent even when 30% of the individuals from A were reference samples (Figure 6), and qSTRUCTURE values were still significantly higher than the corresponding qreal (p < 10⁻¹⁶). This could be due to inherent biases in STRUCTURE or to difficulties in inferring the ancestral frequencies when a large proportion of the samples derive from the admixed population. To distinguish between these possibilities, we compared qreal and qSTRUCTURE obtained when analysing small sets of five admixed individuals together with 100 samples from A before admixture and 100 samples from B. The results (Figure 7) show that the biases in the estimates of qSTRUCTURE practically disappeared. With 30 markers, qSTRUCTURE estimates had a large variance and were centred around the corresponding qreal values (along the diagonal in the figures), although qSTRUCTURE values were still significantly higher than qreal (p < .003). As expected, using 100 markers greatly reduced this variance, and with 10 alleles qSTRUCTURE and qreal values were not significantly different (p = .510). These results imply that STRUCTURE estimates were not intrinsically biased and that including a very large number of reference samples and a high number of markers helped to reduce biases in the estimates.

Comparison of qSTRUCTURE and qreal when most of the analysed individuals were used as reference and admixture was evaluated in just a few individuals. Simulations were carried out for a differentiation of FST = 0.1 between the ancestral populations. Each panel represents 10 independent qSTRUCTURE estimates for 100 admixed individuals representing the full range of qreal values. In each STRUCTURE run, 100 individuals from the ancestral population A and 100 from population B were analysed together with five admixed samples. The points lie around the diagonal showing that qreal and qSTRUCTURE tended to be similar [Colour figure can be viewed at wileyonlinelibrary.com]
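
The design behind Figure 7 (many runs, each with 100 reference samples per ancestral population and only five admixed samples) can be sketched as follows. The function name, the sample identifiers and the row layout are hypothetical. The sketch assumes the convention described in the STRUCTURE documentation, in which a POPFLAG value of 1 marks individuals whose population information is used when USEPOPINFO is active and 0 marks individuals whose ancestry is to be estimated; it only prepares the per-run sample lists and does not write a STRUCTURE input file.

```python
def build_batches(ref_a, ref_b, targets, batch_size=5):
    """Return one list of (sample_id, popdata, popflag) rows per planned run.

    Every run contains all reference samples plus a small batch of targets.
    """
    reference_rows = [(s, 1, 1) for s in ref_a] + [(s, 2, 1) for s in ref_b]
    batches = []
    for start in range(0, len(targets), batch_size):
        # popflag = 0: ancestry of these samples is estimated by the program.
        # Their popdata value (1 here) is only a placeholder (assumption).
        target_rows = [(s, 1, 0) for s in targets[start:start + batch_size]]
        batches.append(reference_rows + target_rows)
    return batches


# Hypothetical sample identifiers.
ref_a = [f"A_ref_{i}" for i in range(100)]
ref_b = [f"B_ref_{i}" for i in range(100)]
targets = [f"A_admixed_{i}" for i in range(100)]

batches = build_batches(ref_a, ref_b, targets)
print(len(batches), "runs of", len(batches[0]), "samples each")
```

With 100 admixed individuals this yields 20 runs of 205 samples, so that potentially admixed samples never exceed about 2.5% of any data set.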

The analysis of a data set from a natural population of common quails that experienced introgression from farm quails showed the same pattern when we added reference samples belonging to the same population but from before most of the restocking campaigns (Figure S9). The q values estimated after adding reference samples showed more admixture than those estimated without a proper reference for the wild population (Wilcoxon signed rank test, p < 10⁻⁶). This result confirmed the pattern observed in the simulations.

Given the biases observed with STRUCTURE, we carried out additional analyses with simulated data corresponding to ancestral populations differing by FST = 0.1 and 100 biallelic loci with ADMIXTURE, Ohana and sNMF to assess if the same biases were present in all cases. We graphically compared results from the four programs and STRUCTURE exhibited the worst performance under the set of conditions that we evaluated (Figure S10). The Wilcoxon signed rank test confirmed that q values were not overestimated with ADMIXTURE, Ohana and sNMF when the hybridization rate was 1%.

4 DISCUSSION

Our results show that when appropriate reference samples are not included in the analyses, ancestry estimates provided by STRUCTURE can be very biased. In fact, the results provided by this software can be very similar even when comparing populations with completely different levels of introgression (Figure 4).

When populations experience hybridization and admixture during multiple generations, the proportion of the genome deriving from the ancestral local population tends to be overestimated by STRUCTURE, leading to an underestimation of the real degree of introgression and of the number of admixed individuals. Although the precision of the ancestry estimates provided by STRUCTURE tends to improve with a higher number of markers and alleles (as suggested by previous studies; McFarlane & Pemberton, 2019; Vähä & Primmer, 2006), this increase in precision does not correspond to an increase in accuracy, and similar biases are observed with a small or a larger number of markers. The overestimate is particularly extreme in scenarios of hybrid swarms, as simulated with a hybridization rate of 5%: while none of the individuals was free of introgression, the estimates offered by the program suggested that purebred individuals were the majority in the sample.
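
The distinction between precision and accuracy made above can be expressed as two separate summaries of the error qSTRUCTURE − qreal: its mean (bias) and its standard deviation (spread). The sketch below uses invented q values chosen only to show a case in which the spread shrinks while the bias stays the same; they are not results from the simulations.

```python
import statistics


def bias_and_spread(q_est, q_real):
    """Mean signed error (bias) and standard deviation of the error (spread)."""
    errors = [e - r for e, r in zip(q_est, q_real)]
    return statistics.mean(errors), statistics.stdev(errors)


# Hypothetical true ancestry values and two sets of hypothetical estimates.
q_real = [0.60, 0.70, 0.80, 0.90]
q_few_markers = [0.65, 0.95, 0.90, 1.00]   # noisy and biased upwards
q_many_markers = [0.74, 0.83, 0.92, 1.00]  # tighter, but equally biased

bias_few, spread_few = bias_and_spread(q_few_markers, q_real)
bias_many, spread_many = bias_and_spread(q_many_markers, q_real)

print(f"few markers:  bias {bias_few:+.3f}, spread {spread_few:.3f}")
print(f"many markers: bias {bias_many:+.3f}, spread {spread_many:.3f}")
```

A positive bias means that the proportion of local ancestry is overestimated on average, regardless of how tightly the estimates cluster.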

The poor performance of STRUCTURE under the simulated scenarios could affect the interpretation of admixture patterns in deeply introgressed populations that have experienced several generations of hybridization, as in contact zones (Baldassarre et al., 2014; Johnson et al., 2015; Ortego et al., 2018). Extensive admixture could also occur in other cases, such as intercrossing between wild and domestic species that have coexisted for decades or centuries, or between different wild species that come into contact after a long period of evolution in isolation (Beaumont et al., 2001; Burgarella et al., 2018; Glover et al., 2017; Mckelvey et al., 2016; Scarcelli et al., 2017). It may seem that a hybridization rate of 5%, as in our models, is unlikely to exist in nature. However, similar values have been reported in the literature based on the identification of F1 hybrids (Lorenzini et al., 2014; Muñoz-Fuentes et al., 2007; Pacheco et al., 2017; Sullivan et al., 2016).

The limitations and the poor performance of STRUCTURE in admixed populations were already in part highlighted in the recommendations of Pritchard et al. (2000) about a proper utilization of the software to obtain reliable ancestry estimates. These authors indicate that, in cases of extensive admixture, STRUCTURE cannot estimate ancestral allele frequencies and it cannot give accurate estimates of q because of the high variance in how many of the individual's alleles derive from one or the other population. Our results confirm this and show that introgression leads to poor estimates of the ancestral allele frequencies, as we observe how the allele frequencies estimated for the ancestral population were increasingly different from the real ancestral frequencies (Figure 5), reflecting the allele frequencies for an already admixed population. The biases in the estimates of ancestry persist even in the cases with 100 loci of high polymorphism (Figure 2), suggesting that the estimation of ancestral allele frequencies may not be corrected just by increasing the number of unlinked loci. After repeated introgression during several generations, STRUCTURE may be unable to reliably reconstruct ancestral allele frequencies. Including purebred reference individuals in the data set resulted in a larger improvement in the estimates of q than just increasing the number of markers. However, even replacing 30% of the individuals from the admixed population A by reference individuals was not enough to completely remove the bias (Figure 6) and the best estimates were obtained when analysing just a few target individuals together with many reference individuals (Figure 7).

In order to develop cost-effective genetic tools for the assessment of introgression, efforts have concentrated on identifying the most suitable combination of markers. A reduced number of informative loci with high diagnostic power could be as effective as a high number of less informative loci, indicating that the discriminating power of the loci could be more important than their number (Oliveira et al., 2015; Randi et al., 2014). However, it is not completely clear what would be the best strategy to identify the most informative loci without previous data on marker variability across populations. On the other hand, in our simulations we considered all loci to be unlinked. The use of linked markers could also improve the identification of older admixture between populations (Falush et al., 2003; Lecis et al., 2006) because introgression can affect regions of the genome differently (for example, see Anderson et al., 2009), but may be less suitable to study ongoing hybridization and introgression.

Our study highlights the importance of carrying out simulations in each study case to assess the reliability of estimates. The level of introgression can be assessed by combining different approaches, and simulations that properly reflect the functioning of the study system (McFarlane & Pemberton, 2019) should be carried out to test the power of the analysis, determine accuracy and assess possible biases (for example, see Oliveira et al., 2008; Randi et al., 2014; Sanchez-Donoso et al., 2014; Sanz et al., 2009).

The estimates by STRUCTURE show a remarkable increase in accuracy when many nonadmixed reference individuals are included in the analysis (Figure 6). Therefore, we stress the importance of including samples from reference nonadmixed populations (for example, museum specimens dating from before an admixture event; see also Sanchez-Donoso et al., 2014) to increase the reliability of STRUCTURE analyses. The availability of these samples can be limited, and historical DNA extraction and amplification can be costly and effort-demanding, but the accuracy in the analysis improves notably, increasing the reliability of the results. When only a limited number of reference samples is available, it could be useful to carry out multiple STRUCTURE analyses, each including only a small proportion of samples whose ancestry is unknown (Figure 7), using USEPOPINFO to define the reference samples and comparing the results obtained with different test sample sets. Also, the use of the option POPALPHAS can improve the analyses when source populations are unequally represented and if there is unbalanced sampling (see Wang, 2017). The use of a small number of markers is also known to influence the results of STRUCTURE (Toyama et al., 2020). Nevertheless, Lawson et al. (2018) have shown that different demographic histories can lead to identical results suggesting admixture in STRUCTURE and emphasize the importance of combining analytical approaches to obtain a more robust analysis of recent demographic history. Our comparison of the results provided by different programs showed that not all of them suffer the same biases or to the same degree. Consequently, we strongly suggest combining STRUCTURE with other approaches and simulations and evaluating the consistency of the results, especially when it is not possible to include suitable reference samples in the analyses. This is especially important, for example, in cases where hybridization and introgression can be relevant in the design of management and conservation plans.

In ecology, conservation and evolutionary biology, it is important to efficiently identify hybrids and admixed individuals, as well as to determine gene flow among different populations. STRUCTURE analyses showed a tendency to classify admixed individuals as nonadmixed when reference samples were not included. Misestimating the degree of introgression can affect our understanding of the evolutionary history of species and of the risks of genetic homogenization and extinction, and therefore its implications for management and conservation plans should be carefully considered.

ACKNOWLEDGEMENTS

We thank Carlos Rafii for providing access to servers to run simulations and analyses. Part of the computer work was carried out in Genomics servers of the Doñana's Singular Scientific-Technical Infrastructure (ICTS-RBD). The Conservation and Evolutionary Genetics Group at the Estación Biológica de Doñana provided valuable support and comments on the manuscript and Dr Jennifer Leonard also helped us to improve the English. The study is supported by project CGL2016-75227-P to CV from the Spanish Government and an FPI (Formación de Personal Investigador) fellowship (BES-2017-081291) to SR. We are also grateful to the anonymous referees and associate editors for their constructive comments.

AUTHOR CONTRIBUTIONS

C.V. and I.S.-D. designed the study; S.R. carried out the project and wrote the scripts; S.R. and I.S.-D. analysed the data; S.R. wrote the first draft of the manuscript and all authors contributed to the text. This study was initiated as part of the Master in Biodiversity and Conservation Biology of S.R. at the University Pablo de Olavide (Seville, Spain).

DATA AVAILABILITY STATEMENT

Genotypes corresponding to simulated starting populations, scripts used to simulate introgression and sampling, and output of these scripts that were used for subsequent analyses (as well as R scripts used to analyse and plot data) are available at https://github.com/sararvg/introgression_structure. Quail data are available in the original publication (Sanchez-Donoso et al., 2012).
