Identification of determinants of pollen donor fecundity using the hierarchical neighborhood model
Abstract
Individual differences in male reproductive success drive genetic drift and natural selection, altering genetic variation and phenotypic trait distributions in future generations. Therefore, identifying the determinants of reproductive success is important for understanding the ecology and evolution of plants. Here, based on the spatially explicit mating model (the neighborhood model), we develop a hierarchical probability model that links co-dominant genotypes of offspring and candidate parents with phenotypic determinants of male reproductive success. The model accounts for pollen dispersal and genotyping errors, as well as individual variation in selfing, pollen immigration, and differentiation of immigrant pollen pools. Unlike the classic neighborhood model approach, our approach is specifically designed to account for excessive variation (overdispersion) in male fecundity. We implemented a Bayesian estimation method (the Windows computer program available at: https://www.ukw.edu.pl/pracownicy/plik/igor_chybicki/1806/) that, among other things, allows for selecting phenotypic variables important for male fecundity and assessing the fraction of variance in fecundity (R²) explained by the selected variables. Simulations showed that our method outperforms both the classic neighborhood model and the two-step approach, where fecundities and the effects of phenotypic variables are estimated separately. The analysis of two data examples showed that in wind-pollinated trees, male fecundity depends on both the amount of pollen produced and the ability of pollen to spread. However, although tree size was positively correlated with male fecundity, it explained only a fraction of the total variance in fecundity, indicating the presence of additional factors. Finally, the case studies highlighted the importance of accounting for pollen dispersal in the estimation of fecundity determinants.
1 INTRODUCTION
Individuals in a sexually reproducing population can differ substantially in their reproductive success. The uneven contribution of individuals to the next generation is involved in genetic drift and natural selection mechanisms, altering genetic variation and phenotypic trait distributions in future generations. Therefore, understanding the determinants of reproductive success takes a central position in ecology, evolutionary and conservation biology (Knight, 2003; Lyons et al., 1989; Winter et al., 2008).
Much of the emphasis has been placed on identifying phenotypic variation that determines individual male reproductive success in plant populations (Meagher, 1991; Morgan & Conner, 2001; Smouse & Meagher, 1994; Smouse et al., 1999; Snow & Lewis, 1993). To overcome the difficulty related to the hidden nature of male reproductive success, marker-assisted paternity analysis has been used as a standard tool for estimating the numbers of effective pollen gametes contributed by pollen donors (Ashley, 2010; Snow & Lewis, 1993). In a simplistic approach, these numbers can be regressed on phenotypic variables to identify determinants of male reproductive success (Born et al., 2008; Hopley et al., 2015; Setsuko & Tomaru, 2011; Tambarussi et al., 2015). However, a fixed spatial distribution of adult plants together with spatially restricted pollen dispersal imposes natural constraints on mating opportunities, generating a nonrandom structure of pollen gametes captured by individual mother plants even under uniform fecundity (Meagher, 1991; Smouse et al., 2001). Therefore, it has become evident that the number of sired seeds is actually a function of two major factors acting simultaneously, that is, pollen dispersal capability and male reproductive potential (or male fecundity; Meagher, 1991; see also Figure 2B in Klein et al. (2008) for a schematic illustration). These factors cannot be separated easily, and, instead, they need to be simultaneously accounted for in the analysis of determinants of male reproductive success.
With the advent of the neighborhood model, the effects of pollen dispersal and fecundity started to be explicitly taken into account (Adams & Birkes, 1989). The model came with the brilliant idea of embedding a regression analysis of variable effects into the fractional paternity analysis, where all possible pollen sources are modelled using a compositional approach. Consequently, any uncertainty about the pollen source is automatically reflected in the precision of the estimates of regression parameters. The neighborhood model approach allowed combining genetic and nongenetic information to increase the power of the paternity assignment (Hadfield et al., 2006). Therefore, besides the practical benefit, the neighborhood model offered a substantial statistical advantage over the approach based on a simple paternity assignment.
In the original neighborhood model, hereafter called the classic neighborhood model, the effects of phenotypic characters on male fecundity were assessed using a fixed-effects model. In this approach, slopes of a regression function (selection gradients; Lande & Arnold, 1983) were estimated under the assumption that variation in reproductive success follows the variation expected for a multinomial distribution with probabilities being a (soft-max) function of a linear combination of measured phenotypic variables. Hence, any excessive variance, often termed overdispersion, in male reproductive success can lead to statistical issues, namely to the elevated risk of false positives (or an extreme anticonservatism; Hadfield et al., 2010), as first reported by Klein and colleagues (2008, 2011). Although overdispersion can be stochastic, it often results from variables that contribute to fecundity variation but are not included in the regression function.
The classic neighborhood model has been adjusted specifically to avoid various problems occurring in the case of real data (Burczyk & Chybicki, 2004; Chybicki, 2018; Chybicki & Burczyk, 2010; Gérard et al., 2006; Oddou-Muratorio et al., 2005). However, none of these modifications substantially changed the built-in regression analysis. In this respect, an important step forward was the approach of Klein et al. (2008), who proposed to treat individual fecundities as random variables characterized by an arbitrarily assumed probability distribution, replacing the fixed-effects model of fecundity with a random-effects model. Using the hierarchical Bayesian approach, individual fecundities were made estimable parameters along with the dispersal kernel and the other components of the model. Simulations showed that the model effectively captures the actual variance in male fecundity (Klein et al., 2011), resolving the problem of occasional overdispersion.
Despite a clear statistical advantage, the random-effects model may be considered a half-step forward, especially if a study aims to identify phenotypic effects rather than variation in fecundity alone. This is because the estimation of selection gradients requires estimates of individual fecundities in the first step to proceed with a regression analysis in the second step. Although such an approach can be generally valid, it certainly under-exploits available data compared to the classic neighborhood model because potential determinants of fecundity are completely ignored in the estimation of fecundity. As a result, the effectiveness of the two-step approach based on the random-effects model may be questioned.
This study aimed at extending the classic neighborhood model in order to quantify effects of phenotypic variables on male reproductive success in the face of overdispersion in male fecundity. Drawing advantages from both the classic neighborhood model and the random-effects model as well as from recent statistical developments (Chybicki et al., 2019), we elaborated a new approach to study ecological determinants of male reproductive success in plant populations. The new approach was subjected to a simulation study to reveal basic statistical properties related to sampling design. In addition, the new approach was compared to the classic neighborhood model and the two-step approach based on the random-effects model. Finally, two real data examples were analysed to illustrate the potential of the method in studying plant populations.
2 MATERIALS AND METHODS
2.1 The hierarchical neighborhood model


















2.1.1 Pollen source probabilities
The classic neighborhood model assumes that the probabilities of self-pollination (s), pollen immigration (m) and local outcross pollination (c) are the same for all sampled families. Here, as in earlier studies (Chybicki & Burczyk, 2013; Chybicki et al., 2019; Tani et al., 2015), we assumed that these probabilities might show inter-family variation. To do so, the vector of the three probabilities (s, m, c) for the j-th family was assumed to follow a Dirichlet distribution with three hyper-parameters.
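As a minimal sketch of what inter-family variation in pollen source probabilities means in practice, the snippet below draws a separate (s, m, c) vector for each family from a Dirichlet distribution. The hyper-parameter values in `ALPHAS` and the helper name `sample_dirichlet` are purely illustrative assumptions; they are not taken from the model description.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one vector from Dirichlet(alphas) by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical hyper-parameters for (selfing, immigration, local outcrossing).
ALPHAS = (0.5, 5.0, 20.0)

rng = random.Random(1)
# One (s, m, c) vector per family; 20 families are used here for illustration.
family_probs = [sample_dirichlet(ALPHAS, rng) for _ in range(20)]

for s, m, c in family_probs[:3]:
    print(f"s={s:.3f}  m={m:.3f}  c={c:.3f}")
```

Larger hyper-parameters (with the same ratios) would concentrate the family-level vectors around a common mean, which approaches the classic assumption of identical probabilities across families.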
2.1.2 Background population
Background pollen donors are incorporated into the model as allele frequencies in the background population. In order to reflect potential differences in background pollen pools among families, for the j-th mother and the l-th locus, the vector of allele frequencies was assumed to follow a Dirichlet distribution with parameters determined by F, the divergence rate, and by the allele frequencies at the l-th locus in a global background population (Chybicki, 2013). Here, F is very similar to the global ΦFT parameter in the TwoGener approach (Smouse et al., 2001), which captures the differentiation of background pollen pools between families.
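The role of the divergence rate F can be illustrated with a short sketch. It assumes the common parametrization in which the Dirichlet parameters equal the global allele frequencies multiplied by (1 − F)/F, so that the among-family variance of an allele frequency p is F·p·(1 − p). This parametrization, the function name `family_allele_freqs` and the example values are assumptions made for illustration only.

```python
import random

def family_allele_freqs(global_freqs, F, rng):
    """Draw family-specific allele frequencies at one locus.

    Assumes Dirichlet parameters theta * p_i with theta = (1 - F) / F,
    where p_i are the global background allele frequencies.
    """
    if not 0.0 < F < 1.0:
        raise ValueError("F must lie strictly between 0 and 1")
    theta = (1.0 - F) / F
    draws = [rng.gammavariate(theta * p, 1.0) for p in global_freqs]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(7)
global_freqs = [1.0 / 8.0] * 8  # hypothetical locus with 8 equifrequent alleles
for F in (0.01, 0.1):           # hypothetical divergence rates
    freqs = family_allele_freqs(global_freqs, F, rng)
    print(F, [round(x, 3) for x in freqs])
```

Under this assumption, a small F yields family pollen pools that closely resemble the global pool, whereas a larger F yields strongly differentiated pools.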
2.1.3 Male reproductive success
















Male fecundity was assumed to be a random variable, such that the logarithm of fecundity is normally distributed (see Klein et al., 2008). Specifically, we assumed that the log-fecundity of the k-th individual follows a normal distribution, where μk is the mean and σ is the standard deviation.
2.1.4 Selection gradients
In order to incorporate the effects of phenotypic variables on male reproductive potential, the mean log-fecundity of the k-th individual, μk, was assumed to be a linear combination of variables, that is, the sum over m of βm·zkm, where βm is the selection gradient for the m-th variable and zkm is the m-th (normalized) variable measured for the k-th individual. Because the neighborhood model implements the expected proportion of pollen gametes of the k-th male in a given maternal pollen pool, the fecundity term in equation (3), appearing both in the numerator and denominator, needs to be defined only up to a constant. For this reason, μk lacks a constant term, which would eventually cancel out in (3). It follows that zero can be used as a reference mean value of the relative log-fecundity, so that μk > 0 and μk < 0 mean that the k-th individual has greater and lower fecundity, respectively, than the average-valued male. The null model for covariates, when all βm = 0, is equivalent to the assumption μk = 0 for every individual.
The standard deviation σ models overdispersion, that is, the variation in male fecundity that is not captured by the regression function. Consequently, for σ diminishing to zero, the log-fecundity of each individual converges to its regression mean and the hierarchical neighborhood model becomes equivalent to the classic neighborhood model, which, therefore, can be treated as a special case of the hierarchical model.
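The reasoning about the missing constant term can be checked numerically. The sketch below assumes that the expected share of a male in a maternal local pollen pool is proportional to exp(log-fecundity) multiplied by a dispersal weight; the function name `paternity_proportions` and all numeric values are hypothetical.

```python
import math

def paternity_proportions(log_fecundity, kernel_weight):
    """Expected share of each male in one mother's local pollen pool.

    Each share is proportional to exp(log-fecundity) * dispersal weight.
    """
    shift = max(log_fecundity)  # subtracting a constant improves stability
    weights = [math.exp(f - shift) * d
               for f, d in zip(log_fecundity, kernel_weight)]
    total = sum(weights)
    return [w / total for w in weights]

log_fec = [0.8, 0.0, -0.4]   # hypothetical relative log-fecundities
kernel = [0.2, 0.5, 0.3]     # hypothetical dispersal weights towards one mother

base = paternity_proportions(log_fec, kernel)
shifted = paternity_proportions([f + 3.0 for f in log_fec], kernel)
print([round(p, 4) for p in base])
print([round(p, 4) for p in shifted])  # same shares: the constant cancels
```

Because any constant added to all log-fecundities cancels in the ratio, only relative fecundities are identifiable, which is why the regression function carries no intercept.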
2.2 Estimation
The hierarchical neighborhood model has multiple estimable parameters. Parameters of interest include pollen source probabilities, dispersal kernel parameters, background pollen divergence, and selection gradients. In addition, allele frequencies in background pollen pools need to be estimated as well, but they are usually less important in empirical studies (they are nuisance parameters). Individual fecundity values can be treated as either nuisance parameters or goal parameters, depending on the context.

Nonetheless, in the hierarchical model, selection gradients as well as the remaining hyper-parameters do not enter the likelihood function directly. To make the estimation of these parameters possible, a hierarchical Bayesian approach can be used. In this regard, our study followed the methods developed earlier (Chybicki, 2013; Chybicki & Burczyk, 2013; Klein et al., 2008). These methods implemented the Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior distribution of parameters. Here, a Bayesian variable selection procedure was additionally implemented using the reversible-jump MCMC (RJMCMC) algorithm (details are given in Appendix S1), similar to that described in Chybicki et al. (2019). The RJMCMC algorithm was used to narrow the focus to parsimonious regression models that tend to include important selection gradients only. The algorithm allows estimating the posterior probabilities of the regression models. Hence, one is able, quite automatically, to select those effects that remain in the model with the highest probability.
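The RJMCMC sampler itself is beyond a short example, but the way its output can be summarized is easy to sketch. The snippet below assumes that each retained MCMC draw records a 0/1 inclusion indicator per phenotypic variable; the draws shown are made-up values, and the function names are hypothetical.

```python
from collections import Counter

def model_posterior(indicator_draws):
    """Posterior probability of each regression model (set of included variables)."""
    counts = Counter(tuple(d) for d in indicator_draws)
    n = len(indicator_draws)
    return {model: c / n for model, c in counts.items()}

def inclusion_frequency(indicator_draws):
    """Fraction of draws in which each variable is included."""
    n = len(indicator_draws)
    n_vars = len(indicator_draws[0])
    return [sum(d[m] for d in indicator_draws) / n for m in range(n_vars)]

# Made-up indicator draws for three candidate variables.
draws = [(1, 1, 0)] * 90 + [(1, 0, 0)] * 8 + [(1, 1, 1)] * 2

probs = model_posterior(draws)
best_model = max(probs, key=probs.get)
print(best_model, probs[best_model])
print(inclusion_frequency(draws))
```

Parameter summaries conditional on the most probable model can then be computed from the subset of draws whose indicator vector equals `best_model`.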
For a rough assessment of a regression model's quality, the fraction of the total variance that is captured by the model (R²) is usually of interest. While a Bayesian R² cannot be derived from the standard definition without problems, an empirical R² equivalent can be computed easily following Gelman et al. (2019), as R² = Vfit/(Vfit + Vres), where Vfit is the variance of the modelled predicted means, and Vres is the modelled residual variance. In our case, this approach results in R² = V(μk)/(V(μk) + σ²), where V(·) is the variance estimator of a variable indexed from k = 1 to NA, μk is the regression mean of the log-fecundity of the k-th individual, and σ is the residual standard deviation. Therefore, the estimation algorithm was designed to estimate R² along with the model parameters.
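The empirical R² described above can be sketched for a single posterior draw as follows. The snippet assumes that the predicted mean of each individual is a linear combination of its standardized phenotypes and that the residual variance is the square of the overdispersion standard deviation; the names `predicted_means` and `r2_draw`, and all values, are illustrative.

```python
from statistics import variance

def predicted_means(betas, phenotypes):
    """Regression mean of log-fecundity for each individual (no intercept)."""
    return [sum(b * z for b, z in zip(betas, row)) for row in phenotypes]

def r2_draw(betas, sigma, phenotypes):
    """Explained variance fraction for one draw: V_fit / (V_fit + sigma^2)."""
    v_fit = variance(predicted_means(betas, phenotypes))
    return v_fit / (v_fit + sigma ** 2)

# Toy data: three individuals, one standardized phenotype.
phenotypes = [[-1.0], [0.0], [1.0]]
print(r2_draw(betas=[2.0], sigma=2.0, phenotypes=phenotypes))
```

Applying `r2_draw` to every retained MCMC draw would give a posterior sample of R², from which a point estimate and a credible interval follow directly.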
2.3 Simulations
Because the method described here is an extension of the existing methods, many of its properties were already tested in simulations (Burczyk et al., 2002; Burczyk & Chybicki, 2004; Burczyk & Koralewski, 2005; Chybicki, 2013, 2018; Chybicki et al., 2019; Klein et al., 2011, 2013). Therefore, our simulations were designed to focus on (i) the ability of the novel method to properly select and quantify selection gradients in the face of different levels of residual variance as well as different sampling designs, (ii) comparisons between methods based on the classic neighborhood model and the hierarchical neighborhood model in the face of nonzero residual variance and (iii) comparisons between the hierarchical approach and the two-step approach based on estimated fecundities (i.e. the null hierarchical model + a separate regression analysis).
The reference simulation algorithm began with generating 100 hermaphrodite plants. For each plant, spatial co-ordinates were drawn randomly within a square plot sized to give a population density of 20 individuals per hectare. Then, genotypes were randomly assigned at 16 loci, each with 8 alleles. Finally, for each plant, three independent phenotype variables were drawn from a normal distribution. Phenotypes were normalized to get mean 0 and variance 1.
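The first stage of the simulation can be sketched as below. The population size of 100 plants follows from the sample description later in this section; equal allele frequencies within loci and the function name `simulate_plants` are assumptions made only for this illustration.

```python
import math
import random
from statistics import mean, pstdev

def simulate_plants(n_plants=100, density_per_ha=20, n_loci=16,
                    n_alleles=8, n_traits=3, seed=0):
    """Generate positions, diploid genotypes and standardized phenotypes."""
    rng = random.Random(seed)
    # Side of a square plot (metres) that gives the requested density.
    side = math.sqrt(n_plants / density_per_ha * 10_000)
    coords = [(rng.uniform(0, side), rng.uniform(0, side))
              for _ in range(n_plants)]
    # Two alleles per locus, drawn with equal frequencies (an assumption).
    genotypes = [[(rng.randrange(n_alleles), rng.randrange(n_alleles))
                  for _ in range(n_loci)] for _ in range(n_plants)]
    # Independent normal traits, each rescaled to mean 0 and variance 1.
    columns = []
    for _ in range(n_traits):
        raw = [rng.gauss(0.0, 1.0) for _ in range(n_plants)]
        m, s = mean(raw), pstdev(raw)
        columns.append([(v - m) / s for v in raw])
    phenotypes = [list(row) for row in zip(*columns)]
    return side, coords, genotypes, phenotypes

side, coords, genotypes, phenotypes = simulate_plants()
print(round(side, 1), len(coords), len(genotypes[0]), len(phenotypes[0]))
```

With 100 plants at 20 individuals per hectare, the plot covers 5 ha, so its side is about 224 m.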
Twenty plants were drawn to serve as mother plants. To simulate a sample of offspring, 20 offspring individuals per mother were generated as in Chybicki (2018). First, for each (k-th) candidate parent, the (expected) log-fecundity was drawn from the normal distribution with the mean μk and the standard deviation σ, where μk was computed from the three phenotypic variables using the slopes β1, β2 and β3. The slopes were set so that the first variable had a strong positive effect; the second variable had a moderate negative effect, while the third variable did not affect male fecundity (β3 = 0). Then, each candidate father was assigned the expected proportion of gametes in a local pool of outcross pollen of a given maternal individual. To simulate pollen dispersal, the shape parameter of the exponential-power function was set to 0.5, while the scale parameter was adjusted to get the mean distance of pollen dispersal of 50 metres. Each offspring in the j-th family was randomly assigned as a result of either self-pollination (with the probability s), pollen immigration (with the probability m), or local outcross pollination (with the probability c). For the j-th family, the vector of these three probabilities was drawn from a Dirichlet distribution. The average probability of self-pollination, pollen immigration, and local outcrossing was 0.020, 0.196, and 0.784, and 90% of values lay within 0–0.055, 0.025–0.443 and 0.536–0.965, respectively. For an offspring with a local outcross origin, the father was drawn at random according to the expected proportions, computed based on both the pollen dispersal kernel and fecundity. Offspring genotypes were generated assuming the Mendelian laws. Background allele frequencies for each family were drawn from the Dirichlet distribution, assuming fixed values of the divergence rate F and of the global frequencies of alleles 1 to 8. As a result, a sample eventually consisted of 100 parents (including mothers) and 400 progeny. Due to pollen immigration and self-pollination, about 300 progeny contained information about the distribution of local pollen dispersal and male fecundity.
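The adjustment of the scale parameter to a target mean dispersal distance can be made explicit. The sketch assumes the usual two-dimensional exponential-power kernel, p(r) = b / (2π a² Γ(2/b)) · exp(−(r/a)^b), whose mean distance is a·Γ(3/b)/Γ(2/b); the function names are hypothetical.

```python
import math

def exp_power_scale(mean_distance, shape):
    """Scale a such that the kernel's mean dispersal distance equals mean_distance."""
    return mean_distance * math.gamma(2.0 / shape) / math.gamma(3.0 / shape)

def exp_power_kernel(r, scale, shape):
    """Probability density per unit area at distance r from the pollen source."""
    norm = shape / (2.0 * math.pi * scale ** 2 * math.gamma(2.0 / shape))
    return norm * math.exp(-((r / scale) ** shape))

a = exp_power_scale(mean_distance=50.0, shape=0.5)
print(f"scale a = {a:.3f} m")
print(exp_power_kernel(10.0, a, 0.5), exp_power_kernel(100.0, a, 0.5))
```

For a shape of 0.5, Γ(4) = 6 and Γ(6) = 120, so a mean distance of 50 m corresponds to a scale of 2.5 m under this parametrization; the small scale combined with a fat tail is typical of leptokurtic kernels.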
We first aimed to study the effect of unexplained variance in fecundity. For this purpose, data were generated as in the reference simulation, modifying σ to get three overdispersion levels (low, medium and high). The resulting R² values for these three levels concentrated on 0.95 (±0.01), 0.83 (±0.02) and 0.56 (±0.04), respectively. Additionally, to study the omitted variable effect, we used the reference data with the 1st phenotypic variable replaced by a random normal deviate.
To study the impact of sampling effort on the quality of estimates, the reference algorithm was modified as follows. First, we focused on the impact of progeny sample size so that instead of 20 offspring per mother, we simulated 10 or 40 to get the total number of progeny of 200 or 800 (instead of 400). Subsequently, we studied a potential effect of the proportion of offspring to adults by setting this proportion to three alternative levels. To assess whether the number of families alters the quality of estimates, instead of 20 families, data were generated assuming 10 or 40 families, but the total number of offspring remained as in the reference simulation.
For each scenario, 100 replicates were simulated and subjected to the analysis with the NM2F software (Chybicki & Burczyk, 2013), modified appropriately to incorporate the RJMCMC algorithm. The posterior distribution was approximated with 100,000 MCMC updates (keeping every 20th update), after 20,000 initial iterations for pilot adjusting. The marginal posterior distributions of parameters were computed based on the subset of MCMC samples representing the most probable regression model. Estimates of selection gradients, the standard deviation σ and R² were characterized by the bias, the mean squared error (MSE), and the coverage of the 95% highest posterior density interval (HPDI), approximated as the shortest interval containing 95% of parameter draws. In the case of R², the expected value was computed empirically for each simulated data set, using the expected residual of male fecundity for each (k-th) individual. The model selection procedure was summarized with the frequency of selection of the true vs. the null regression model and the frequency of selection of a regression model including the m-th phenotypic variable. In this way, the frequencies of false positives and false negatives were assessed.
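The three quality measures can be sketched in a few lines. The HPDI is approximated, as described above, by the shortest interval containing the requested share of draws; the function names and the toy numbers are illustrative assumptions.

```python
import math

def hpdi(draws, mass=0.95):
    """Shortest interval containing at least `mass` of the posterior draws."""
    s = sorted(draws)
    n = len(s)
    # Number of draws inside the interval; round() guards against float noise.
    k = max(1, math.ceil(round(mass * n, 9)))
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

def summarize(estimates, intervals, truth):
    """Bias, mean squared error and interval coverage across replicates."""
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    coverage = sum(lo <= truth <= hi for lo, hi in intervals) / n
    return bias, mse, coverage

print(hpdi([0, 1, 2, 3, 100], mass=0.8))
print(summarize([1.0, 3.0], [(0.0, 2.0), (2.5, 4.0)], truth=2.0))
```

In a simulation study, `estimates` and `intervals` would hold one point estimate and one HPDI per replicate, and `truth` the value used to generate the data.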
The hierarchical neighborhood model was designed to deal with overdispersion in male fecundity that is not explained by selection gradients implemented in the classic neighborhood model. In order to compare the two methods under overdispersion in fecundity, data were generated as for the reference scenario, except that (i) pollen source probabilities were assumed uniform across families, and (ii) background pollen pools were assumed uniform across families (i.e. F = 0) and equal to the frequencies in the simulated population. These modifications allowed simulated data to meet exactly the assumptions of the classic neighborhood model implemented in the NMπ software (Chybicki, 2018) so that the unexplained variance in fecundity remained the only source of potential differences between methods.
Three overdispersion levels were simulated, corresponding to three values of σ. Then, data were analysed with NMπ (the maximum likelihood approach, the classic model) and NM2F (the Bayesian approach, the hierarchical model) software in parallel. In the case of NMπ, estimates were derived based on the best model after model selection performed using the likelihood ratio test. Because this part of the simulation study required manual handling of the software, only 20 replicates per scenario were analysed.
The hierarchical method performs the estimation of the effects of phenotypic variables on male fecundity in a single estimation step. Nonetheless, it is also possible to run the two-step analysis, where individual fecundities are first estimated based on the null model (equivalent to that developed by Klein et al., 2008) and a subsequent regression analysis is performed to estimate a regression model for fecundity (as in Oddou-Muratorio et al., 2018). Because the quality of fecundity estimates depends on the amount of information used in a model, the hierarchical approach and the two-step approach can differ in this respect. Therefore, we compared the quality of estimated regression parameters based on the hierarchical approach and the two-step approach. We focused on simulated data for the three levels of sampling effort. For each level, we analysed 20 replicates, assuming the null regression model (with slopes fixed at zero). Posterior median estimates of individual male fecundity were then used to run a stepwise regression analysis.
2.4 The case studies
Two real data sets were analyzed to demonstrate the capabilities of the new method for different sampling designs. In both cases, the analysis was conducted primarily using the new method (the hierarchical neighborhood model). In addition, data were analysed with the classic neighborhood model as well as the two-step approach, where individual fecundities and regression slopes are estimated separately. In these additional analyses, we focused on the estimates of regression slopes, treating the estimates based on the hierarchical model as a reference.
2.4.1 Norway spruce
The first example was the clonal seed orchard of Norway spruce (Picea abies (L.) Karst.) (Dering et al., 2014). Although the study plot did not represent a natural population, the study system possessed all major characteristics of a typical study designed for analyzing plant mating patterns. The trees produced seeds after unconstrained pollination, and there was no isolation from external pollen sources, with pollen immigration reaching 58% on average (Dering et al., 2014). According to the paternity analysis of the seed crop in three mast seasons (1996, 2004 and 2006), individual trees tended to contribute differently to maternal pollen pools, with 50% of the total seed crop produced by ca. 10% of trees. Paternity analysis revealed that the distance between mates is a significant factor in mating success. However, the impact of phenotypic variables was not studied.
The sample contained 447 trees, including five mother trees, characterized by microsatellite genotypes and spatial co-ordinates. In addition, each tree was characterized with three variables that are potentially related to male fecundity, that is male strobili abundance (i.e. the estimated number of male cones based on field observations), tree height (a field measurement) and crown volume (estimated based on measurements of tree height, first branching height and crown projection). For genotyping, five microsatellite loci were used (121 detected alleles in total), yielding the combined exclusion probability of 0.9879. Because field measurements were taken in the single mast year (2006), only seeds collected that year were analyzed in the case study (500 seeds).
To match the reality of the study plot and get extra information about mating patterns, the hierarchical neighborhood model was modified as follows. To account for possible directionality in pollen dispersal, the exponential-power-von Mises dispersal kernel was assumed (see Chybicki, 2018 for the equation). Hence, the dispersal kernel had two additional parameters: κ, the rate of anisotropy, and α0, the prevailing direction of dispersal. Because the von Mises distribution is circular, κ was assumed to be uniformly distributed between zero and infinity, while α0 was assumed to be uniformly distributed between 0 and 2π. In this way, we avoided potential problems with the posterior distribution due to the equivalence of (κ, α0) and (−κ, α0 + π). Parent–offspring genotype mismatches were treated as in the NMπ software (Chybicki, 2018), with locus-specific error rates assumed to follow the negative exponential distribution truncated at 1. Finally, the Dirichlet distribution for the background pollen pools was assumed to have family-specific divergence parameters (one F per family), with the exponential distribution, truncated at 1, taken as a prior (Chybicki, 2013).
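The reason for restricting the anisotropy rate to non-negative values can be verified numerically. The sketch below uses only the directional part of a von Mises density, exp(κ·cos(θ − α0)), and omits the normalizing constant, which is an even function of κ and therefore does not affect the comparison; the function name and the parameter values are hypothetical.

```python
import math

def von_mises_weight(theta, kappa, alpha0):
    """Unnormalized von Mises weight for dispersal direction theta (radians)."""
    return math.exp(kappa * math.cos(theta - alpha0))

kappa, alpha0 = 1.5, 0.7  # hypothetical anisotropy rate and prevailing direction
for theta in (0.0, 1.0, 2.5, 4.0, 6.0):
    w_pos = von_mises_weight(theta, kappa, alpha0)
    w_neg = von_mises_weight(theta, -kappa, alpha0 + math.pi)
    print(f"theta={theta:.1f}  {w_pos:.6f}  {w_neg:.6f}")
```

Both parameter pairs give the same weight for every direction, so allowing negative anisotropy rates would create two equivalent modes in the posterior distribution.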
Although the assumption of nonrandom pollen dispersal was reasonable in the light of an earlier study (Dering et al., 2014), we also ran the analysis based on the assumption of completely random dispersal. In this case, κ was fixed at zero, and the dispersal function was fixed at 1. Hence, in the ‘random dispersal’ model, individual reproductive success depended only on fecundity parameters. To compare a relative predictive fit between the two models, we computed the widely applicable information criterion (WAIC) (Watanabe, 2010), which accounts for a different number of parameters between models and is well suited to hierarchical models (Gelman et al., 2014). The model with lower WAIC was chosen as the one characterized by better relative goodness of fit.
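For readers unfamiliar with WAIC, the computation can be sketched from a matrix of pointwise log-likelihoods, following the variance-based penalty described by Gelman et al. (2014). The layout of the input (one row per posterior draw, one column per offspring) and the function name `waic` are assumptions for this illustration, not a description of the software used here.

```python
import math
from statistics import variance

def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i under posterior draw s.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        top = max(column)  # log-sum-exp trick for numerical stability
        lppd += top + math.log(sum(math.exp(v - top) for v in column) / n_draws)
        p_waic += variance(column)
    return -2.0 * (lppd - p_waic)

# Toy input: three draws, two observations, no posterior uncertainty.
toy = [[-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0]]
print(waic(toy))
```

When the log-likelihood does not vary across draws, the penalty vanishes and WAIC reduces to minus twice the total log-likelihood; lower values indicate a better expected predictive fit.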
To get estimates of regression slopes under the classic neighborhood model, data were analysed with the NMπ software (Chybicki, 2018). As in the above analysis, we set the following parameters as being estimable: pollen immigration (m), self-fertilization (s), exponential-power-von Mises dispersal kernel (δ, b, κ, α0), mistyping errors (ε1, …, ε5) and regression slopes (β1, β2, β3). The selection of the best-fitting regression model was based on forward selection. Estimates for regression slopes under the two-step approach were obtained as in the simulation study; that is, individual log-fecundities, estimated based on the same model as described above except for the assumption β1 = β2 = β3 = 0 (no selection gradients involved), were used as a response variable in the stepwise regression analysis with the standardized phenotypes taken as explanatory variables.
2.4.2 English yew
The second example was microsatellite data collected to study seed and pollen dispersal patterns within a natural English yew population (Taxus baccata L.), a European dioecious conifer (Chybicki & Oleksa, 2018). The earlier study showed that pollen dispersal was nonrandom and followed a leptokurtic dispersal kernel. In addition, male reproductive success was linked with trunk diameter (with the slope of 0.278 ± 0.103). However, since the slope was estimated ignoring potential overdispersion in male fecundity, it might represent a false-positive result.
In contrast to the spruce study, here, progeny were sampled at the stage of naturally regenerated seedlings, without prior knowledge about maternity. However, thanks to the parentage analysis conducted with NMπ (Chybicki, 2018) and taking advantage of separate sexes, 121 out of 220 seedlings were assigned a mother tree with the probability ≥0.8. These seedlings were used as a sample of progeny with known mothers but unknown fathers. As a result, we identified 55 maternal families, with the number of progeny ranging between 1 and 10 (2.2 on average). As a local population of candidate pollen donors, 128 male trees were subsampled from the original data set.
Because of the low number of progeny per mother, we assumed that pollen source probabilities were common across families (i.e. with no variation among families). In addition, because of dioecy, we fixed the probability of self-fertilization at zero. Consequently, we took a uniform Beta distribution as a prior for pollen immigration (m) vs. local pollination (c). Also, because the pollen immigration appeared to be close to zero in the extracted subset of progeny (see the results), we estimated neither allele frequencies in the background population nor their divergence rate and, instead, we assumed that these frequencies were equal to frequencies in the sample of trees. Finally, because the sampled seedlings represented a mixture of multiple regeneration seasons, we assumed that the pollen dispersal kernel was isotropic.
The analysis was run twice, that is, under the assumption of random vs. nonrandom pollen dispersal. The predictive fit of the two models was compared using WAIC. As in the spruce case study, the regression slope was also estimated using the NMπ software and the two-step approach.
3 RESULTS
3.1 Simulations
3.1.1 The effect of overdispersion
Generally, the simulations showed that the hierarchical neighborhood model correctly identified the regression model for male fecundity (Table 1). The increase in overdispersion resulted mostly in an increase of false negatives for the medium-level slope coefficient (β2). In addition, false positives appeared to increase (i.e. β3 was included in the model) as the overdispersion level increased, but the frequency of false positives remained at an acceptable level of only 1%–2%. As for the quality of parameter estimates, the slope parameters were estimated with very little or no bias. In addition, their credible intervals showed good coverage. The level of overdispersion in fecundity (σ) appeared to be less accurately estimated, especially under low overdispersion, where its bias was largest (Table 1). On the other hand, the empirical estimate of model fit (R²) was in relatively good agreement with the expected value. Only the coverage of the credible interval for R² appeared to decrease slightly below the nominal level for the highest overdispersion. Regardless of the level of overdispersion, the method never resulted in the selection of the null model.
| Scenario | Parameter | Bias | MSE | Coverage | f(βm in model) |
|---|---|---|---|---|---|
| Low overdispersion in fecundity | β1 | 0.018 | 0.007 | 0.96 | 1.0 |
| | β2 | −0.001 | 0.008 | 0.95 | 0.99 |
| | β3 | 0.000 | 0.000 | 1 | 0 |
| | σ | 0.090 | 0.012 | 0.91 | |
| | R² | −0.033 | 0.002 | 0.94 | |
| | f(True model) = 0.99 | | | | |
| Medium overdispersion in fecundity | β1 | 0.002 | 0.011 | 0.95 | 1.0 |
| | β2 | −0.015 | 0.009 | 0.95 | 1.0 |
| | β3 | 0.003 | 0.001 | 0.99 | 0.01 |
| | σ | 0.018 | 0.007 | 1 | |
| | R² | −0.006 | 0.002 | 0.98 | |
| | f(True model) = 0.99 | | | | |
| High overdispersion in fecundity | β1 | 0.041 | 0.028 | 0.96 | 1.0 |
| | β2 | 0.021 | 0.051 | 0.82 | 0.86 |
| | β3 | −0.007 | 0.002 | 0.98 | 0.02 |
| | σ | 0.033 | 0.025 | 0.95 | |
| | R² | 0.012 | 0.008 | 0.88 | |
| | f(True model) = 0.85 | | | | |
| Omitted variable | β1 | 0.000 | 0.000 | 1 | 0 |
| | β2 | 0.049 | 0.054 | 0.80 | 0.83 |
| | β3 | 0.008 | 0.004 | 0.98 | 0.02 |
| | σ (a) | – | – | – | |
| | R² | 0.001 | 0.009 | 0.81 | |
| | f(True model) = 0.81 | | | | |

- f(βm in model) – the frequency of selection of a regression model including βm, βm – the slope of the effect (a regression parameter) for the m-th variable, σ – the standard deviation of the normal distribution that captures the overdispersion in fecundity, R² – the fraction of variance explained by the regression model, f(True model) – the frequency of selecting the true model.
- a The actual value of the overdispersion parameter is unknown because the omitted variable introduced extra unexplained variance.
The effect of the omitted variable was similar to the effect of increased overdispersion (Table 1). Because the omitted variable had a strong effect, the excess in the variation of the realized fecundity effectively weakened the ability of the method to properly identify the effect of the moderate-effect variable. As a result, we observed an increased frequency of selection of the null model (17%) instead of the actual model with one variable. On the other hand, the omitted variable scenario resulted in a very low frequency of false positives. Also, the variance proportion explained by the regression model was estimated without bias. The relatively low coverage level for R² (81%) followed exactly the frequency of selection of the true model. Interestingly, for the subset of results where the best model was different from the null model, the coverage for R² increased to 97% (results not shown).
3.1.2 The effect of sampling design
Simulations showed that the number of progeny had only a moderate effect on the quality of estimates of parameters related to the regression model (Table 2). The effect manifested mainly as an increased MSE. Because parameter estimates showed little bias, the increased MSE was mostly due to the increased variance of estimates. However, the quality of credible intervals remained comparably good regardless of the sampling effort. A slightly decreased frequency of selection of the true model was another effect of the reduced number of progeny. Nonetheless, even for the sample size of 200 progeny individuals, the model selection procedure resulted in 95% accuracy. The reduction of progeny number also resulted in a slight decrease in the frequency of identification of the moderate effect, leading to 4% of false negatives. On the other hand, the frequency of false positives remained almost unaffected.
| Number of progeny | Parameter | Bias | MSE | Coverage | f(variable in model) |
|---|---|---|---|---|---|
| 200 | Slope, variable 1 | 0.010 | 0.018 | 0.92 | 1 |
| | Slope, variable 2 | −0.006 | 0.021 | 0.94 | 0.96 |
| | Slope, variable 3 | 0.003 | 0.001 | 0.99 | 0.01 |
| | Overdispersion SD | 0.062 | 0.018 | 0.97 | |
| | R2 | −0.030 | 0.006 | 0.93 | |
| | f(True model) | | | | 0.95 |
| | f(Null model) | | | | 0 |
| 400 | Slope, variable 1 | 0.002 | 0.011 | 0.95 | 1 |
| | Slope, variable 2 | −0.015 | 0.009 | 0.95 | 1 |
| | Slope, variable 3 | 0.003 | 0.001 | 0.99 | 0.01 |
| | Overdispersion SD | 0.018 | 0.007 | 1 | |
| | R2 | −0.006 | 0.002 | 0.98 | |
| | f(True model) | | | | 0.99 |
| | f(Null model) | | | | 0 |
| 800 | Slope, variable 1 | 0.011 | 0.007 | 0.92 | 1 |
| | Slope, variable 2 | −0.005 | 0.007 | 0.98 | 1 |
| | Slope, variable 3 | 0.000 | 0.000 | 1 | 0 |
| | Overdispersion SD | 0.015 | 0.005 | 0.97 | |
| | R2 | −0.001 | 0.001 | 0.95 | |
| | f(True model) | | | | 1.00 |
| | f(Null model) | | | | 0 |

- f(variable in model) – the frequency of selection of a regression model including the given variable; Slope, variable i – the slope of the effect (a regression parameter) for the i-th variable; Overdispersion SD – the standard deviation of the normal distribution that captures the overdispersion in fecundity; R2 – the fraction of variance explained by the regression model; f(True model) – the frequency of selection of the true model.
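The statement that the increased MSE mostly reflects increased variance follows from the decomposition MSE = bias² + variance. A minimal sketch of that arithmetic, using the rounded bias and MSE values from the first parameter row of Table 2 (the first regression slope); the function name is hypothetical and not part of the authors' software:

```python
def variance_from_mse(bias: float, mse: float) -> float:
    """Return the variance implied by MSE = bias**2 + variance."""
    return mse - bias ** 2

# (bias, MSE) of the first slope for 200, 400 and 800 progeny, as printed in Table 2
table2_first_slope = {200: (0.010, 0.018), 400: (0.002, 0.011), 800: (0.011, 0.007)}

variance_share = {}
for n_progeny, (bias, mse) in table2_first_slope.items():
    var = variance_from_mse(bias, mse)
    variance_share[n_progeny] = var / mse  # fraction of the MSE due to variance
    print(f"{n_progeny} progeny: variance = {var:.4f} "
          f"({variance_share[n_progeny]:.1%} of MSE)")
```

Because the tabulated values are rounded, the shares are approximate, but in each case the squared bias accounts for only a small part of the MSE.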
Generally, doubling the number of progeny per adult increased the frequency of selection of the true model and decreased the bias and MSE of estimates. Nonetheless, the relative sampling effort, that is, the ratio of the number of progeny to the number of adults, appeared to have little effect on the selection of the true regression model and the quality of estimates of variable effects, especially when compared with the effect of the number of adults alone. For a fixed number of progeny (200 or 400), the sampling scenario with 50 adults was characterized by a lower inclusion probability of the moderate variable (Table 3) than the scenario with 100 adults (Table 2).
| Number of adults | Number of progeny | Parameter | Bias | MSE | Coverage | f(variable in model) |
|---|---|---|---|---|---|---|
| 50 | 200 | Slope, variable 1 | 0.036 | 0.019 | 0.99 | 1 |
| | | Slope, variable 2 | 0.029 | 0.038 | 0.87 | 0.89 |
| | | Slope, variable 3 | 0.004 | 0.001 | 0.99 | 0.01 |
| | | Overdispersion SD | 0.074 | 0.021 | 0.97 | |
| | | R2 | −0.029 | 0.007 | 0.96 | |
| | | f(True model) | | | | 0.88 |
| 100 | 400 | Slope, variable 1 | 0.002 | 0.011 | 0.95 | 1 |
| | | Slope, variable 2 | −0.015 | 0.009 | 0.95 | 1 |
| | | Slope, variable 3 | 0.003 | 0.001 | 0.99 | 0.01 |
| | | Overdispersion SD | 0.018 | 0.007 | 1 | |
| | | R2 | −0.006 | 0.002 | 0.98 | |
| | | f(True model) | | | | 0.99 |
| 50 | 400 | Slope, variable 1 | 0.031 | 0.014 | 0.98 | 1 |
| | | Slope, variable 2 | −0.004 | 0.026 | 0.87 | 0.96 |
| | | Slope, variable 3 | −0.004 | 0.002 | 0.99 | 0.01 |
| | | Overdispersion SD | 0.038 | 0.012 | 0.99 | |
| | | R2 | −0.001 | 0.004 | 0.94 | |
| | | f(True model) | | | | 0.95 |
| 100 | 800 | Slope, variable 1 | 0.011 | 0.007 | 0.92 | 1 |
| | | Slope, variable 2 | −0.005 | 0.007 | 0.98 | 1 |
| | | Slope, variable 3 | 0.000 | 0.000 | 1 | 0 |
| | | Overdispersion SD | 0.015 | 0.005 | 0.97 | |
| | | R2 | −0.001 | 0.001 | 0.95 | |
| | | f(True model) | | | | 1.0 |

- f(variable in model) – the frequency of selection of a regression model including the given variable; Slope, variable i – the slope of the effect (a regression parameter) for the i-th variable; Overdispersion SD – the standard deviation of the normal distribution that captures the overdispersion in fecundity; R2 – the fraction of variance explained by the regression model; f(True model) – the frequency of selection of the true model.
In simulations, we observed no clear trend in bias, MSE or model selection that could correspond with the change in the number of maternal families (Table 4). To some extent, the coverage tended to decrease, and the frequency of false positives tended to increase as the number of families increased. However, differences between scenarios were too small to draw robust conclusions in this respect.
| Number of families | Parameter | Bias | MSE | Coverage | f(variable in model) |
|---|---|---|---|---|---|
| 10 | Slope, variable 1 | 0.028 | 0.012 | 0.99 | 1 |
| | Slope, variable 2 | −0.004 | 0.006 | 0.99 | 1 |
| | Slope, variable 3 | 0.006 | 0.002 | 0.98 | 0.02 |
| | Overdispersion SD | 0.052 | 0.013 | 0.92 | |
| | R2 | −0.016 | 0.003 | 0.96 | |
| | f(True model) | | | | 0.98 |
| 20 | Slope, variable 1 | 0.002 | 0.011 | 0.95 | 1 |
| | Slope, variable 2 | −0.015 | 0.009 | 0.95 | 1 |
| | Slope, variable 3 | 0.003 | 0.001 | 0.99 | 0.01 |
| | Overdispersion SD | 0.018 | 0.007 | 1 | |
| | R2 | −0.007 | 0.002 | 0.98 | |
| | f(True model) | | | | 0.99 |
| 40 | Slope, variable 1 | 0.019 | 0.011 | 0.93 | 1 |
| | Slope, variable 2 | 0.018 | 0.013 | 0.94 | 0.98 |
| | Slope, variable 3 | 0.000 | 0.000 | 1 | 0 |
| | Overdispersion SD | 0.036 | 0.010 | 0.94 | |
| | R2 | −0.015 | 0.003 | 0.95 | |
| | f(True model) | | | | 0.98 |

- f(variable in model) – the frequency of selection of a regression model including the given variable; Slope, variable i – the slope of the effect (a regression parameter) for the i-th variable; Overdispersion SD – the standard deviation of the normal distribution that captures the overdispersion in fecundity; R2 – the fraction of variance explained by the regression model; f(True model) – the frequency of selection of the true model.
3.1.3 The hierarchical approach vs. the classic approach
In the case of overdispersion in fecundity, the classic approach tended to preserve sensitivity at the cost of specificity in identifying the effects of phenotypic variables on male fecundity (Table 5). In contrast, the hierarchical approach showed the opposite tendency. However, false negatives appeared less frequently in the hierarchical approach than false positives did in the classic one. In terms of bias and MSE, the two methods showed comparable quality, suggesting that the estimation procedure (MLE vs. Bayesian approach) does not influence the quality of estimates.
| Overdispersion | Parameter | Classic model: Bias | Classic model: MSE | Classic model: f(variable in model) | Hierarchical model: Bias | Hierarchical model: MSE | Hierarchical model: f(variable in model) |
|---|---|---|---|---|---|---|---|
| ![]() | Slope, variable 1 | 0.024 | 0.006 | 1 | 0.037 | 0.007 | 1 |
| | Slope, variable 2 | −0.003 | 0.002 | 1 | −0.002 | 0.002 | 1 |
| | Slope, variable 3 | 0.006 | 0.001 | 0.05 | 0 | 0 | 0 |
| | f(True model) | | | 0.95 | | | 1.00 |
| ![]() | Slope, variable 1 | 0.042 | 0.017 | 1 | 0.055 | 0.015 | 1 |
| | Slope, variable 2 | 0.001 | 0.008 | 1 | −0.019 | 0.005 | 1 |
| | Slope, variable 3 | 0.000 | 0.004 | 0.2 | 0 | 0 | 0 |
| | f(True model) | | | 0.80 | | | 1.00 |
| ![]() | Slope, variable 1 | −0.098 | 0.042 | 1 | −0.056 | 0.022 | 1 |
| | Slope, variable 2 | 0.080 | 0.034 | 1 | 0.102 | 0.073 | 0.75 |
| | Slope, variable 3 | −0.015 | 0.037 | 0.5 | 0 | 0 | 0 |
| | f(True model) | | | 0.50 | | | 0.75 |

- f(variable in model) – the frequency of selection of a regression model including the given variable; Slope, variable i – the slope of the effect (a regression parameter) for the i-th variable; Overdispersion – the standard deviation of the normal distribution that captures the overdispersion in fecundity; f(True model) – the frequency of selection of the true model.
3.1.4 The hierarchical approach vs. the regression analysis on estimated fecundities
In contrast to the method based on the hierarchical neighborhood model, the combination of the estimation of individual fecundities and the standard regression analysis (the two-step approach) generated biased slope estimates (Table 6). The two-step approach tended to produce a regression model characterized by the underestimated proportion of the explained variance (R2). However, the two-step approach was successful in identifying the true regression model. Comparing three levels of the sampling effort revealed that the overall quality of estimates based on the two-step approach increased together with the number of progeny in a sample. Nonetheless, assuming the bias in R2 decreases linearly with the logarithm of the number of progeny (see Table 6), for the considered simulation setup, ca. 3000 progeny would be needed to reduce the bias to zero or to achieve the efficiency of the hierarchical approach.
| Number of seeds | Parameter | Regression on estimated fecundities: Bias | Regression on estimated fecundities: MSE | Regression on estimated fecundities: f(variable in model) | Hierarchical model: Bias | Hierarchical model: MSE | Hierarchical model: f(variable in model) |
|---|---|---|---|---|---|---|---|
| 200 | Slope, variable 1 | −0.487 | 0.252 | 1 | 0.003 | 0.019 | 1 |
| | Slope, variable 2 | 0.263 | 0.076 | 0.95 | 0.032 | 0.021 | 1 |
| | Slope, variable 3 | −0.008 | 0.001 | 0.05 | 0 | 0 | 0 |
| | R2 | −0.397 | 0.163 | | −0.008 | 0.003 | |
| | f(True model) | | | 0.90 | | | 1 |
| 400 | Slope, variable 1 | −0.352 | 0.134 | 1 | −0.008 | 0.008 | 1 |
| | Slope, variable 2 | 0.147 | 0.025 | 1 | −0.046 | 0.010 | 1 |
| | Slope, variable 3 | 0.006 | 0.001 | 0.05 | 0 | 0 | 0 |
| | R2 | −0.279 | 0.082 | | −0.016 | 0.002 | |
| | f(True model) | | | 0.95 | | | 1 |
| 800 | Slope, variable 1 | −0.224 | 0.060 | 1 | 0.002 | 0.010 | 1 |
| | Slope, variable 2 | 0.107 | 0.016 | 1 | −0.018 | 0.006 | 1 |
| | Slope, variable 3 | 0.008 | 0.001 | 0.05 | 0 | 0 | 0 |
| | R2 | −0.194 | 0.039 | | −0.007 | 0.001 | |
| | f(True model) | | | 0.95 | | | 1 |

- f(variable in model) – the frequency of selection of a regression model including the given variable; Slope, variable i – the slope of the effect (a regression parameter) for the i-th variable; R2 – the fraction of variance explained by the regression model; f(True model) – the frequency of selection of the true model.
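The log-linear extrapolation mentioned before Table 6 can be reproduced approximately. The sketch below fits bias = a + b·ln(n) by ordinary least squares to the two-step biases read here as the R2 row of Table 6 (−0.397, −0.279 and −0.194 for 200, 400 and 800 progeny) and solves for the sample size at which the fitted bias reaches zero. The function name and the choice of ordinary least squares are assumptions; the text does not say how the authors obtained their figure.

```python
import math

def zero_bias_sample_size(n_progeny, r2_bias):
    """Fit bias = a + b*ln(n) by least squares; return n where fitted bias = 0."""
    x = [math.log(n) for n in n_progeny]
    mean_x = sum(x) / len(x)
    mean_y = sum(r2_bias) / len(r2_bias)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, r2_bias))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(-intercept / slope)

# Two-step R2 bias for 200, 400 and 800 progeny (Table 6)
n_zero = zero_bias_sample_size([200, 400, 800], [-0.397, -0.279, -0.194])
print(f"fitted bias reaches zero near {n_zero:.0f} progeny")
```

With these three rounded values the fitted line crosses zero somewhat below 3000 progeny, in line with the "ca. 3000" stated in the text. An extrapolation this far beyond the simulated range should be read as an order-of-magnitude indication only.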
3.2 The case studies
3.2.1 Norway spruce
The model with nonrandom pollen dispersal was characterized by WAIC = 10,674.2, while the model with random pollen dispersal had WAIC = 10,706.9. The difference in WAIC of 32.7 between models provided sufficient support for choosing the nonrandom dispersal model as the one with the better predictive fit. Therefore, we focused primarily on estimates under the model with nonrandom pollen dispersal, except for parameters related to the effects of phenotypic variables on male fecundity, where we compared the results under the two models.
The estimated frequencies of genotyping errors spanned between 0.5% and 7.2% (posterior medians) with a mean error frequency of 2.1% (posterior median). The posterior median of the parameter of truncated exponential distribution used as a prior for individual error frequencies was 0.022. The HPD interval spanned between 0.007 and 0.053. Overall, the estimates revealed that genotyping errors were scarce and concerned effectively only one marker.
The analysis revealed that, on average, 64% of pollen gametes came from the background population. In this regard, mother trees showed significant differences, as individual estimates varied between 48% and 79%. Almost 6% of successful pollen was produced by mother trees themselves, resulting in self-fertilization. Thus, 28% of pollen gametes came from local trees due to outcross fertilization. These gametes (roughly 140 seeds) can be considered an effective sample for estimating parameters related to male reproductive success, that is, the pollen dispersal kernel and pollen donor fecundity.
The analysis also revealed that pollen gametes that originated in the background population were characterized by considerable genetic divergence levels among mother trees. Individual estimates of divergence parameter ranged from 0.128 to 0.321, with the mean across mother trees of 0.228.
Among pollen dispersal kernel parameters, the mean forward dispersal distance proved difficult to characterize precisely, as revealed by the 95% credible interval spanning from 2 to 56,233 m. The posterior median of the mean dispersal distance equalled 575 m. The posterior distribution of the shape parameter was characterized by a median of 0.073 and a 95% credible interval between 0.010 and 0.466, revealing that the dispersal function was highly fat-tailed. In addition, the dispersal function was highly anisotropic, with the prevailing direction of 173° clockwise from North (95% HPD between 130° and 210°), or approximately due South. The median estimate of the rate of anisotropy equalled 1.005, with the 95% HPD interval between 0.281 and 1.714.
Individual posterior median estimates of male log-fecundity spanned between −2.92 and 3.85 (Figure 1), with a mean of 0.028 and a standard deviation of 1.219. Although the empirical distribution of posterior medians approached a normal-like distribution (Figure 1), according to the Shapiro–Wilk test, it deviated significantly from normality (p = 0.017), mostly due to the presence of greater variation in the right tail of the distribution. Individual estimates of male fecundity exhibited high variance, so that only eight (out of 447) estimates revealed a significant departure from zero, according to the estimated 95% credible intervals. In all significant cases, log-fecundity was greater than zero.


Three phenotypic variables generated eight alternative regression models (including the null model). Under the assumption of nonrandom pollen dispersal, the model with the highest posterior probability of 0.57 contained male strobili abundance and tree height (Figure 2). The second-best model, with the probability of 0.30, contained only tree height as an explanatory variable. The remaining models had a cumulative posterior probability of 0.13. On the other hand, under the random pollen dispersal, the best model, containing tree height only, had the posterior probability of 0.41. The second-best model was the null model with the probability of 0.26.
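How three candidate variables yield eight models, and how posterior inclusion probabilities follow from posterior model probabilities, can be sketched as follows. The probabilities 0.57 and 0.30 are those reported above for nonrandom dispersal; the split of the remaining 0.13 among the other six models is hypothetical and chosen only for illustration, as are the short variable names.

```python
from itertools import combinations

variables = ["strobili", "height", "crown"]

# All 2**3 = 8 candidate models, from the null model to the full model
models = [frozenset(c) for k in range(len(variables) + 1)
          for c in combinations(variables, k)]

# Posterior model probabilities: 0.57 and 0.30 are reported in the text;
# the remaining six values (summing to 0.13) are HYPOTHETICAL.
post = {m: 0.0 for m in models}
post[frozenset({"strobili", "height"})] = 0.57
post[frozenset({"height"})] = 0.30
post[frozenset()] = 0.02
post[frozenset({"strobili"})] = 0.06
post[frozenset({"crown"})] = 0.01
post[frozenset({"strobili", "crown"})] = 0.01
post[frozenset({"height", "crown"})] = 0.02
post[frozenset({"strobili", "height", "crown"})] = 0.01

def inclusion_probability(var: str) -> float:
    """Sum the posterior probabilities of all models that contain `var`."""
    return sum(p for m, p in post.items() if var in m)

for v in variables:
    print(v, round(inclusion_probability(v), 3))
```

The inclusion probability of a variable is thus the total posterior mass of the models containing it, which is why it can exceed the probability of the single best model.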

Based on the best model selected under nonrandom pollen dispersal, tree height had the strongest, positive effect on male fecundity, with a posterior median regression slope of 0.932 (95% HPDI: 0.294–1.712) and a posterior inclusion probability of 0.903. For male strobili abundance, the posterior median regression slope was 0.445 (95% HPDI: 0.126–0.732) and the posterior inclusion probability was 0.655. The estimate of the overdispersion parameter was relatively high, with a posterior median of 1.243. However, the estimate was characterized by a wide credible interval between 0.640 and 2.008 (see also Figure 3). The two variables explained roughly 50% of the total variance in male fecundity, but the precision of the R2 estimate was relatively low, as revealed by the empirical estimate of the 95% credible interval between 0.158 and 0.808. For comparison, the estimate of the slope for tree height under the second-best model was 1.186 (95% HPDI: 0.444–2.008). However, the second-best regression model was characterized by an increased overdispersion of 1.648 (95% HPDI: 0.993–2.470) and a diminished R2 of 0.352 (95% HPDI: 0.037–0.683), compared to the best model.

Based on the best model with random pollen dispersal, tree height again had a strong positive effect on male fecundity, with a similar slope of 0.970 (95% HPDI: 0.102–1.924) but a much lower posterior inclusion probability of 0.543 compared to the best model under nonrandom pollen dispersal. Moreover, estimates of both the overdispersion (95% HPDI: 1.206–2.296) and, especially, the proportion of explained variance in fecundity (95% HPDI: 0.000–0.526) revealed that the estimated regression model was of poor quality compared to the regression model under nonrandom dispersal.
Using both the classic neighborhood model and the two-step approach, we selected the same regression model for male fecundity as with the hierarchical approach. The classic model also yielded very similar slope estimates, that is, 0.506 (95% CI: 0.366–0.646) and 1.307 (95% CI: 0.775–1.840), but with narrower confidence bounds than the hierarchical approach. On the other hand, slopes estimated using the two-step approach were significantly lower than the hierarchical approach-based estimates, that is, 0.097 (95% CI: 0.031–0.127) and 0.066 (95% CI: 0.001–0.132). Interestingly, the strength of effects was reversed as compared to the reference estimates.
3.2.2 English yew
The nonrandom pollen dispersal model showed a better predictive fit (WAIC = 2,607.9) than the model with random pollen dispersal (WAIC = 2,832.4). Therefore, we focused on the results obtained based on the best-fitting model.
Estimates of genotyping error frequencies ranged from 0.002 to 0.101 with a grand mean of 0.018. In agreement with the earlier study, only one marker (Me-998-304A) showed a non-negligible frequency of genotyping errors of 0.101. Generally, estimates of genotyping errors suggested that genotypic data provided solid information about genealogies.
The posterior median pollen immigration frequency was only 0.026 (95% HPDI: 0.000–0.090), suggesting that the great majority of seedlings had two parents within the study plot. Hence, estimates of parameters of male reproductive success reported below were effectively a function of 117 seedlings.
The posterior median of the mean forward dispersal distance was 2,399 m (95% HPDI: 50–69,081 m), whereas the shape parameter of the exponential-power function was characterized by a posterior median of 0.081 (95% HPDI: 0.022–0.237). In other words, the analysis revealed that the pollen dispersal kernel function was strongly fat-tailed.
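The link between the mean dispersal distance and the shape parameter can be illustrated with a short sketch. It assumes the commonly used two-dimensional exponential-power parametrization, in which the density is proportional to exp(−(r/a)^b) and the mean distance equals a·Γ(3/b)/Γ(2/b); the kernel definition is not restated in this section, so this parametrization and the function names are assumptions.

```python
import math

def mean_distance(scale: float, shape: float) -> float:
    """Mean dispersal distance: scale * Gamma(3/shape) / Gamma(2/shape)."""
    # lgamma avoids overflow when 2/shape and 3/shape are large (small shape)
    return scale * math.exp(math.lgamma(3.0 / shape) - math.lgamma(2.0 / shape))

def scale_from_mean(mean: float, shape: float) -> float:
    """Invert mean_distance for the scale parameter."""
    return mean / mean_distance(1.0, shape)

# Posterior medians reported for yew: mean distance 2,399 m, shape 0.081
a_yew = scale_from_mean(2399.0, 0.081)
print(f"implied scale parameter: {a_yew:.3e} m")
```

Under this parametrization, a shape below 1 gives a tail heavier than exponential. With a shape as small as 0.081, the implied scale parameter becomes extremely small, which shows that most of the mean distance is contributed by the long tail of the kernel.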
Estimates of individual male log-fecundity spanned between −1.121 and 2.611 (Figure 4). The empirical distribution of posterior medians departed significantly from the normal distribution, as revealed by the Shapiro–Wilk test (p < 0.001). Only four individual estimates deviated significantly from zero. In all cases, males were characterized by higher fecundity than the mean.


Under the assumption of nonrandom pollen dispersal, the model with a significant effect of trunk diameter on male fecundity had the probability of 0.89 (Figure 2). Thus, despite the high uncertainty of individual fecundity estimates, we treated the null model as very unlikely. On the contrary, under the assumption of random pollen dispersal, the null model was characterized by a slightly higher posterior probability than the alternative model (0.53 vs. 0.47). Consequently, under random dispersal, we could not confirm that the trunk diameter is an important predictor of fecundity.
Under the nonrandom dispersal model, the regression slope for trunk diameter of 0.412 (95% HPDI: 0.102–0.717) pointed to a moderate effect of the variable on male fecundity (Figure 3). Indeed, the variation of individual fecundity estimates around the regression function was non-negligible (Figure 4), with the posterior median of the overdispersion parameter of 1.022 and the 95% HPD interval between 0.650 and 1.444. The estimate of R2 was only 14% (95% HPD: 1.6–37.1%), revealing that most of the individual variation in fecundity remained unexplained.
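The reported R2 can be checked approximately from the slope and the overdispersion. The sketch assumes a single standardized predictor (unit variance) and defines R2 as the variance of the linear predictor divided by the sum of that variance and the squared overdispersion parameter. Both the definition and the use of posterior medians as plug-in values are assumptions made for illustration, and a plug-in value need not equal the posterior median of R2.

```python
def r2_single_predictor(slope: float, sigma: float) -> float:
    """R2 for one standardized predictor: slope**2 / (slope**2 + sigma**2)."""
    explained = slope ** 2  # variance of slope * x when Var(x) = 1
    return explained / (explained + sigma ** 2)

# Posterior medians reported for yew: slope 0.412, overdispersion 1.022
r2_yew = r2_single_predictor(slope=0.412, sigma=1.022)
print(f"plug-in R2 for yew: {r2_yew:.3f}")
```

The plug-in value is close to 0.14, in agreement with the reported estimate of 14%.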
Under the model with random pollen dispersal, the overdispersion was the only nonzero fecundity hyper-parameter. It was characterized by the posterior median 1.081 and 95% HPD interval between 0.711 and 1.475.
Both the classic neighborhood model and the two-step approach confirmed that male fecundity is associated with trunk diameter. Also, although the estimated slopes were somewhat lower in these additional analyses, that is, 0.326 (95% CI: 0.103–0.549) and 0.126 (95% CI: 0.026–0.225) for the classic and two-step approach, respectively, they lay within the credible interval of the slope estimated based on the hierarchical approach.
4 DISCUSSION
In this study, building on the neighborhood model (Adams & Birkes, 1989; Burczyk et al., 2002; Chybicki & Burczyk, 2013; Klein et al., 2008), we developed a hierarchical model approach to study the effects of phenotypic variables on male reproductive success in plant populations. We showed that our method is characterized by improved statistical properties compared to both the classic neighborhood model (the fixed-effects type of analysis) and the two-step approach, where the estimation of fecundities and the analysis of variable effects are conducted separately. In the presence of overdispersion in male fecundity, the classic approach appeared to be slightly more sensitive in detecting moderate-to-weak effects, at the cost of an elevated risk of false-positive effects. In contrast, the two-step approach was characterized by largely biased slope estimates, although it identified the true regression model fairly reliably.
The simulations revealed that a sample of 200 progeny individuals is sufficient to correctly determine the relationship between phenotypic variables and male reproductive success. Doubling the number of progeny reduced the variance but had little influence on the sensitivity and specificity of regression parameter estimators. However, the sampling effort should be considered in the context of the effective sample size, that is, after accounting for pollen immigration. Pollen immigration is routinely observed in natural populations, and immigration rates vary greatly between populations and species (Ashley, 2010; Ellstrand, 2014). Numerous factors shape the observed level of immigration, including the distance to the nearest pollen sources, the pollen dispersal mechanism (e.g. wind- vs. animal-mediated), life form (herbs vs. trees) as well as study plot characteristics such as dimensions, shape or density (Ashley, 2010; DiLeo et al., 2018; Ellstrand, 2014). Pollen immigration can also vary at the individual level, for example due to the location of a mother plant in relation to plot boundaries (Chybicki & Burczyk, 2013). Therefore, at the research design stage, it is crucial to consider the negative impact of pollen immigration on the effective number of progeny. Assuming that pollen immigration is 50%, the minimum recommended sample size is 250–300 seeds. However, for species characterized by extensive pollen flow, the effective sample size may be 20% of the total sample size or less (e.g. Ortego et al., 2014). In such cases, ≥600 seeds should be sampled. Excessive self-fertilization can be an additional limiting factor of the effective sample size, especially in species revealing high variation in self-fertilization rates either among populations or individuals (Setsuko et al., 2013; reviewed in Whitehead et al., 2018). 
It is worth noting that the variation in self-fertilization rates can be explained with ecological variables using an analogous approach (Chybicki et al., 2019). Combining the two methods allows determinants of the observed selfing and outcrossing patterns to be fully characterized.
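The sample-size recommendations above amount to simple arithmetic on the fraction of progeny sired by local outcross pollen. A minimal sketch, assuming that the effective sample is the expected number of progeny that are neither immigrant-sired nor selfed; the function names are hypothetical, and the target of roughly 120 to 150 effective progeny is inferred from the two recommendations rather than stated by the authors.

```python
import math

def effective_progeny(n_sampled: int, immigration: float, selfing: float = 0.0) -> float:
    """Expected number of progeny sired by local outcross pollen."""
    return n_sampled * (1.0 - immigration - selfing)

def required_sample(n_effective: int, immigration: float, selfing: float = 0.0) -> int:
    """Smallest sample whose expected effective size reaches n_effective."""
    local_outcross = 1.0 - immigration - selfing
    if local_outcross <= 0.0:
        raise ValueError("no local outcross pollen expected")
    # Small tolerance guards against floating-point round-up (e.g. 120 / 0.2)
    return math.ceil(n_effective / local_outcross - 1e-9)

print(required_sample(125, immigration=0.5))  # 50% immigration
print(required_sample(120, immigration=0.8))  # effective fraction of 20%
```

The same functions accept a selfing rate, so the additional loss of effective sample size due to self-fertilization can be included at the design stage.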
The number of candidate fathers is as important as the number of offspring for the estimation of regression parameters. This stems from the fact that the estimation requires a sufficient number of data points at the higher hierarchy level. A sample of 100 adults was shown to satisfy the sample size demand, whereas a sample of 50 individuals appeared to reduce the sensitivity to low-to-moderate effects. This finding points to potential difficulties with applying the method to plants that form small populations, such as threatened species (Wade et al., 2016).
Interestingly, the number of mother plants seems to be of little importance for the correct identification of the effects of phenotypic variables. Nonetheless, the simulations represented a rather idealized situation compared to real populations. In real populations, for example, pollen dispersal may be very limited (Degen et al., 2004; Lloyd et al., 2018; Moracho et al., 2016). Under such conditions, sampling a limited number of families may influence the phenotypic spectrum of successful pollen donors, leading to either false discoveries or false-negative effects. Because the number of progeny per family appeared to be unimportant for the quality of regression parameter estimates, if a study is focused on identifying phenotypic determinants of male reproductive success, it is advisable to sample as many mothers as possible even at the cost of a low progeny number per tree, as in the yew case study. Such a sampling scheme is in line with earlier suggestions (Chybicki et al., 2019; Koelling et al., 2012). On the other hand, if individual variation in selfing or pollen migration rates is of interest, one should seek a compromise between the number of offspring per family and the total number of families. It seems that 10–15 mother plants and about 20 seeds per mother would satisfy both goals.
Comparison of the hierarchical neighborhood model approach and the two-step approach showed that the relatively high estimation efficiency of the former method results from the integration of genetic and nongenetic data to estimate fecundities. The strategy of combining different data types to increase the power of a statistical procedure is well known in molecular ecology (Chybicki et al., 2019; Gaggiotti et al., 2004; Guillot et al., 2012; Okuyama & Bolker, 2005). Also, the classic neighborhood model can be thought of as another example of a combined data approach. Therefore, the hierarchical neighborhood model is not truly novel in this context. However, the proposed approach adjusts the existing model to account for the presence of excessive variation in individual fecundity that goes significantly beyond the fixed-effects model. It should be stressed that the success of the hierarchical method relies on properly selected covariates of reproductive success. In plants, male fecundity is expected to correlate with flowering intensity (Jong & Klinkhamer, 2005). The abundance of male flowers (or equivalents) translates directly into an abundance of produced pollen. In addition, in animal-pollinated plants, the flower abundance attracts pollinators and increases pollen transfer (Glaettli & Barrett, 2008). Also, floral characters (floral display, floral colour, nectar and scent) may be important for male success because they influence visitation frequency and foraging time of pollinators, and indirectly pollen dispersal between plants (Abraham, 2005; Huang et al., 2006; Yan et al., 2016). However, flower traits are often difficult to quantify, especially if there are many candidate parents within a study plot or if a study species is characterized by large dimensions, as are many trees. 
Fortunately, flower abundance usually covaries with individual size, making size-related phenotypic variables a good choice for covariates of reproductive success (Avanzi et al., 2020; Younginger et al., 2017). In trees, trunk diameter is the easiest and the most accurately measurable predictor of this kind. Therefore, it should be, and usually is, a variable of the first choice if flowering observations are not available. The analysed examples of spruce and yew data illustrate how important the selection of phenotypic variables is. In the spruce case study, explanatory variables included traits that have a high a priori likelihood of being predictors of male fecundity, and two out of three variables appeared to have a significant effect. As a result, the estimated regression model explained about 50% of the total variation in fecundity. In the yew case study, on the contrary, only trunk diameter was included. Although it appeared to be significantly related to male reproductive success, variation in trunk diameter explained only 14% of the total variation in male fecundity. Interestingly, in both cases, the level of overdispersion was still relatively high, suggesting that male fecundity is shaped by additional unmeasured factors, such as flowering time, genetic compatibility between plants as well as inbreeding and outbreeding depression that may lead to early-stage mortality of seeds and, consequently, to an unintended ascertainment bias in the sample of progeny.
The relevant question remains whether the classic neighborhood model approach is still useful. Clearly, the hierarchical neighborhood model offers better estimates of phenotypic effects than the classic neighborhood model. On the other hand, the classic model is more straightforward and allows the maximum-likelihood procedure to be used for parameter estimation. Therefore, the classic approach is more time-efficient than the hierarchical approach, which relies on the MCMC algorithm. This appears to play an important role in the validation of the method using computer simulations. Also, due to higher sensitivity, the classic method allows detecting effects based on a small effective sample of progeny. It can be important in the case of species characterized by high pollen dispersal capacity that makes the effective progeny number much lower than the census number of progeny in a sample. However, since the classic method can detect false-positive effects in the face of overdispersion related to omitted factors, the results should always be considered with caution, especially when they point at counterfactual or hard to explain relationships. More importantly, the classic neighborhood model remains useful for assessing the ongoing gene flow (i.e. dispersal kernel, immigration). In this case, in order to yield unbiased gene flow parameters (Burczyk & Chybicki, 2004), the phenotypic variables need to be added to account for the actual, often size-dependent, distribution of pollen productivity (analogously to the seed shadow model; Clark et al., 1999), especially when plant sizes exhibit large variation.
Our simulations, as well as case studies, suggested that the approach relying on separate estimation of male fecundity and the effects of phenotypic variables requires a substantial number of paternity assignments in order to eventually provide accurate estimates of regression slopes. Therefore, it is recommended to sample a large number of progeny (as in Klein et al., 2008 or Chybicki & Burczyk, 2013) to saturate the mating model with informative data. Otherwise, fecundity estimates may tend to reflect priors, and the estimates of regression slopes may become severely biased or even meaningless. Interestingly, the dependency of fecundity estimates on priors can also be noted in the hierarchical neighborhood model approach, as clearly shown in the spruce case study. Inspection of Figure 1d reveals that low fecundity estimates tended to largely reflect their prior distributions (i.e. the estimated regression model). However, because the priors are saturated with phenotypic data, the fecundities are estimated accurately, provided that the underlying regression model is identified correctly. Nonetheless, although the two-step approach has a high sampling demand, it still offers some advantages. While the regression model used in the hierarchical approach is relatively simple, the actual relationship between phenotypic variables and male reproductive success may follow more complex patterns that need to be accounted for. Inequality of variance over the range of individual fecundities seems to be an important statistical phenomenon not implemented in the hierarchical neighborhood model. Although heteroscedasticity does not introduce any bias to slope coefficients (White, 1980), it influences the variances of the estimates. Consequently, the statistical power of variable selection may be reduced (Cleasby & Nakagawa, 2011). With the two-step approach, advanced techniques of regression analysis can be easily implemented to overcome this problem. 
For this reason, the hierarchical neighborhood model approach and the two-step approach can be considered as complementary methods, keeping in mind significant differences in data demand.
In our simulations and the case studies, we considered a somewhat limited number of potential determinants of reproductive success, so we observed no problems with identifying the best model, suggesting that the designed RJMCMC algorithm efficiently explored the model space. However, with a large number (e.g. dozens) of explanatory variables, it may be advisable to restrict the model space to the most promising variables, which can be preselected in an initial run as components of the median probability model, that is, the variables for which the inclusion probability is ≥0.5 (Barbieri & Berger, 2004), as suggested in our earlier work (Chybicki et al., 2019).
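The growth of the model space and the preselection rule can be sketched in a few lines. The example assumes that each variable is either included or excluded, so that k variables give 2^k models. The inclusion probabilities for tree height and male strobili abundance are those reported for spruce, whereas the value for crown volume is hypothetical because it is not reported in the text; the function names are also hypothetical.

```python
def n_models(n_variables: int) -> int:
    """Number of regression models when each variable is either in or out."""
    return 2 ** n_variables

def median_probability_model(inclusion_prob: dict, threshold: float = 0.5) -> list:
    """Keep variables whose posterior inclusion probability is >= threshold."""
    return sorted(v for v, p in inclusion_prob.items() if p >= threshold)

# 0.903 and 0.655 are reported for spruce; 0.05 for crown volume is HYPOTHETICAL
spruce = {"height": 0.903, "strobili": 0.655, "crown": 0.05}

print(n_models(3), n_models(30))
print(median_probability_model(spruce))
```

With three variables the model space holds eight models, but with thirty it already exceeds a billion, which is why preselecting variables in an initial run can make the search tractable.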
Our case studies provided a good illustration of the applicability of the hierarchical neighborhood model to real data. Here, we showed that male reproductive success in spruce tends to increase with both tree height and male strobili abundance, while it is not related to crown volume. Interestingly, tree height appeared to be the most important factor associated with fecundity. The slope close to unity indicates that a unit increase in tree height (1 unit = 1 standard deviation of the trait) gives approximately a unit increase in male relative log-fecundity. For comparison, the effect of male strobili abundance was only half of that of tree height. The relatively higher importance of tree height for male fecundity is somewhat unexpected given that male strobili abundance is directly correlated with the amount of produced pollen. This finding may reflect differences in measurement accuracy between the two traits. On the other hand, taller wind-pollinated trees are expected to shed pollen more effectively than shorter ones due to better exposure to wind (Di-Giovanni & Kevan, 1991; Tackenberg, 2003). The empirical evidence from the study of mating patterns in Pinus attenuata (Burczyk et al., 1996) provides good support for this expectation. Hence, our results are in line with earlier studies (see Petit & Hampe, 2006) showing that, in wind-pollinated plants, relative male reproductive success depends on both the relative amount of produced pollen and the relative ability to spread pollen.
The yew case study confirmed our previous finding (Chybicki & Oleksa, 2018), based on the classic neighborhood model, that male trees with greater trunk diameter tend to have higher reproductive success. However, with the new method, unlike the classic approach, we were able to show how individual trees deviate from the general pattern. The dispersion of fecundities around the regression line (Figure 4d) suggests that additional factors must shape the observed male fecundity. A future study could focus on explaining the presence of several male trees with fecundity above the predictions of the regression model, primarily because such a high variation in reproductive success can be related to ongoing adaptation (Gerzabek et al., 2017). Given that the study population is close to the margin of the natural species distribution, the identification of determinants of the excessive variation in male fecundity could be important both in the context of gene pool conservation and for the understanding of mechanisms behind natural selection on male strobili traits (Mayol et al., 2020).
The application of the hierarchical neighborhood model approach to real data also showed that ignoring nonrandom pollen dispersal may lead to false-negative results, especially when the magnitude of a variable effect is moderate-to-low. Only strong effects seem to remain relatively robust to the assumption about random vs. nonrandom dispersal. We should underscore that the observation is not new, as mentioned in the introduction. However, the analyzed case studies provided an excellent opportunity to emphasize the consequences of ignoring pollen dispersal in studies of determinants of reproductive success. It must be said that the problem is expected when the observed parentage counts are regressed on measured phenotypes because such counts are a function of both pollen dispersal and fecundity, while only fecundity is accounted for in such analyses. Consequently, future studies should rely on well-tested approaches, such as the hierarchical neighborhood model developed in this study.
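A deliberately contrived toy calculation shows how ignoring dispersal can hide a real effect. It assumes a single mother tree, an exponential dispersal kernel and expected (noise-free) paternity shares proportional to fecundity times the kernel value; all numbers are hypothetical and were chosen so that larger fathers stand farther from the mother.

```python
import numpy as np

# Standardized trait of 11 hypothetical candidate fathers.
z = np.linspace(-1.0, 1.0, 11)

# Assumed true effect of the trait on log-fecundity.
beta_true = 0.5
log_fecundity = beta_true * z

# Assumed layout: larger fathers are farther away (10 to 90 m).
distance = 50.0 + 40.0 * z
scale = 80.0                      # assumed kernel scale in metres
log_kernel = -distance / scale    # exponential kernel, up to a constant

# Expected paternity share is proportional to fecundity * kernel.
log_weight = log_fecundity + log_kernel
share = np.exp(log_weight) / np.exp(log_weight).sum()

# Naive analysis: regress log share on the trait, ignoring distance.
naive_slope = np.polyfit(z, np.log(share), 1)[0]

# Dispersal-aware analysis: remove the kernel contribution first.
adjusted_slope = np.polyfit(z, np.log(share) - log_kernel, 1)[0]

print("naive slope:", naive_slope)
print("adjusted slope:", adjusted_slope)
```

In this layout the distance penalty exactly cancels the fecundity advantage, so the naive slope is zero up to rounding while the dispersal-aware slope recovers the assumed value of 0.5. Real data will not cancel so neatly, but the direction of the distortion is the same.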
5 CONCLUSIONS
Here, we showed that explicitly accounting for overdispersion in male fecundity improves the neighborhood model's performance, significantly reducing the frequency of false discovery of the effects of phenotypic variables. Another important refinement, as compared with the classic neighborhood model, is the ability of the hierarchical approach to quantify the proportion of the total variation in fecundity that is explained by the estimated regression model. Using our approach, it is now easy to assess whether additional factors may play a role in determining reproductive success. We believe this feature may stimulate future studies on mating patterns as well as on sexual selection in plants (Lankinen & Green, 2015).
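For readers who want a concrete picture of such a proportion, the sketch below uses one common definition of explained variance for a linear model on the log-fecundity scale: the variance of the linear predictor divided by the sum of that variance and the residual variance. This definition is an assumption made for illustration and may differ from the exact estimator implemented in NM2F; the trait values, slope and residual variance are hypothetical.

```python
import numpy as np

def explained_variance_fraction(X, beta, sigma2):
    """R2 = Var(X beta) / (Var(X beta) + sigma2) on the log scale."""
    linear_predictor = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    var_fit = linear_predictor.var()  # variance over candidate fathers
    return var_fit / (var_fit + sigma2)

# Hypothetical standardized trait for three fathers, slope of 1,
# and residual (unexplained) variance of 2.
X = np.array([[-1.0], [0.0], [1.0]])
r2 = explained_variance_fraction(X, beta=[1.0], sigma2=2.0)
print(r2)
```

With these values the regression accounts for a quarter of the variance in log-fecundity, leaving the remainder to unmeasured factors.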
ACKNOWLEDGEMENTS
The study was supported by the National Science Centre, Poland (the grant UMO-2018/31/B/NZ8/01808 to IJC) and Poznań University of Life Sciences (MD).
AUTHOR CONTRIBUTIONS
I.J.C. developed the statistical approach and ran simulations. A.O. and M.D. performed sampling, made ecological measurements and genotyped the case study populations. I.J.C. analysed the empirical data and wrote the first draft. All the authors contributed to the final version of the manuscript.
Open Research
DATA AVAILABILITY STATEMENT
- Sampling locations, morphological data and microsatellite genotypes of Picea abies and Taxus baccata: The Dryad repository (10.5061/dryad.51c59zw72).
- The Windows computer program NM2F implementing the hierarchical neighborhood model is available at https://www.ukw.edu.pl/pracownicy/plik/igor_chybicki/1806/.