Volume 30, Issue 8 pp. 1450-1477
Target Review

Interpreting the genomic landscape of speciation: a road map for finding barriers to gene flow

M. Ravinet

Corresponding Author

Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway

National Institute of Genetics, Mishima, Shizuoka, Japan

Correspondence: Mark Ravinet, CEES, Department of Biosciences, University of Oslo, P.O. Box 1066 Blindern, NO-0316 Oslo, Norway.

Tel.: +47 22 85 50 65; fax: +47 22 85 40 01; e-mail: [email protected]

R. Faria

CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO, Laboratório Associado, Universidade do Porto, Vairão, Portugal

Department of Experimental and Health Sciences, IBE, Institute of Evolutionary Biology (CSIC-UPF), Pompeu Fabra University, Barcelona, Spain

Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK

R. K. Butlin

Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK

Department of Marine Sciences, Centre for Marine Evolutionary Biology, University of Gothenburg, Gothenburg, Sweden

J. Galindo

Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain

N. Bierne

CNRS, Université Montpellier, ISEM, Station Marine Sète, France

M. Rafajlović

Department of Physics, University of Gothenburg, Gothenburg, Sweden

M. A. F. Noor

Biology Department, Duke University, Durham, NC, USA

B. Mehlig

Department of Physics, University of Gothenburg, Gothenburg, Sweden

A. M. Westram

Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK

First published: 08 August 2017
Citations: 350
Data deposited at Dryad: doi: 10.5061/dryad.qj5cr.

Abstract

Speciation, the evolution of reproductive isolation among populations, is continuous, complex, and involves multiple, interacting barriers. Until it is complete, the effects of this process vary along the genome and can lead to a heterogeneous genomic landscape with peaks and troughs of differentiation and divergence. When gene flow occurs during speciation, barriers restricting gene flow locally in the genome lead to patterns of heterogeneity. However, genomic heterogeneity can also be produced or modified by variation in factors such as background selection and selective sweeps, recombination and mutation rate variation, and heterogeneous gene density. Extracting the effects of gene flow, divergent selection and reproductive isolation from such modifying factors presents a major challenge to speciation genomics. We argue one of the principal aims of the field is to identify the barrier loci involved in limiting gene flow. We first summarize the expected signatures of selection at barrier loci, at the genomic regions linked to them and across the entire genome. We then discuss the modifying factors that complicate the interpretation of the observed genomic landscape. Finally, we end with a road map for future speciation research: a proposal for how to account for these modifying factors and to progress towards understanding the nature of barrier loci. Despite the difficulties of interpreting empirical data, we argue that the availability of promising technical and analytical methods will shed further light on the important roles that gene flow and divergent selection have in shaping the genomic landscape of speciation.

Introduction

Speciation is the evolution of reproductive isolation between populations. This process is often continuous and complex, involving the evolution of multiple, interacting reproductive barriers among populations that do not necessarily affect patterns of variation across the whole genome at once. Since Darwin first discussed the concept of speciation, huge progress has been made in identifying the main reproductive barriers at the phenotypic level for a large number of taxa (Coyne & Orr, 2004). However, our understanding of the genetic basis of these barriers, and the genomic patterns associated with their evolution, has remained limited until recently. Over the last decade, advances in sequencing technology have offered an unprecedented opportunity to overcome this hurdle and to investigate the genetic architecture of reproductive isolation across the entire genome and across the speciation continuum (Seehausen et al., 2014; Wolf & Ellegren, 2016). However, our understanding of the links between patterns of genomic differentiation/divergence (defined in Box 1), phenotypes and reproductive isolation is incomplete. In this review, we highlight the potential and the challenges of using genomic data, alongside other sources of evidence, to understand the evolutionary processes that shape the ‘genomic landscape’ of differentiation and speciation, and to identify barriers to gene flow.

Box 1. Clearer definitions

A wealth of technical terms, often without clear definition, makes any attempt to understand the literature on speciation genomics a daunting task (Harrison, 2012). In this review, we argue for the importance of identifying barrier loci, positions in the genome that contribute to barriers to gene flow between populations. These include loci under divergent ecological selection, but also loci involved in other barriers, for example mate choice, or intrinsic post-zygotic isolation, that could be neutral within populations. When a locally beneficial allele at a divergently selected locus arises (adaptive in one population but disfavoured in the other – i.e. a fitness trade-off), divergent positive selection may cause it to increase in frequency, resulting in a local selective sweep. The barrier effect is a reduction of effective migration rate relative to actual migration between populations that occurs at the barrier locus (i.e. the direct effect) but can also extend beyond it (i.e. the indirect effect). In surrounding genomic regions, the barrier effect will initially allow a build-up of genetic differentiation, that is a difference in allele frequency, between populations, typically documented using a relative measure such as FST. Over time, the barrier effect will allow neutral mutations to establish, resulting in genetic divergence between populations, typically measured using dXY. Genome scans, comparisons between populations or species at multiple loci across the genome (typically thousands of loci), can quantify the genomic landscape of differentiation and divergence when placed on a physical or genetic map (although heterogeneity and differentiation/divergence can still be studied without a map). These are used to identify outliers, that is loci or regions that fall outside the expected neutral distribution of differentiation or divergence, which may be influenced by barrier effects. 
Not all locally adaptive alleles are affected by divergent selection and generate a barrier to gene flow; some may be beneficial in one population and neutral in the other. These will increase in frequency more rapidly where they are adaptive, but may then spread through the other population too as they are not selected against. Finally, globally beneficial alleles will increase in frequency due to positive (but crucially, not divergent) selection and spread among populations connected by gene flow. Both globally and locally adaptive alleles may undergo hard sweeps, where the adaptive allele is present on a single genetic background when selection starts (e.g. due to novel mutation), or soft sweeps, where the selected allele is present on multiple genetic backgrounds (e.g. due to standing genetic variation).
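
To make the distinction between relative differentiation and absolute divergence in Box 1 concrete, the following is a minimal sketch rather than a production estimator. It assumes biallelic sites with population allele frequencies known exactly (no sample-size correction), uses a Hudson-style ratio for FST, and averages dXY only over the sites supplied, whereas a per-base-pair estimate would also count invariant sites. The function name `dxy_and_fst` is hypothetical.

```python
from typing import Sequence, Tuple


def dxy_and_fst(p1: Sequence[float], p2: Sequence[float]) -> Tuple[float, float]:
    """Return (dXY, FST) from allele frequencies at biallelic sites.

    p1[i] and p2[i] are the frequencies of the same allele at site i in
    populations 1 and 2. Frequencies are treated as known exactly.
    """
    if len(p1) != len(p2) or len(p1) == 0:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    n_sites = len(p1)

    # Absolute divergence: probability that two alleles, one drawn from
    # each population, differ (averaged over the supplied sites).
    dxy = sum(a * (1 - b) + b * (1 - a) for a, b in zip(p1, p2)) / n_sites

    # Mean within-population diversity: the average of 2p(1-p) over the
    # two populations simplifies to a(1-a) + b(1-b) per site.
    pi_within = sum(a * (1 - a) + b * (1 - b) for a, b in zip(p1, p2)) / n_sites

    # Relative differentiation: the share of between-population diversity
    # not explained by within-population diversity.
    fst = 0.0 if dxy == 0 else (dxy - pi_within) / dxy
    return dxy, fst


if __name__ == "__main__":
    # One strongly differentiated site and one undifferentiated site.
    print(dxy_and_fst([0.9, 0.5], [0.1, 0.5]))
```

Because FST is scaled by within-population diversity whereas dXY is not, the two measures can respond differently to the same underlying process.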

Recent attempts to identify loci involved in reproductive isolation, that is barrier loci (see Section 2 and Box 1), from high-density genetic data have largely centred on bottom-up genome-scan approaches (sensu Barrett & Hoekstra, 2011). Regions of high genomic differentiation (‘outlier loci’, typically measured using FST) are often assumed to have arisen due to reproductive barriers, while homogenizing gene flow decreases differentiation elsewhere in the genome. In agreement with classic hybrid zone research (Barton & Bengtsson, 1986; Harrison, 1990; Vines et al., 2003), initial genome scans revealed compelling evidence of genomewide heterogeneity in differentiation between populations, ecotypes and species (Nosil et al., 2009). Whereas early genome scans had limited resolution, and the genomic distribution of the loci under divergent selection was mostly unknown, cheaper genome sequencing and more streamlined genome assembly pipelines are now overcoming these initial limitations. As a result, accumulating genomic data have started to reveal patterns of heterogeneity in a wide variety of nonmodel organisms at different stages of divergence (Table 1).

Table 1. Examples of systems where evidence of heterogeneous genomic differentiation or divergence has been identified using a genome-scan approach. Note that in many cases, additional evidence has also been used to demonstrate the role of selection and this table is not intended to be an exhaustive summary
| Study system | NGS approach | Genome-scan approach | Main findings | Reference |
| --- | --- | --- | --- | --- |
| European rabbit subspecies: Oryctolagus cuniculus cuniculus & O. c. algirus | Target sequencing, RNA-seq | Sliding window estimates of FST, dXY, RND, number of fixed differences and ratio of fixed differences to shared polymorphism | Low-to-moderate genomewide mean FST; numerous regions of high differentiation. Overrepresentation on X chromosome and centromeres. Sweeps do not account for majority of differentiation peaks. | Carneiro et al. (2014) |
| Fruit fly subspecies: Drosophila pseudoobscura pseudoobscura & D. persimilis | Whole-genome sequencing | Sliding window estimates of dXY and nucleotide diversity | High nucleotide diversity and divergence in inversions compared to collinear regions due to reduced recombination | McGaugh & Noor (2012) |
| Sunflowers: Helianthus annuus & H. petiolaris | RNA-seq | Sliding window and spatial autocorrelation statistics based on FST | Lower overall differentiation in sympatry; number and size of genomic islands did not differ with geography; strong negative correlation with recombination regardless of spatial context | Renaut et al. (2013) |
| Marine and freshwater three-spined stickleback ecotypes: Gasterosteus aculeatus | Whole-genome sequencing | Sliding window estimates of FST, nucleotide diversity and Hidden Markov model detection of outlier regions | Multiple genomic regions of high differentiation across genome. Evidence of parallel reuse of standing variation in different populations. | Jones et al. (2012) |
| Walking stick ecotypes: Timema cristinae | Whole-genome sequencing | Point estimates of FST at SNP positions, HMM detection of outlier regions | Mean FST greater between geographically separated populations compared to adjacent; 8–30% genome highly differentiated; most regions not shared among population pairs. | Soria-Carrasco et al. (2014) |
| Hawthorn and apple maggot fly ecotypes: Rhagoletis pomonella | Microsatellites and allozymes | Point estimates of FST | Two genomic outlier regions on separate chromosomes, suggesting support for genomic island and continent hypotheses. | Michel et al. (2010) |
| Normal benthic and dwarf limnetic whitefish ecotypes: Coregonus clupeaformis | RAD sequencing | Sliding window estimates of FST, barrier strength (m/me) and extended haplotype homozygosity | Positive correlation of mean and variance of FST, outlier region size and LD with morphological differentiation. Island size influenced by LD, selection strength and demography. Incomplete parallelism of outliers. | Gagnaire et al. (2013) |
| Annual and perennial yellow monkeyflower ecotypes: Mimulus guttatus | Genotyping-by-sequencing | Sliding window estimates of FCT (from hierarchical AMOVA to account for population level variation), nucleotide diversity and divergence (dXY) | Outlier regions distributed across genome, but enriched in an inversion with barrier loci. Collinear regions probably homogenized by gene flow. | Twyford & Friedman (2015) |
| M and S mosquito forms/species: Anopheles gambiae | SNP-genotyping array | Sliding window estimates of nucleotide diversity and FST | Regions of high differentiation at centromeres, low polymorphism and recent sweeps close to centromeres | Neafsey et al. (2010) |
| Neotropical butterfly species: Heliconius melpomene, H. cydno & H. timareta | Whole-genome sequencing | Sliding window estimates of FST and four population tests for gene flow | Lower FST between sympatric species compared to allopatric, higher differentiation and lower gene flow on Z chromosomes and at loci underlying divergent wing patterns | Martin et al. (2013) |
| Flycatchers: Ficedula sp. | Whole-genome sequencing | Nonoverlapping window estimates of FST, dXY and paired branch statistics | Strong correlations between FST among independent species comparisons and with recombination rate suggest heterogeneity caused by background selection | Burri et al. (2015) |
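
Most of the studies in Table 1 summarize per-site statistics in sliding or nonoverlapping windows along a chromosome. The sketch below shows one simple way this could be done, assuming 0-based physical coordinates and a plain mean of per-site values within each window. Many pipelines instead average the numerator and denominator of FST separately, so this is an illustration of the windowing step only. The name `windowed_mean` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def windowed_mean(
    positions: Sequence[int],
    values: Sequence[float],
    window: int,
    step: int,
) -> List[Tuple[int, int, float, int]]:
    """Average a per-site statistic in windows [start, start + window).

    Returns one tuple (start, end, mean, n_sites) per non-empty window.
    Setting step == window gives nonoverlapping windows; step < window
    gives overlapping (sliding) windows.
    """
    if len(positions) != len(values):
        raise ValueError("positions and values must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    sites = sorted(zip(positions, values))
    if not sites:
        return []

    results: List[Tuple[int, int, float, int]] = []
    last_position = sites[-1][0]
    start = 0
    while start <= last_position:
        end = start + window
        in_window = [v for pos, v in sites if start <= pos < end]
        if in_window:  # windows without any sites are skipped
            results.append((start, end, sum(in_window) / len(in_window), len(in_window)))
        start += step
    return results


if __name__ == "__main__":
    pos = [100, 200, 1500, 2500]
    fst = [0.1, 0.3, 0.8, 0.2]
    for row in windowed_mean(pos, fst, window=1000, step=1000):
        print(row)
```

The choice of window size trades resolution against noise: small windows contain few sites and give noisy means, whereas large windows can average away narrow peaks.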

Despite progress in documenting patterns, interpreting the peaks and troughs of differentiation in genome-scan data has not been as straightforward as initially assumed (Fig. 1). This has caused problems for researchers hoping to use genome scans to identify signatures of local adaptation (Hoban et al., 2016) and barriers to gene flow during speciation (Noor & Bennett, 2009; Cruickshank & Hahn, 2014). There are several reasons for these difficulties. Firstly, peaks of high differentiation are produced in diverging populations without gene flow as a result of background selection and selective sweeps after isolation (Charlesworth et al., 1993; Noor & Bennett, 2009; Cruickshank & Hahn, 2014; Burri et al., 2015). Although some of these peaks may indicate loci that become barrier loci after contact, many other peaks do not. Instead, they may reflect sweeps of universally or transiently adaptive alleles or even differentiation by drift. Therefore, the effects of barrier loci can be clearly identified only when they have actually recently acted to prevent gene flow in nature (Harrison & Larson, 2016; Marques et al., 2016; McGee et al., 2016). Nevertheless, allopatric divergence remains important for understanding genomewide heterogeneity in the absence of gene flow, for example using wild or laboratory-based crosses to identify barrier loci. Tests for ongoing or recent gene flow are a crucial prerequisite for the identification of barrier loci from genome scans. Secondly, patterns of FST (or other differentiation and divergence measures) are influenced by multiple factors that vary across the genome, including mutation, demographic history, genetic drift, selection, gene flow, recombination, gene density and genome architecture (Figs 1 and 2); and some of these factors are expected to change during different stages of speciation. 
From the speciation perspective, the principal objective is to infer the number, distribution and strength of barriers to gene flow, as well as their influence on other genomic regions. However, extracting this signal from genome-scan data in the presence of so many other processes remains challenging.

Fig. 1. Different demographic histories, genome features and processes can produce apparently equivalent landscapes of differentiation. During primary divergence, barrier loci and their barrier effects increase relative differentiation (shown here as FST). Regions of reduced recombination may also give rise to similar peaks because background selection against deleterious mutations reduces within-population diversity – although the effects of this in the face of gene flow are not well understood. Under a secondary contact scenario, differentiation first builds up over time due to drift during a period of geographical isolation. Globally adaptive alleles may also sweep to fixation in one population but not the other during this isolation, potentially resulting in transient peaks of differentiation upon contact. Eventually gene flow between populations erodes differentiation outside barrier loci and regions with increased background selection.
Fig. 2. Features contributing to the genomic landscape. External processes (blue boxes) such as divergent selection can generate peaks of relative differentiation (shown here as FST). Simultaneously, the genome background is homogenized by gene flow. Background selection in regions of reduced recombination rate and shared ancestral polymorphism can also generate this pattern – although in reality the genomic landscape is likely fashioned by a combination of both. Genome features (shown in red boxes) also influence the efficacy and extent of external processes as well as interacting with one another. Finally, the effects and extent of both external processes and inherent properties of the genome are shaped by the influence of demographic and evolutionary history.

Starting with the premise that identifying barrier loci is a major objective of speciation research, our aim with this target review is to clarify what we can expect to learn from population genomic data and genome scans, specifically in examples of speciation involving periods with gene flow. Our purpose is not to provide an exhaustive review of the estimation methods available to researchers (see Wolf & Ellegren, 2016 for an excellent summary) but rather to take a broader perspective on the genomic landscape and the processes that shape it. We start by describing the expected patterns of local and genomewide differentiation generated by barrier loci in idealized scenarios. We then consider how these patterns might be modified by a series of complicating factors, primarily demographic history and nonuniformity of the genome with respect to mutation, recombination and background selection (Fig. 2). These may obscure real signatures of divergent selection and gene flow or create spurious patterns that are false positives (Box 2). We argue that it is essential to account for these factors in order to identify features of the genomic landscape related to barrier effects and so critical for the speciation process. We end with a road map suggesting ways in which to put inferences from the genomic landscape into context by combining them with other sources of evidence, such as experiments or measures of selection and ancestry that move beyond allele frequency differences, to gain further insight into the speciation process.

Box 2. Searching for islands in a sea of metaphors

Genomic differentiation may be heterogeneous during much of the speciation process (Nosil, 2012; Table 1). Under the genic view of speciation, the genome is porous to gene flow while reproductive isolation is incomplete (Wu, 2001; Wu & Ting, 2004). A large number of genome scans have identified distinct genomic regions (‘islands’) of greater differentiation than the putatively neutral genomic background (‘sea level’) that tends towards homogenization by gene flow (Nosil et al., 2009). First described as ‘genomic islands of speciation’ in Anopheles mosquitoes, these regions were assumed to harbour loci underlying reproductive isolation (Turner et al., 2005). The genomic island metaphor has proven popular and has undoubtedly been valuable for driving empirical progress; a wide array of studies searching for ‘speciation islands’ in multiple taxa have been published in the last decade.

Other terms have also been coined to describe genomic heterogeneity. These may not explicitly invoke speciation, for example ‘genomic islands of differentiation’ (Harr, 2006) or ‘genomic islands of divergence’ (Nosil et al., 2009). Large-differentiation regions, potentially containing multiple speciation genes, have been referred to as ‘continents of divergence’ (Michel et al., 2010; Egan et al., 2015). These metaphors have led to conceptual frameworks, such as Feder et al.'s (2012a) four-phase model, which incorporate processes such as divergence and genome hitchhiking (see main text) to explain how differentiation across the genome evolves. Although the metaphors have proven useful for describing observed patterns and communicating a complex concept to a wider audience, introducing attractive terminology runs the risk of encouraging ambiguity (Harrison, 2012). For example, differentiation is more likely to vary continuously during speciation rather than showing clearly defined ‘islands’ or ‘continents’. Metaphors, although informative and inspiring, can also potentially lead to arbitrary and unproductive discussions among researchers on how to define them: what level of differentiation defines an island and when or at what length does an ‘island’ become a ‘continent’?

Although genomic regions of high differentiation undoubtedly exist (Table 1), they are not necessarily caused by the interplay between gene flow and divergent selection; they may in fact be ‘incidental islands’ that emerge when gene flow is absent (Noor & Bennett, 2009; Turner & Hahn, 2010; Cruickshank & Hahn, 2014). Divergent and indirect selection (i.e. hitchhiking and background selection) can reduce within-population diversity in geographically isolated and potentially locally adapted populations, leading to high-FST regions that may not be related to speciation, whereas much of the genome remains undifferentiated due to incomplete lineage sorting. This process results in a specific genomic signature with high levels of differentiation (FST, a relative measure) and low levels of absolute divergence (dXY) at loci affected by local adaptation or background selection. In this case, divergence due to direct and indirect selection occurs in the absence of gene flow, potentially after speciation is completed or even just while local adaptation is occurring. It is necessary to rule out this alternative explanation before interpreting regions of elevated differentiation as barrier loci. For that, it is crucial to test for ongoing or recent gene flow (Box 3).
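
The ‘incidental island’ signature described above follows directly from how a relative measure is scaled. As a hedged numerical illustration, the sketch below writes FST as one minus the ratio of mean within-population diversity to dXY (one common formulation, assumed here) and uses purely hypothetical diversity values. It shows that reducing within-population diversity raises FST even when dXY is unchanged. The function name is hypothetical.

```python
def relative_differentiation(pi_within: float, dxy: float) -> float:
    """FST written as 1 - (mean within-population diversity / dXY)."""
    if dxy <= 0:
        raise ValueError("dxy must be positive")
    return 1.0 - pi_within / dxy


if __name__ == "__main__":
    dxy = 0.010  # hypothetical absolute divergence, identical in both regions

    # Genomic background: within-population diversity close to dXY.
    background = relative_differentiation(pi_within=0.008, dxy=dxy)

    # Region where linked selection has reduced within-population
    # diversity in isolation; dXY is not elevated.
    low_diversity_region = relative_differentiation(pi_within=0.002, dxy=dxy)

    print(f"background FST      = {background:.2f}")
    print(f"low-diversity FST   = {low_diversity_region:.2f}")
```

In this toy case the second region would appear as an FST peak despite having the same absolute divergence as the background, which is why both measures need to be examined together.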

Box 3. Inferring and quantifying gene flow

Barrier effects can only be detected in the presence of ongoing gene flow. Inferring gene flow, preferably in the context of evolutionary history, is an important first step for interpreting the genomic landscape of speciation. However, given the complexity of speciation history and the high probability that, in many cases, gene flow is not constant over time, this presents a major difficulty for speciation research.

Identifying recent gene flow using population clustering methods that reliably detect F1, F2 and backcross hybrids is relatively straightforward (Pritchard et al., 2000; Anderson & Thompson, 2002; Falush et al., 2003). Emphasis should be placed on identifying introgression over several generations: i.e. the presence of backcrossed individuals. Clinal analysis of allele frequencies across hybrid zones or across the genome overcomes a significant current disadvantage of clustering techniques as it allows for reliable migration estimates (Barton, 1983; Barton & Hewitt, 1985; Gompert & Buerkle, 2011). Inference of locus-specific ancestry across the genome using unadmixed reference populations can also be informative in admixed populations at hybrid zones (Hoggart et al., 2004; Price et al., 2009; Churchhouse & Marchini, 2013). Other evidence for recent or ongoing gene flow makes use of the biogeographical distributions of species, for example asking if genetic differentiation is lower in sympatry compared to allopatry (Noor & Bennett, 2009; Marques et al., 2016). For example, Heliconius butterfly studies show greater divergence between allopatric races than between those in sympatry or parapatry, suggesting ongoing gene flow (Nadeau et al., 2012, 2013; Martin et al., 2013). Similarly, very recently diverged populations (i.e. hundreds of generations) with documented hybridization events suggest low genomic differentiation is maintained, at least in part, by gene flow (Lescak et al., 2015; Marques et al., 2016). Finally, nongenetic evidence of migration or potential migration between populations using mark–recapture experiments (Bolnick et al., 2009), mate-choice experiments (Nosil et al., 2002; McKinnon et al., 2004) and phenotypic variation (Lescak et al., 2015) can bolster the argument that low background differentiation in a genome scan is due to ongoing gene flow.

Several key approaches incorporate demographic history, making it possible to infer both gene flow and mechanisms of divergence (Sousa & Hey, 2013). Site frequency spectrum (SFS) methods can rapidly approximate the joint allele frequency distribution between populations, allowing comparisons of divergence with and without gene flow and the estimation of migration rate (Gutenkunst et al., 2009; Excoffier et al., 2013). Isolation-with-migration (IM) models have also recently been extended to incorporate whole-genome data and overcome some simplifying assumptions such as absence of recombination (Hobolth et al., 2011a; Mailund et al., 2012). Approximate Bayesian Computation (ABC) is more computationally expensive but can incorporate thousands of loci resulting in high-precision parameter estimation (Robinson et al., 2014; Shafer et al., 2015). ABC is flexible, allowing variation in migration rates among loci to be incorporated (Roux et al., 2013, 2014), together with variation in the rate of drift (Roux et al., 2016) or the inclusion of haplotype-based statistics for estimating gene flow (Bertorelle et al., 2010; Csilléry et al., 2010). However, we note that model-based inference is limited to distinguishing among the models tested. All models are ‘wrong’ in the sense that they are simplifications, but comparisons between them may still be informative. However, wrong conclusions may be drawn if no model sufficiently close to the ‘truth’ is included in the first place. Additionally, parameter estimates are most meaningful in the context of a specific model and must therefore also be interpreted with caution.
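
The SFS-based methods mentioned above take the joint site frequency spectrum as their input summary. The sketch below shows only how such a spectrum could be tabulated from derived-allele counts in two populations; it does not perform any demographic inference and does not reproduce the interface of any published software. It assumes that the ancestral state is known at every site and that all sites have complete data. The name `joint_sfs` is hypothetical.

```python
from typing import List, Sequence


def joint_sfs(
    counts1: Sequence[int],
    counts2: Sequence[int],
    n1: int,
    n2: int,
) -> List[List[int]]:
    """Tabulate the unfolded joint site frequency spectrum.

    counts1[i] and counts2[i] are derived-allele counts at site i among
    n1 and n2 sampled chromosomes. Entry [i][j] of the result is the
    number of sites with i derived copies in population 1 and j in
    population 2.
    """
    if len(counts1) != len(counts2):
        raise ValueError("counts1 and counts2 must have equal length")

    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for c1, c2 in zip(counts1, counts2):
        if not (0 <= c1 <= n1 and 0 <= c2 <= n2):
            raise ValueError("allele count outside the sampled range")
        sfs[c1][c2] += 1
    return sfs


if __name__ == "__main__":
    # Four sites, two chromosomes sampled per population.
    spectrum = joint_sfs([0, 1, 2, 2], [1, 1, 0, 2], n1=2, n2=2)
    for row in spectrum:
        print(row)
```

Sites near the diagonal are shared at similar frequencies in both populations, whereas sites in the corners are fixed or nearly fixed differences; it is the relative weight of these classes that model-fitting methods compare against expectations with and without migration.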

Modelling approaches typically perform poorly when estimating gene flow timing (Roux et al., 2013), but this may be possible to overcome when there is sufficient biogeographical and phylogenetic information to resolve periods of contact between populations (Garrigan et al., 2012; Nadachowska-Brzyska et al., 2013). This is the rationale behind comparative statistics such as ABBA-BABA that test for an excess of derived alleles at positions across the genome (Green et al., 2010; Durand et al., 2011; Martin et al., 2014). By incorporating different taxa with known divergence times, it is possible to infer the time interval when introgression may have occurred (Martin et al., 2013; Eaton et al., 2015). Perhaps the greatest promise for accurately inferring the timing and extent of migration comes from methods which use the distribution of introgressed ‘block’ sizes in the genome (Baird et al., 2003). One such method compares the sizes of introgressed haplotypes (‘migrant tracts’) to an expected distribution under migration within T generations; however, this has little power for dating admixture that has occurred more than 1000 generations in the past (Pool & Nielsen, 2009; Liang & Nielsen, 2014). Identity-by-state tracts, that is the distance between polymorphisms on a haplotype, provide a promising means for estimating the timing and extent of gene flow, as well as other demographic parameters (Harris & Nielsen, 2013). Nonetheless, both methods require accurate phasing. An extension of the Markov coalescent approach for estimating effective population size as a function of time can now use haplotype data from multiple individuals to determine cross-coalescence rate (i.e. coalescent events within and between populations) providing accurate estimates of the timing and rate of last migration without a specified demographic model (Li & Durbin, 2011; Schiffels & Durbin, 2014).
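
The ABBA-BABA comparison mentioned above can be sketched from population allele frequencies. The example assumes a four-taxon tree (((P1, P2), P3), outgroup), that the outgroup is fixed for the ancestral allele at every site, and that the inputs are derived-allele frequencies; significance testing (for example by block jackknife) is omitted. The function name `patterson_d` is hypothetical.

```python
from typing import Sequence


def patterson_d(
    p1: Sequence[float],
    p2: Sequence[float],
    p3: Sequence[float],
) -> float:
    """Patterson's D from derived-allele frequencies in P1, P2 and P3.

    The outgroup is assumed to carry only the ancestral allele, so its
    frequency terms drop out of the ABBA and BABA expectations.
    """
    if not (len(p1) == len(p2) == len(p3)):
        raise ValueError("frequency vectors must have equal length")

    # ABBA: derived allele shared by P2 and P3 but absent from P1.
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    # BABA: derived allele shared by P1 and P3 but absent from P2.
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))

    total = abba + baba
    if total == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / total


if __name__ == "__main__":
    # Hypothetical frequencies at three sites.
    print(patterson_d([0.1, 0.0, 0.3], [0.6, 0.5, 0.3], [0.9, 1.0, 0.4]))
```

Without gene flow, incomplete lineage sorting is expected to produce the two patterns equally often, giving D near zero; a positive value under this sign convention indicates excess allele sharing between P2 and P3.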

Importantly, even if gene flow does occur, elevated divergence/differentiation alone is not sufficient to identify barrier loci; additional evidence is necessary (see Section 3: Road map). Given that ‘islands’ may not be involved in speciation at all, we suggest avoiding any terminology linking highly differentiated genomic regions to speciation unless further evidence suggests this is, in fact, the case.

Section 1: Barriers to gene flow in the genomic landscape

Barrier loci and barrier effects

We define barrier loci as positions in the genome that contribute to a reduction in effective migration rate (me) relative to the expected rate given the proportion of individuals moving between diverging populations, that is loci that contribute to a barrier to gene flow (see also Box 1). These loci may act independently or interact with one another, and the extent of interaction may vary as speciation proceeds. Barrier loci may involve single nucleotide substitutions or other types of mutation such as indels (Chan et al., 2010; Phadnis et al., 2015), or chromosomal rearrangements, for example inversions. These variants may be neutral within populations, for example genomic incompatibilities evolving via drift, or they may be under selection unrelated to the environment, for example meiotic drive (Presgraves, 2007). Barrier loci include loci under divergent selection, either ‘ecological’ (Nosil, 2012) or due to reinforcement (Butlin, 1987; Servedio & Noor, 2003). Alleles at barrier loci may be pleiotropic, affecting multiple barrier traits simultaneously, or they may influence multiple-effect traits (Servedio et al., 2011; Smadja & Butlin, 2011), in either case potentially generating a strong reduction in gene flow, that is a strong barrier effect (see Box 1). We note that in some cases, a barrier to gene flow may not necessarily require allele frequency differences at the barrier locus at all, as in one-allele models (Felsenstein, 1981; Servedio, 2000). Such barriers likely show different genomic patterns and may not be detectable in standard genome scans; as such, they are beyond the scope of this review.

Divergent selection on a barrier locus implies a fitness trade-off – that is, with two alleles at a locus and two populations, one allele has a higher fitness in the first population and a lower fitness in the second, and vice versa for the other allele. This reduces effective migration and facilitates stable differences between populations. However, in order for an allele at a barrier locus under divergent selection to spread and contribute to a barrier effect in the long term, selection locally favouring this allele must be strong enough to overcome the opposing effect of gene flow (Haldane, 1930; Slatkin, 1985, 1987). In small populations, the efficacy of selection is reduced by greater drift, and stronger selection is sometimes needed to reach a given degree of differentiation (Yeaman & Otto, 2011). The distribution of barrier locus effect sizes in a given case study is therefore likely to depend on both effective population size (Ne) and migration (m). For large populations with strong extrinsic barriers to the exchange of individuals, barrier effect sizes should vary over a wide range, whereas in small populations exchanging many migrants, only large-effect barrier loci are expected (Yeaman & Whitlock, 2011). The distribution is also expected to vary with progression towards speciation and demographic history; small-effect alleles may be more common during periods of geographical isolation than during contact. The effect-size distribution of barrier loci remains elusive because although theoretical work shows that even alleles under very weak selection may temporarily contribute to phenotypic divergence (Yeaman, 2015), loci with small fitness effect sizes are difficult to identify from empirical data. The same is true for phenotypic effect sizes; loci of large effect are detected more easily (Rockman, 2012). 
Empirical work often focuses on loci with large phenotypic and fitness effects, for example stickleback plate armour (Colosimo et al., 2005) and pelvic spine reduction (Shapiro et al., 2004; Chan et al., 2010), but the general pattern remains unclear (e.g. Seehausen et al., 2014).

At equilibrium, differentiation at a single two-allele barrier locus in a pair of hybridizing populations of constant size and with constant migration rate depends on the magnitude of the barrier effect, as well as drift. This barrier effect, in turn, is determined by the strength of divergent selection, selection against hybrids or assortment directly influencing the barrier locus. How much this level of differentiation stands out from the genomic background depends on migration m and upon the effective population sizes Ne (i.e. via drift). These parameters determine the distribution of baseline differentiation. In addition to elevating values of differentiation (FST) and divergence (dXY) at the barrier locus, the barrier effect also affects surrounding genomic regions (Charlesworth et al., 1997; and see section on loci linked to barrier nucleotides below, as well as Fig. 3), generating peaks of differentiation and divergence that can be detected as outliers in genome scans (Lewontin & Krakauer, 1973; Storz, 2005; Stephan, 2016). In many cases, independent evidence (e.g. experimental data or evidence for parallel evolution) shows that outlier loci are associated with barriers to gene flow (Table 2). However, differentiation is a continuous measure and selection coefficients are continuous as well; therefore, separating loci into two distinct classes, outliers and nonoutliers, is an oversimplification.

Figure 3. How does relative differentiation depend on distance from a barrier locus? How does it vary over time following the onset of a selective sweep during divergence with gene flow? Here, we show relative differentiation (FST) for a locus under divergent selection averaged over 5000 independent evolutionary histories under three distinct scenarios: primary divergence with a soft (a) and hard (b) sweep, and secondary contact with a hard sweep occurring during a period of isolation (c). For each panel, the heat map shows FST (see colour bar) as a function of time since the onset of selection and as a function of physical distance (shown here instead of recombination rate for illustrative purposes) from the barrier locus; the solid line shows the frequency of the allele undergoing a selective sweep in the population where it is beneficial (note that this corresponds to the allele frequency on the right-hand axis). The horizontal grey bar separates loci that are linked to the target of selection (below) from unlinked loci (above). Note that when a sweep is accompanied by gene flow (a and b), the extent of elevated differentiation is relatively small immediately after a sweep because time is required for drift and mutation to allow allele frequency differences to establish, even though effective migration rate is lowered by selection. In (c), the extent of differentiation is initially greater because the sweep occurs in isolation, but gene flow erodes this to some extent following secondary contact. Also note that recombination rate and not physical distance was used in the simulations; details of the mapping between these variables and other parameters used are explained in the Supplementary Material.
Table 2. Examples of studies where, alongside genome-scan data, additional evidence has been used to demonstrate that selection occurs at outlier loci. Here, we distinguish between studies that demonstrate a genotype–phenotype link (upper table section), which requires separate evidence of selection on the phenotype, and studies that show signatures of selection on the genotype (lower table section). We note that in some cases, for example lateral plate armour in three-spined sticklebacks, there are overlaps between these categories
| Type of evidence | Description | Caveats | Examples |
| --- | --- | --- | --- |
| QTL mapping and other mapping approaches | Identifies genomic basis of known divergent trait or hybrid incompatibility; correspondence of QTL with outliers provides strong evidence for selection | Narrowing genomic region requires large numbers of individuals and high density of markers. Potential bias towards large-effect or clustered loci | Overlap between QTL and outliers for benthic–limnetic whitefish (Rogers & Bernatchez, 2005). Allele frequency shifts at SNPs linked to QTL for skeletal morphology in lake–stream sticklebacks (Berner et al., 2014). Reduction in sperm number maps to sex chromosomes in Pacific Ocean–Japan Sea stickleback cross (Kitano et al., 2009) |
| Gene ontology analysis | Test whether outliers have functions that are expected to be divergent (based on observations of phenotypic divergence or known selection pressures) | Relatively weak evidence, limited by annotation quality | Groundsels on different soil types often have different outliers, but similar annotations (Roda et al., 2013). Flowering time genes divergent across latitudinal gradient in sunflowers (Renaut et al., 2013) |
| Molecular assay | Functional assays of gene products using in vitro methods | Usually cannot be performed using study organism | Cichlid opsin light absorbance (Terai et al., 2006). Expression of pocket mice Mc1r alleles in cultured cells (Hoekstra et al., 2006) |
| Transgenics | Insertion or deletion of alleles into different genetic background and observation of phenotype | Technically difficult for most organisms | Insertion of high-plated Eda allele into low-plated genomic background (Colosimo et al., 2005) and restoration of pelvic spine phenotype in sticklebacks (Chan et al., 2010) |
| Knockout/knockdown | Deletion, disruption or suppression of genes underlying divergence traits to demonstrate phenotypic effects | Can only demonstrate loss of function. Target fidelity is difficult to control | Knockdown of genes related to albinism in cavefish (Bilandžija et al., 2013) and doublesex gene controlling mimicry patterns in Papilio butterflies (Nishikawa et al., 2015) |
| Cline analysis | Steep clines across hybrid zones expected for loci under strong ecological selection or involved in intrinsic barriers to gene flow | Recombination, mutation rate and population demography can distort clinal data | Overlap between outlier loci and steep genomic clines in bivalve subspecies (Luttikhuizen et al., 2012). Loci with steep clines at genes known to be involved in RI (Trier et al., 2014) |
| Parallel evolution | Parallel differentiation at the same locus, genomic region or gene class across multiple geographically and phylogenetically independent species/population pairs | Parallel differentiation caused by shared genomic constraints – that is background selection and low recombination – must be ruled out. Parallel differentiation also produced by secondary contact | Same loci involved in marine–freshwater stickleback divergence across large geographical scales (Jones et al., 2012). Increased differentiation among stream populations flanking genomic regions involved in phenotypic differentiation in lake–stream sticklebacks due to propagating selective sweeps (Roesti et al., 2014) |
| Experimental crosses | Observation of segregation distortion, hybrid sterility or hybrid inviability allows for identification of intrinsic incompatibilities | Cross designs can be complicated and often only possible in model organisms – particularly when inviability is present. Cannot always identify extrinsic selection against hybrids | Crosses between Drosophila subspecies show male sterility and segregation distortion (Phadnis & Orr, 2009). Evidence of ecological incompatibilities from limnetic–benthic stickleback crosses (Arnegard et al., 2014) |
| Transplant experiments | Transplanting hybrids or individuals from divergent habitats into a maladaptive environment results in changes in allele frequency or a reduction in fitness and survival | Not feasible for some species and also difficult to discount selection on other adaptive loci | Switching stick insects between host plants (Gompert et al., 2014); transplant of marine sticklebacks with known lateral plate Eda genotype demonstrates reduced fitness and allele frequency shifts (Barrett et al., 2008) |

Even if an outlier scan correctly identifies a genomic region containing a barrier locus, narrowing the region down to the barrier locus itself may be difficult. This is partly because measures of differentiation are noisy, due to stochasticity in coalescence as well as sampling (Fig. 4), but also due to the resolution of the scan and the chromosomal scale influenced by the barrier effect: large blocks of linkage disequilibrium can occur in some species. Given the complexity and cost of dealing with whole-genome data, particularly in nonmodel organisms, the vast majority of genome-scan studies still make use of reduced-representation sequencing approaches (Davey et al., 2011; Andrews et al., 2016). In these cases, outlier markers may frequently show high differentiation because they are linked to a barrier locus, rather than being the direct target of selection. For genomic regions under selection, multiple SNPs may often show elevated differentiation (hence the island concept – see Box 2), although there may be variance among sites because of drift-related stochasticity. For this reason, differentiation in whole-genome data is usually calculated across a window spanning multiple variants rather than using single nucleotides. However, the resolution of this approach might mean differentiated regions are missed (Hoban et al., 2016).
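
The trade-off between window size and resolution can be illustrated with a minimal sketch. It assumes that per-site numerators and denominators of some FST estimator are already available; the function name, the toy data and the window sizes are hypothetical choices made for illustration and are not taken from any of the cited studies.

```python
def windowed_fst(positions, numerators, denominators, window_size):
    """Return {window_start: FST} using a ratio of sums within each window.

    positions: site coordinates in bp; numerators/denominators: per-site
    components of an FST estimator (hypothetical inputs for this sketch).
    """
    sums = {}
    for pos, num, den in zip(positions, numerators, denominators):
        # Assign each site to a non-overlapping window by integer division.
        start = (pos // window_size) * window_size
        total_num, total_den = sums.get(start, (0.0, 0.0))
        sums[start] = (total_num + num, total_den + den)
    return {
        start: (num / den if den > 0 else float("nan"))
        for start, (num, den) in sorted(sums.items())
    }


# Toy data (assumed values): 100 sites spaced 1 kb apart, with a narrow peak
# of 5 highly differentiated sites (per-site FST 0.9) on a background of 0.05.
positions = [i * 1000 for i in range(100)]
denominators = [0.4] * 100
numerators = [0.4 * (0.9 if 48 <= i < 53 else 0.05) for i in range(100)]

fine = windowed_fst(positions, numerators, denominators, window_size=5_000)
coarse = windowed_fst(positions, numerators, denominators, window_size=100_000)

print(max(fine.values()))    # the peak remains visible in 5 kb windows
print(max(coarse.values()))  # the peak is diluted in a single 100 kb window
```

In this toy case the narrow peak stands out in small windows but is averaged away in a single large window, which is the sense in which coarse windows can miss differentiated regions.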

Figure 4. Stochasticity complicates interpretation of the genomic landscape. (a) A single realization of the landscape of relative differentiation 10 000 generations since the onset of selection (blue line) follows the general expected signature of a hard sweep in primary contact derived from an average across many simulations (black line) but is variable due to stochastic effects. (b) Variation over time, for the same realization, is also apparent: peaks of differentiation distant from the target of selection arise and then disappear before the realization reaches equilibrium. As with Figure 3, parameters used to generate these results are given in the Supplementary Material.

While remaining a formidable challenge in many study systems, identifying the actual loci and substitutions responsible for barrier effects (e.g. underlying divergently selected phenotypic traits or causing hybrid incompatibility) will undoubtedly improve our understanding of the speciation process. In some cases, introgression across hybrid zones may provide the necessary precision for identifying speciation genes or at least understanding how they interact. Otherwise, the strongest evidence for the role of individual substitutions is most likely to come from experimental approaches, such as mapping studies followed by the generation of transgenic individuals (Colosimo et al., 2005; Cong et al., 2013). Importantly, the promising future for approaches such as CRISPR (Bono et al., 2015; see Section 3) may also provide information about pleiotropy, dominance and other effects that are important to understand the role of barrier loci in divergence and speciation (Storz & Wheat, 2010; Seehausen et al., 2014).

Loci linked to barrier loci

Linkage causes the genomic effects of barriers to extend beyond barrier loci, as divergent selection reduces the local effective migration rate at linked loci. At equilibrium, the effective migration rate me can be approximated as me = m/(1 + s/r) in the limit of small m, s, r (Barton & Bengtsson, 1986). For idealized populations at equilibrium, the relationship between FST and me is simple (Slatkin, 1991); therefore, the expectation is that differentiation peaks at the barrier locus and decreases with physical distance. This is one rationale for the use of reduced-representation genome scans (e.g. those based on RAD-seq). Markers may not necessarily be under selection but rather indicate the presence of barrier loci by showing elevated FST due to linkage.
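
As a minimal numerical sketch of this expectation, the approximation me = m/(1 + s/r) can be evaluated at increasing recombination distances from a barrier locus. To translate me into differentiation, the sketch assumes Wright's infinite-island relation FST ≈ 1/(1 + 4Ne·me); this is an assumption made here for illustration and is not necessarily the exact relationship intended in the text. All parameter values are hypothetical.

```python
def effective_migration(m, s, r):
    """Approximate effective migration rate at a neutral site linked to a barrier locus.

    Uses me = m / (1 + s / r), which is only valid for small m, s and r.
    """
    if r <= 0:
        raise ValueError("r must be positive; me tends to 0 as r approaches 0")
    return m / (1.0 + s / r)


def island_model_fst(ne, me):
    """Equilibrium FST under the assumed island-model relation 1 / (1 + 4 * Ne * me)."""
    return 1.0 / (1.0 + 4.0 * ne * me)


# Hypothetical parameter values chosen only for illustration.
M, S, NE = 0.001, 0.01, 1000
RECOMBINATION_RATES = [0.0001, 0.001, 0.01, 0.1]

profile = []
for r in RECOMBINATION_RATES:
    me = effective_migration(M, S, r)
    profile.append((r, me, island_model_fst(NE, me)))

for r, me, fst in profile:
    print(f"r={r:<7} me={me:.6f} FST={fst:.3f}")
```

Under these assumptions, me rises towards m and expected FST falls towards the genomic baseline as recombination with the barrier locus increases, which is the peak-shaped profile that linked markers are expected to trace.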

However, the simple relationship between me and FST only holds for the situation of equilibrium between migration, selection, mutation, and drift (Whitlock & McCauley, 1999). One important departure from equilibrium happens during and after a selective sweep (Box 1), where an adaptive allele increases in frequency. In Fig. 3, we demonstrate the development of FST from a transient state towards equilibrium at a locus under divergent selection. A locally adaptive allele experiences either a soft sweep under continuous gene flow (Fig. 3a), a hard sweep under continuous gene flow (Fig. 3b), or a hard sweep in allopatry followed by secondary contact for comparison (Fig. 3c; see Supplementary Material for more details on the simulations run and parameters used to generate these illustrations).

For a sweep at a locus under divergent selection with continuous gene flow, average differentiation is increased close to the selected locus during and immediately after the sweep due to a temporary reduction of within-population diversity. The extent of the local sweep effect depends on the strength of selection, upon the starting allele frequencies at the selected locus (i.e. whether the sweep was ‘hard’ or ‘soft’; see Box 1 and compare Fig. 3a,b), and on the time since the sweep occurred (Przeworski, 2002; Hermisson & Pennings, 2005; Pennings & Hermisson, 2006; Messer & Petrov, 2013). However, the genomic region where average FST is increased is relatively small immediately after the sweep, and grows towards equilibrium (i.e. from left to right in Fig. 3a,b). This is because the haplotype (or haplotypes) sweeping to high frequencies contain common alleles at most loci, initially generating little differentiation. Therefore, FST may initially remain low even in genomic regions where me is reduced due to linkage. However, over time, this reduced me allows for an accumulation of allele frequency differences due to both drift and new mutations. These patterns indicate that barrier loci that have undergone local sweeps in the face of gene flow may be more easily detectable when they are closer to equilibrium, because the proportion of surrounding loci showing elevated differentiation increases with time after the sweep (Fig. 3a,b). However, it is unclear how quickly equilibrium is approached (Wood & Miller, 2006; Bierne, 2010; Yeaman et al., 2016). This approach may be slow because it requires both mutation and rare recombination events, suggesting many loci in empirical studies are not at equilibrium. In a transient state, where equilibrium is not yet reached (e.g. because the adaptive mutation and increase in frequency occurred only recently, or because of recent secondary contact), the distribution of FST along the chromosome is strongly contingent on the local genomic history and is not necessarily indicative of me.

Moreover, both at equilibrium and in a transient state, observed patterns of FST may rarely correspond to theoretical expectations as they are always affected by stochasticity. In Fig. 4a, we compare the outcome of a single evolutionary history to differentiation averaged over 5000 independent histories to show how stochasticity can cause deviations from the expectation. This can lead to false positives, that is high-FST loci that do not actually indicate a barrier locus, and false negatives, that is low-FST regions that are closely linked to a selected locus (see Fig. 4 for examples of both). It is clear that during the transient state, a local hard sweep may cause multiple loci to show high differentiation, interspersed by low-FST regions (Fig. 4b). This can be explained by the fact that the haplotype the selected allele occurs on harbours both common and rare neutral alleles. These hitchhiking rare alleles will increase in frequency with the sweep, resulting in transient high-FST peaks that may be quite distant from the selected locus, especially if selection is strong and the sweep is rapid. In genome scans, such peaks could easily be mistaken for further selected loci, and distinguishing them from the actual locus under selection may be difficult; however, this effect is less likely for soft sweeps, where rare alleles are very unlikely to rise to high frequency. Over time, differentiation at distant loci will be lost due to gene flow, recombination and drift, reducing the probability of such false positives as equilibrium is approached (see Fig. 4 for the persistence of these peaks over time). Clearly, if history and stochasticity are not taken into account, genome-scan data may easily be misinterpreted.

In some cases, the contrast between FST and dXY is likely to be helpful for distinguishing between transient states and equilibrium, facilitating the correct interpretation of outlier loci (Noor & Bennett, 2009; Cruickshank & Hahn, 2014; Delmore et al., 2015; Irwin et al., 2016). Relative measures such as FST may miss the distinct effects on diversity and divergence (Charlesworth et al., 1997), and peaks of differentiation can be present for both recent local sweeps (transient) and at equilibrium (see above and Fig. 3). Measures of absolute divergence such as dXY in regions surrounding barrier loci take longer to increase via the establishment of new mutations. Recent local sweeps should be characterized by FST peaks lacking elevated dXY, whereas at equilibrium both FST and dXY are expected to be higher in the vicinity of barrier loci because of the reduction of effective migration rate (Fig. 5). Unfortunately, such distinct behaviour of FST and dXY might not apply to more complex scenarios involving secondary or intermittent contact. In addition, selection in the ancestral population before divergence occurs can lower dXY below the genomewide baseline (Cruickshank & Hahn, 2014). These scenarios need further investigation.
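
The contrast between the two measures can be made concrete with a small sketch. It assumes one common definition of relative differentiation, FST = (dXY − π_within)/dXY, computed from allele frequencies at biallelic sites without finite-sample corrections; the function names and the three toy regions are hypothetical and serve only to illustrate the argument.

```python
def site_stats(p1, p2):
    """Per-site within-population diversity and between-population divergence.

    p1, p2: frequency of the same allele in populations 1 and 2 (biallelic site).
    Finite-sample corrections are ignored in this sketch.
    """
    pi_within = 0.5 * (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2))
    d_xy = p1 * (1 - p2) + p2 * (1 - p1)
    return pi_within, d_xy


def region_fst_dxy(freqs, n_sites):
    """Return (FST, dXY) for a region of n_sites bases.

    freqs: list of (p1, p2) for the variable sites only; invariant sites
    contribute zero to both diversity and divergence.
    """
    stats = [site_stats(p1, p2) for p1, p2 in freqs]
    pi_within = sum(s[0] for s in stats) / n_sites
    d_xy = sum(s[1] for s in stats) / n_sites
    fst = (d_xy - pi_within) / d_xy if d_xy > 0 else float("nan")
    return fst, d_xy


N_SITES = 1000  # assumed region length in bp

# Toy regions (assumed allele frequencies):
background = [(0.3, 0.5)] * 10                       # shared polymorphism
recent_sweep = [(0.0, 0.5)] * 5 + [(1.0, 0.5)] * 5   # diversity lost in population 1
old_barrier = [(0.3, 0.5)] * 10 + [(1.0, 0.0)] * 5   # fixed differences accumulated

for name, region in [("background", background),
                     ("recent sweep", recent_sweep),
                     ("old barrier", old_barrier)]:
    fst, d_xy = region_fst_dxy(region, N_SITES)
    print(f"{name:<13} FST={fst:.2f} dXY={d_xy:.4f}")
```

In this toy setting the recent-sweep region shows elevated FST purely because within-population diversity is reduced, whereas the older barrier region shows elevated FST together with elevated dXY.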

Figure 5. How quickly do the expected signatures of relative differentiation (FST – solid line) and absolute divergence (dXY – dashed line) converge? Comparison of averages of these two measures at a neutral locus 500 kb from the target of selection during a hard sweep in primary contact as a function of time since the onset of selection. Averages are made over 20 000 independent evolutionary histories. Additional parameters for generating the results are identical to those in Figure 3. Note that recombination rate and not physical distance was used in the simulations; details can be found in the Supplementary Material.

The spread of barrier effects to linked neutral loci is uncontroversial. More contentious is the effect of a barrier locus on divergence of linked loci that are also under divergent selection. Some FST outlier analyses have identified loci that occur in proximity to QTL, for example at a distance of ~10 cM in pea aphids (Via & West, 2008), and divergently selected loci may cluster in the genome (Yeaman & Whitlock, 2011; Yeaman, 2013; Rafajlović et al., 2016). Moreover, in some species, highly differentiated genomic regions appear to increase in size along the speciation continuum (Feder et al., 2012a; Renaut et al., 2012). These findings suggest that further divergence might be more likely in the vicinity of existing barrier loci and that this might lead to a growth of highly differentiated genomic regions. Conceptual thinking has focused on one potential explanatory mechanism, divergence hitchhiking (Via & West, 2008; Feder et al., 2012a,b). Under this framework, reduced me around divergently selected loci may facilitate the establishment of new mutations under weak divergent selection in their vicinity, potentially resulting in clustered genetic architectures (Yeaman & Whitlock, 2011; Feder et al., 2012a; Nosil & Feder, 2012a; Via, 2012; Yeaman et al., 2016), and a subsequent increase in the size of differentiated regions (Feder et al., 2012a; Via, 2012; see also next section). However, using multilocus simulations, Feder & Nosil (2010) and Feder et al. (2012b) demonstrated that divergent selection facilitated the establishment of weakly adaptive mutations only under limited conditions: when selection is strong, Ne is small and migration is low. Furthermore, if divergence hitchhiking does occur, Hill–Robertson interference may prevent weakly adaptive alleles from establishing when they arise in habitats or genomes where they are maladaptive and they are unable to escape via recombination (Feder et al., 2012b; Yeaman, 2015). Clustering under high migration load can be facilitated by chromosomal rearrangements, and during long periods of adaptation, combinations of tightly linked loci involved in adaptation can outcompete those in looser linkage (Yeaman & Whitlock, 2011; Yeaman, 2013, 2015). Alternatively, when migration and drift are strong but selection is weak, clustering may occur because weak differentiation is better protected from loss via drift when linkage to a strongly diverged locus is tight (Rafajlović et al., 2016).

Barriers and genomewide effects

When there are only few barrier loci, their genomewide effect is usually small because most of the genome can easily recombine from one background to another. However, as speciation progresses (Coyne & Orr, 2004) and the number of barrier loci becomes large, separating their different effects becomes more difficult. Barrier loci may experience a reduction in local me both due to direct selection and due to indirect effects of linked and unlinked loci. Neutral loci throughout the genome are subject to indirect effects too, potentially resulting in a strong genomewide barrier. Barton (1983) showed that a sharp transition from independent barrier effects to such genomewide effects depends on the ratio of total selection to total recombination among loci. He called this ratio the ‘coupling coefficient’. The effect of coupling applies to all types of barriers (Kruuk et al., 1999; Bierne et al., 2011), and to situations with primary divergence with gene flow (Barton & de Cara, 2009; Abbott et al., 2013). Beyond the transition to genomewide barriers, the genomic landscape of differentiation should tend to become less structured, making barrier loci progressively more difficult to detect against increasing background differentiation. Estimating the strength of selection on individual barrier loci becomes difficult following the transition, as indirect effects increasingly contribute to their differentiation.

Selection on multiple traits, that is multifarious selection, is thought to be more likely to facilitate speciation than strong selection on a single trait (Rice & Hostert, 1993; Nosil et al., 2008; Nosil, 2013). Similarly, selection against migrants at multiple loci results in a stronger barrier to gene flow, reducing effective migration rate across the genome when overall selection is the same (Barton & Bengtsson, 1986; Feder et al., 2012b). This allows new locally adaptive mutations to establish, independent of their genomic position, even if their effect size is relatively small; it also facilitates an increase in genomewide divergence at neutral regions due to drift (Feder et al., 2012b). This process has been termed genome hitchhiking (Feder et al., 2012a) and it essentially describes the impact of multifarious divergent selection when coupling is strong. Flaxman et al. (2014) used simulations to demonstrate that statistical associations among a large number of genes combined with divergent selection can interact to drive a rapid transition from local to genomewide barrier effects. This genomewide congealing (GWC) resembles the coupling transition predicted by Barton (Flaxman et al., 2014; Tittes & Kane, 2014). During progression towards speciation in their model, numerous, weakly selected mutations occur but are unable to generate differentiation due to the effects of gene flow. Following a transition from local to genomewide barriers, however, the contribution of these mutations to reproductive isolation increases as the genomewide me is reduced below a threshold and LD increases (Tittes & Kane, 2014). Importantly, GWC does not require physical linkage or periods of allopatry that might elevate LD among loci (Tittes & Kane, 2014). Nonetheless, Flaxman et al. (2014) demonstrate that genomic features such as chromosome length or clustering of adaptive loci on specific chromosomes, as well as periods of geographical isolation, can drastically reduce the waiting time to GWC. Simulations show that both genome hitchhiking and genomewide congealing are able to occur under a wide range of parameters provided there is selection on many loci (Feder & Nosil, 2010; Nosil & Feder, 2012b). However, as with divergence hitchhiking, empirical evidence showing that genome hitchhiking allows weakly adaptive alleles to establish remains elusive.

Section 2: Other factors modifying the genomic landscape

As we have seen, even in relatively simple situations with fixed population sizes and constant migration, the genomic landscape is complicated by linkage, history, and the accumulation of barrier effects. We have yet to consider additional modifying factors such as demographic history, genomewide heterogeneity in mutation and recombination rates, background selection, and gene density (Fig. 2).

Demographic and evolutionary history

Understanding the demographic and evolutionary history of population and species pairs is necessary to generate expected patterns of genomic differentiation. Fluctuations in effective population size (Ne) can have a profound effect in this regard; for example, when Ne is small, the effect of drift is greater, whereas selection is more efficient when Ne is large (Charlesworth et al., 2003; Charlesworth, 2009; Charlesworth & Charlesworth, 2010). Pronounced changes in Ne such as bottlenecks can shift the mean and variance of baseline genomic differentiation, making it difficult to identify regions affected by divergent selection (Ferchaud & Hansen, 2016). This is also an issue for outlier detection; the assumption of simple demographic models that fail to incorporate fluctuations in effective population size may derive a null expected distribution of differentiation that does not correspond to the true distribution (Lotterhos & Whitlock, 2014; Hoban et al., 2016). Furthermore, Ne is an important parameter to estimate because, as well as determining the effectiveness of selection, it influences scaled mutation and recombination rates; for example, scaled mutation rate, 4Neμ or θ determines the rate at which adaptive mutations enter a population (Hartl & Clark, 2007; Charlesworth & Charlesworth, 2010).

We have emphasized the need to test for gene flow (see Box 3) to better appreciate the relative role of alternative processes explaining a landscape of heterogeneous genomic differentiation (see Box 2). When populations or species meet, the landscape may point to barrier loci resistant to gene flow (Harrison & Larson, 2016), but without accounting for divergence history, it is not clear whether these populations have diverged in situ or have resisted genomewide homogenization upon secondary contact between divergent lineages (Bierne et al., 2013; Feder et al., 2013; see also Fig. 1). First, with recent secondary contact, peaks of differentiation may just reflect loci that differentiated due to drift during allopatry, and have yet to be homogenized by gene flow. Such spurious outliers may obscure or hinder the detection of true barrier loci. Second, the genomic signatures of divergently selected loci may also differ between primary divergence and secondary contact (Fig. 3). With primary divergence, during and immediately after a local selective sweep, transient high-differentiation peaks will occur at large distances from the selected locus but these are eroded by recombination and migration (Fig. 4a,b). In contrast, during allopatry, this erosion does not happen, generating large regions of high differentiation, which will be maintained for some time after secondary contact. Therefore, for local sweeps of comparable age, differentiated regions will often be much larger (and therefore potentially easier to detect) following secondary contact than under primary divergence, as migration has had less time to act. Recent studies have explicitly tested for primary vs. secondary contact, allowing for a more accurate interpretation of genome-scan data; a wide array of tools is available for this sort of approach (Sousa & Hey, 2013; Wolf & Ellegren, 2016; see also Box 3). Some have provided support for primary divergence (Nosil et al., 2012; Butlin et al., 2014), whereas others indicate that secondary contact after a period of isolation best explains heterogeneous differentiation (Tine et al., 2014; Roesti et al., 2015; Martin et al., 2016; Rougemont et al., 2017).

Even sophisticated statistical frameworks for testing divergence hypotheses only consider a small proportion of the ‘universe of potential historical scenarios’ (Knowles, 2009). Divergence history varies across the genome due to nonuniformity in effective migration rate, effective population size and recombination (Maddison, 1997; Roux et al., 2014, 2016; Mallet et al., 2016). Gene-tree vs. species-tree discordance can occur because of introgression (Maddison, 1997; Knowles & Maddison, 2002; Geneva et al., 2015; Rosenzweig et al., 2016), but also because of incomplete lineage sorting (ILS) (Hobolth et al., 2011b; Dutheil & Hobolth, 2012). Described as ‘deep coalescence’ by Maddison (1997), ILS occurs when the most-recent common ancestor for a genealogy exists before speciation begins, resulting in counter-intuitive three taxa phylogenies (Scally et al., 2012) or distortions of divergence time estimates between two species (Leaché et al., 2013). ILS therefore increases the variance of genomic divergence estimates, making it difficult to identify true outliers and also potentially introducing false positives. ILS affects a greater proportion of the genome when speciation events occur close in time and the ancestral effective population size is large (Barton, 2006; Hobolth et al., 2011b). This presents an obvious challenge to studies of multiple species pairs or adaptive radiations (Mallet et al., 2016). Furthermore, stochasticity in divergence times and ILS at neutral loci can generate false signals of both genomic divergence and gene flow between species pairs (Barton, 2006; Pease & Hahn, 2013; Cruickshank & Hahn, 2014). Incorporating demographic history in tests for selection is difficult as incorrect specification of the history, potentially generated by ILS patterns, increases error rates (Lotterhos & Whitlock, 2014; Aeschbacher et al., 2016; Fraïsse et al., 2016; Hoban et al., 2016; Le Moan et al., 2016). 
Approaches that do not use demographic models may be preferable in some cases although these too are prone to bias (Hoban et al., 2016).

Speciation is undoubtedly complex, unfolding in space and time with populations overlapping, contracting and re-expanding (Butlin et al., 2008; Abbott et al., 2013; Seehausen et al., 2014). This complexity suggests that most species have probably evolved with gene flow occurring at some point in their evolutionary history (Nosil, 2008; Smadja & Butlin, 2011) and that the process cannot easily be delineated into primary vs. secondary contact or with vs. without gene flow (Bierne et al., 2013; Cruickshank & Hahn, 2014). A genic perspective on speciation predicts that divergence history will vary across the genome (Wu, 2001; Wu & Ting, 2004); therefore, the history of barrier loci might not necessarily reflect the history of populations, as Heliconius butterflies, Anopheles mosquitoes and marine–freshwater sticklebacks appear to show (Bierne et al., 2013; Mallet et al., 2016). Adaptive alleles may evolve during a period of geographical isolation but introgress between divergent lineages via hybridization and only act as barrier loci in a later phase of in situ divergence between populations (Feder et al., 2003; Bierne et al., 2011, 2013). Ancient divergence times for adaptive variants in several systems also suggest that these alleles are maintained as standing variation and over time spread between multiple populations as a result of gene flow, repeatedly becoming involved in divergence (Colosimo et al., 2005; Lamichhaney et al., 2015; Fraïsse et al., 2016). Inversions may play an important role in this regard as they have a higher probability of establishing and contributing to standing adaptive variation following secondary contact compared to primary divergence (Feder et al., 2011, 2013). 
These standing structural variants shield co-adapted alleles from the antagonistic effect of recombination and may spread among populations, facilitating further episodes of adaptation and diversification; indeed there is empirical evidence for the role of older structural variants in population divergence (Feder et al., 2003; Xie et al., 2007). Coupling between independently evolved ancient adaptive alleles and incompatibilities due to selection across environmental gradients may also drive progress towards speciation over shorter timescales (Barton & de Cara, 2009; Bierne et al., 2011, 2013; Abbott et al., 2013). As well as ancient adaptive variants, intrinsic genomic incompatibilities arising from epistatic interactions appear to segregate within species (Shuker et al., 2005; Corbett-Detig et al., 2013). Although such incompatibilities are difficult to detect, their presence suggests the possibility of widespread potential for coupling with adaptive alleles.

Mutation rate variation

In the absence of gene flow and selection, neutral diversity within and divergence between populations scales with mutation rate. In the human genome, for example, nucleotide diversity is positively correlated with de novo mutation rate, which in turn accounts for a third of sequence divergence variation between humans and chimpanzees (Francioli et al., 2015). Mutation rate variation among species, populations and individuals and the implications of this for evolutionary inference are relatively well understood (Drummond et al., 2006; Ho & Larson, 2006; Hodgkinson & Eyre-Walker, 2011). However, absolute mutation rates (i.e. the number of mutations per site and generation) are also nonuniform across the genome (Hodgkinson & Eyre-Walker, 2011; Ness et al., 2015). Mutation probability is influenced by G:C bases and neighbouring base identity (Hodgkinson & Eyre-Walker, 2011; Ness et al., 2015). Replication timing also has an effect, with longer exposure to mutagens during transcription in late replicating regions (Hodgkinson & Eyre-Walker, 2011; Francioli et al., 2015). Mutation rate is often higher on Y chromosomes than the X or autosomes because 100% of Y chromosomes occur in males, experiencing higher mutation rates due to spermatogenesis (Hodgkinson & Eyre-Walker, 2011). Despite knowledge of mechanisms causing mutation rate variation, it remains contentious whether systematic genomewide variation occurs at a scale that might bias genome scans. For example, although Ness et al. (2015) detected fine-scale heterogeneity in mutation rate, they found no clear variation among 200-kbp genome windows (Hodgkinson & Eyre-Walker, 2011), suggesting that the extent of any bias in genome scans will also differ with the scale of the analyses.

Irrespective of the scale at which it varies, mutation rate is an important population genetic parameter used to scale estimates of parameters such as effective population size (Ne) and divergence time (t) derived from genomic data. As Ne is typically estimated from θ (4Neμ – scaled mutation rate on autosomes, where μ = absolute mutation rate), assuming a uniform mutation rate will inflate estimates of Ne for mutational hot spots, obscuring the extent to which drift or selection contributes to divergence in these genomic regions (Charlesworth, 2009). Furthermore, given the importance of estimating demographic parameters for determining how and when speciation has occurred (see Demographic and evolutionary history), uniform mutation rates incorrectly applied across the genome may conceal the history of barrier loci and the speciation process (Scally & Durbin, 2012). Mutation rate variation also has implications for genomic differentiation; high mutation rate at some genomic regions may downwardly bias local measures of relative differentiation, for example FST, obscuring loci putatively under selection (Foll & Gaggiotti, 2008). Absolute divergence measures such as dXY are also subject to bias due to mutation rate variation; a low mutation rate will result in low levels of divergence, potentially giving a false impression of constraint or introgression (Geneva et al., 2015; Rosenzweig et al., 2016).
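
The inflation of Ne at mutational hot spots follows directly from θ = 4Neμ. The sketch below works through the arithmetic with hypothetical numbers (a genomewide rate of 1e-8 and a hot spot mutating five times faster); the function and variable names are illustrative assumptions only.

```python
def ne_from_theta(theta, mu):
    """Effective population size implied by theta = 4 * Ne * mu (diploid autosomes)."""
    return theta / (4.0 * mu)

# Hypothetical values, chosen only to illustrate the direction of the bias
mu_genomewide = 1e-8   # rate assumed to apply everywhere
mu_hotspot = 5e-8      # true local rate in the hot spot
true_ne = 50_000

# Diversity expected in the hot spot under the true local rate
theta_hotspot = 4 * true_ne * mu_hotspot

ne_uniform = ne_from_theta(theta_hotspot, mu_genomewide)  # uniform rate assumed
ne_local = ne_from_theta(theta_hotspot, mu_hotspot)       # local rate used

print(ne_uniform, ne_local)
```

In this example, the estimate based on a uniform rate is inflated by the same factor as the local excess in mutation rate, whereas the estimate based on the local rate recovers the true value.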

Genomewide mutation rate variation should be taken into consideration to interpret the genomic landscape accurately. To date, our understanding of intragenomic mutation rate variation remains limited and is drawn from a relatively small number of model organisms. Quantifying this heterogeneity is a major endeavour even with high throughput sequencing technologies (Ness et al., 2015). Nonetheless, there is considerable promise for incorporating mutation rate estimates into predictive models (Francioli et al., 2015; Ness et al., 2015; see Section 3: Road map).

Background selection and selective sweeps at nonbarrier loci

Advantageous mutations involved in adaptive evolution are of greatest interest in speciation research as in many cases, these generate the barrier alleles we wish to detect (see Section 2; Seehausen et al., 2014). However, they are rare; most non-neutral de novo mutations are likely to be deleterious (Ohta, 1992; Eyre-Walker & Keightley, 2007), and their removal from populations by selection, i.e. background selection, can shape the genomic landscape of variation in a similar way to positive selection on adaptive alleles (Charlesworth et al., 1993; Stephan, 2010). Purging of deleterious mutations by purifying selection removes neutral variation at linked sites, reducing genetic diversity and local effective population size (Charlesworth et al., 1993; Charlesworth, 2012; Cutter & Payseur, 2013).

Like the other processes described in this section, the extent of background selection varies across the genome. Evidence from Drosophila melanogaster suggests it is highest on autosomes, accounting for 58% of the observed variation in nucleotide diversity across 100 kb windows (Comeron, 2014). Simulations based on theoretical approximations show the effects of background selection on patterns of diversity are greatest when deleterious mutation rate is high and recombination rate is low, that is when linked neutral sites are unable to escape via recombination from new mutations entering a population (Charlesworth et al., 1993; Charlesworth, 2012). Background selection should be higher in genomic regions with a high density of coding sequence, where mutations are more likely to have deleterious effects; this is supported by lower diversity in these regions (Lohmueller et al., 2011; Cutter & Payseur, 2013; Enard et al., 2014). Whether or not mutations are deleterious within a coding region may vary with proximity to optimum fitness on an adaptive landscape; when a population is close to maximum fitness, a greater proportion of mutations will be deleterious, causing a shift away from the optimum (Orr, 1998; Cutter & Payseur, 2013). On a genomewide level, drastic reductions in effective population size can limit background selection as the frequencies of new deleterious mutations are more strongly influenced by drift (Charlesworth, 2012).
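
The joint effect of deleterious mutation rate and recombination can be sketched with a classical approximation from the background selection literature, in which diversity at a neutral site in the middle of a region is reduced by a factor B = exp(−U/(2sh + R)). Here U is the diploid deleterious mutation rate for the region, sh the selection coefficient against heterozygotes and R the map length of the region in Morgans. The formula is an assumption brought in for illustration and is not given in the text above; all parameter values are hypothetical.

```python
import math

def bgs_diversity_ratio(u_del, sh, map_length):
    """Approximate pi/pi0 at a neutral site in the middle of a region.
    u_del: diploid deleterious mutation rate for the region
    sh: selection coefficient against heterozygous carriers
    map_length: map length of the region in Morgans"""
    return math.exp(-u_del / (2.0 * sh + map_length))

# Hypothetical region: fixed mutational input, increasing recombination
for r_total in (0.0, 0.01, 0.1, 1.0):
    print(r_total, bgs_diversity_ratio(u_del=0.01, sh=0.01, map_length=r_total))
```

With no recombination, the expression reduces to exp(−U/2sh); as map length increases, the ratio approaches one and linked neutral diversity is largely preserved.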

Despite being different processes, background and positive selection may produce similar patterns of reduced intraspecific diversity and increased interspecific genomic differentiation in genome scans using relative measures like FST (Noor & Bennett, 2009; Cruickshank & Hahn, 2014). Distinguishing between them is important to identify barrier loci under divergent selection and rule out false positives; ideally, positive selection should be tested against a null evolutionary model that incorporates background selection (Cutter & Payseur, 2013; Comeron, 2014; Zeng & Corcoran, 2015; Elyashiv et al., 2016). Predictive models incorporating background selection are able to estimate the contribution of the process to differentiation (Lohmueller et al., 2011; Comeron, 2014; Zeng & Corcoran, 2015; Elyashiv et al., 2016). Similarly, outlier analyses and demographic inferences that account for signatures of background selection are more robust, with fewer false positives (Ewing & Jensen, 2016; Huber et al., 2016). To date, however, only a few studies have attempted to account for background selection in the context of speciation and barrier loci (Roesti et al., 2013; Burri et al., 2015; Delmore et al., 2015; Feulner et al., 2015; Vijay et al., 2016; Christe et al., 2017).

Global selective sweeps of universally adaptive alleles (see Box 1), that is those adaptive in both diverging populations, may also generate signatures similar to barrier loci. Divergence history may involve phases of allopatric isolation, during which universally adaptive mutations can become fixed in only one subpopulation because gene flow is absent (Fig. 1). This generates a peak of differentiation that will decay with the introgression of the adaptive allele to the other subpopulation when contact and gene flow are restored. However, homogenization of allele frequencies after secondary contact does not occur instantaneously, and peaks of differentiation will be maintained during early phases of gene flow, potentially being misinterpreted as indicating barrier loci (see Fig. 1). Similar effects may occur at loci that do not contribute to adaptation to the environment or speciation at all, but that are subject to sexual selection, genomic conflict or drift occurring independently in geographically isolated subpopulations.

Even with continuous gene flow, recent sweeps of universally favourable alleles may temporarily generate high-differentiation peaks. The spread of favourable mutations among subpopulations will take time and can cause temporary allele frequency differences, especially if subpopulations are large or the magnitude of gene flow between them is relatively low. Furthermore, the original hard sweep will strongly reduce diversity in regions flanking the selected locus, leading to a single haplotype at high frequency in the source population. The lag time between mutations arising and spreading means recombination events between the flanking haplotype and others are more likely to occur in the second subpopulation (i.e. a soft sweep). Consequently, different haplotypes will increase in frequency in the second subpopulation, leading to elevated differentiation at regions flanking the selected locus, but not the selected locus itself, generating two adjacent peaks (Bierne, 2010; Roesti et al., 2014). This signature may be distinguishable from a single peak of divergent selection, but only if sufficiently large chromosomal regions are studied. Similar effects might occur at loci where an allele is adaptive in one, but neutral in the other population (i.e. loci with locally adaptive alleles but without divergent selection; Box 1). When these alleles first appear where they are adaptive (e.g. by mutation or gene flow), they might increase in frequency rapidly; where they are neutral, they will increase in frequency due to gene flow, but do so much more slowly (Vatsiou et al., 2016).

Recombination rate variation

With a uniform recombination rate across the genome and at equilibrium, the width of a genomic region of differentiation surrounding a barrier locus is directly proportional to the strength of the barrier effect (Barton & Bengtsson, 1986). In reality, however, recombination rate varies widely across the genome of most species studied (Jensen-Seaman & Furey, 2004). This may be associated with chromosome type (i.e. sex chromosomes vs. autosomes), distance to the centromere, GC content, CpG motifs, transposable elements, polyA and polyT sequences, gene density and recombination modifier genes (Butlin, 2005; Smukowski & Noor, 2011 and references therein), or, on a fine scale, with recombination hot spots (Myers et al., 2010; Massy, 2013). As many of these factors are associated, determining the true cause of recombination rate variation is difficult but its effects on genomic variation are more predictable. A barrier locus will influence a larger genomic region and will increase measures of genetic differentiation (FST) and divergence (dXY) when it occurs in a low-recombination region compared to a high-recombination region (Stephan, 2010; Nachman & Payseur, 2012; Cutter & Payseur, 2013). Therefore, it might be easier to detect in a genome scan, but harder to narrow down to small functional regions or individual nucleotides. This alone is justification enough to account for recombination rate variation when interpreting patterns of differentiation across the genome (Nachman & Payseur, 2012; Roesti et al., 2012). However, a strong correlation between recombination rate and nucleotide diversity (Begun & Aquadro, 1992; Comeron et al., 2008; Cutter & Payseur, 2013) suggests that recombination rate variation can confound interpretation of the genomic landscape in other ways too.

Although recombination has a mutagenic effect, this effect does not appear to be correlated with genomic divergence (Noor, 2008; Charlesworth & Campos, 2014). Indeed, controlling for mutation rate variation shows recombination determines the extent of human–chimpanzee divergence in other ways (Francioli et al., 2015). Background selection reducing genetic diversity in regions of low recombination is a compelling explanation for these patterns (Charlesworth et al., 1993; Cutter & Payseur, 2013). Neutral alleles in low-recombination regions are more frequently in LD with deleterious mutations and so experience a stronger purging effect (Charlesworth et al., 1993; Charlesworth, 2012). This leads to a reduction in within-population diversity, whereas measures of absolute divergence (dXY) remain largely unaffected, provided gene flow is sufficiently low (Charlesworth et al., 1997; Noor & Bennett, 2009; Cruickshank & Hahn, 2014; Zeng & Corcoran, 2015; but see also Phung et al., 2016). However, measures of relative differentiation (FST) will be inflated and some regions may appear as outliers. High differentiation between species has indeed been observed in low-recombination regions, for example close to centromeres and at the end of chromosome arms (Nachman & Payseur, 2012; Roesti et al., 2012). Nonetheless, it remains unclear how low gene flow between populations must be for background selection in recombination cold spots to cause false positive signals of differentiation in outlier scans.
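
The contrast between relative and absolute measures can be made concrete with a Hudson-style definition of FST as one minus the ratio of mean within-population diversity to dXY. The sketch below compares two hypothetical genomic windows with identical dXY; the diversity values are invented for illustration, and the choice of this particular FST formulation is an assumption.

```python
def fst_hudson(pi_1, pi_2, dxy):
    """Relative differentiation as 1 - (mean within-population diversity / dXY)."""
    pi_within = (pi_1 + pi_2) / 2.0
    return 1.0 - pi_within / dxy

# Two hypothetical windows with the same absolute divergence
dxy = 0.010
fst_normal = fst_hudson(0.008, 0.008, dxy)  # typical recombination
fst_cold = fst_hudson(0.004, 0.004, dxy)    # diversity halved by background selection

print(fst_normal, fst_cold)
```

Halving within-population diversity raises FST substantially in the second window even though dXY is unchanged, so the window could appear as an outlier without any local reduction in gene flow.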

Importantly, recombination can influence selection beyond its signature in genome scans. Close linkage between loci prevents their independent evolution. This may obscure the signatures of barrier loci. For example, when a globally highly beneficial allele appears in tight linkage to a barrier locus, one of the alleles at the barrier locus may sweep to high frequency in both populations and would appear to show a signal of high gene flow. Interference between loci could also cause the reverse pattern, and make nonbarrier loci appear as barriers. For example, imagine two tightly linked loci, with one containing an allele that is adaptive in population 1 and neutral in population 2, whereas the other contains an allele that is adaptive in population 2 and neutral in population 1. If recombination is low, the two locally adaptive alleles might never appear together on the same chromosome. In this case, there is one haplotype that is favoured in population 1, and one that is favoured in population 2 – the loci effectively behave like a single locus under divergent selection. In contrast, with high recombination, a haplotype that contains both adaptive alleles would quickly emerge, and would be likely to spread across both populations.

High recombination allows the independent evolution of individual positions, counteracting Hill–Robertson interference (Stephan, 2010; Gossmann et al., 2014). This is clear from the positive relationship between neutral polymorphism and recombination rate observed in multiple species (Begun & Aquadro, 1992; Comeron et al., 2008; Cutter & Payseur, 2013). However, the relationship between recombination rate and nucleotide divergence is less straightforward. For Drosophila, there is evidence for increased nonsynonymous divergence in genome regions where recombination does not occur, suggesting a reduction in the efficacy of purifying selection and the fixation of weakly deleterious substitutions (Haddrill et al., 2007). However, measures of divergence show no relationship with recombination rate variation outside regions where crossing over is absent (Haddrill et al., 2007; Bullaughey et al., 2008). Furthermore, these patterns may be taxon-specific; there is little evidence of decreased efficacy of purifying selection when recombination is absent in primates and covariates such as GC content or gene density are taken into account (Bullaughey et al., 2008). Our expectation of how measures of genomic differentiation and divergence vary with recombination therefore remains unclear; this and the effect of gene flow on the relationship between recombination and efficacy of selection require urgent further investigation. Additionally, regions of reduced recombination may allow existing barrier loci to shield closely linked, newly established barrier loci under weaker selection from stochastic loss (Rafajlović et al., 2016). Clusters of barrier loci may be more likely to evolve in low-recombination regions, and it is possible that regions of reduced recombination evolve because they enhance clustering effects (Yeaman, 2013).

The speciation process can also be expected to alter how recombination varies across the genome; divergent selection between populations connected by gene flow should promote the modification of recombination, for example through chromosomal rearrangements that decrease recombination between barrier loci (Kirkpatrick & Barton, 2006; Ortiz-Barrientos et al., 2016). Because recombination is suppressed in heterokaryotypes, linkage disequilibrium between barrier loci can be maintained within chromosomal rearrangements, and these are expected to show higher differentiation and divergence than collinear regions that will be homogenized by gene flow (Noor et al., 2001). As with other low-recombination regions, alternative explanations must be ruled out. For example, ancient rearrangements, predating speciation, may show inflated divergence and differentiation compared to the genomewide average (Noor & Bennett, 2009).

Gene density

With the large number of assembled and annotated genomes now available, mapping gene positions and estimating gene density is possible for more and more taxa. This has clearly shown that genes are not randomly distributed across the genome (Hurst et al., 2004; Sémon & Duret, 2006; Al-Shahrour et al., 2010). First, genes may cluster and form gene-rich regions, whereas other parts of the genome may contain hardly any functional loci (Nobrega, 2003; Hellsten et al., 2010). Genes may also be grouped by function, and the expression of these groups may be regulated simultaneously (Hurst et al., 2004; Al-Shahrour et al., 2010). The causes for this are not clear but likely involve tandem duplications, chromatin structure and shared regulatory elements (see Hurst et al., 2004 for a review). Irrespective of their cause, clusters of functionally similar and co-expressed genes are likely to be favoured by selection (Hurst et al., 2002; Al-Shahrour et al., 2010), although clustering may also evolve neutrally (Sémon & Duret, 2006). The nonrandom distribution of genes in the genome, as well as their functional grouping, can influence processes acting throughout the genome, playing an important role in shaping the landscape of genomic differentiation.

Functional genomic regions, which include genes as well as transcription factor binding sites, rDNA and regions coding for microRNAs, are more likely to experience positive and background selection than nonfunctional regions, where mutations have little consequence. Because background selection can reduce Ne locally in the genome, a negative correlation between gene density and polymorphism is expected (Nordborg et al., 2005; Hobolth et al., 2011b; Flowers et al., 2012). Similarly, a higher probability of local selective sweeps in these parts of the genome will reduce within-population diversity (Stephan, 2010). High recombination can limit the impact of such reductions in diversity; polymorphism is positively correlated with recombination rate (Hey & Kliman, 2002; Nordborg et al., 2005). Indeed, it has been demonstrated that gene density can show a positive relationship with recombination rate (Duret & Arndt, 2008; Flowers et al., 2012). This may simply be an emergent property of the transcription process (Kim & Jinks-Robertson, 2012). Alternatively, a higher recombination rate in gene-dense regions might be directly favoured by selection, because both positive and negative selection are more efficient when the extent of Hill–Robertson interference between multiple selected sites is reduced (Hey & Kliman, 2002, see also Recombination rate variation).

Importantly, gene density influences the efficacy of selection independently of recombination rate; for example, selection efficiency is negatively correlated with gene density in regions of both high and low recombination (Hey & Kliman, 2002). However, this only holds true above a threshold level of high gene density, suggesting a trade-off between selective interference and the advantages of co-expression of clustered genes (Hey & Kliman, 2002). This potentially has implications for the spatial proximity of barrier loci in the genome. Increased Hill–Robertson effects due to high gene density relative to recombination rate may be advantageous for the maintenance of clusters of adaptive genes under divergent selection. Beneficial combinations are less likely to be broken up, but will take longer to come together. Barrier loci in gene-dense regions may also need higher selection coefficients to overcome the reduction in local effective population size caused by background selection.

The grouping of genes with related functions can also be expected to influence large-scale mechanisms in the speciation process when gene flow is occurring, for example the evolution of inversion polymorphisms or divergence hitchhiking (see Loci linked to barrier nucleotides). Functional grouping means multiple loci affecting the same divergently selected trait or suite of traits may be physically linked (Hurst et al., 2004; Al-Shahrour et al., 2010). Inversions are mainly adaptive if they capture multiple barrier loci (Kirkpatrick & Barton, 2006; Faria & Navarro, 2010), and the potential for capturing multiple barrier loci in an inversion when gene flow is occurring is higher if they are grouped. Divergence hitchhiking occurs when adaptive mutations arise close to an established barrier locus and are shielded from gene flow. When functionally related genes are closely linked, there is a greater density of targets for selection; therefore, any new mutation occurring in the same part of the genome is more likely to be adaptive than if genes are randomly distributed (i.e. they are more likely to occur within a functional region), and this increases the potential for divergence hitchhiking, although Hill–Robertson effects may counteract this. Similarly, new adaptive mutations would also be better protected against stochastic loss (Rafajlović et al., 2016).

Section 3: A road map for the genomic landscape

The genomic landscape of differentiation has now been described in many species. Both the number of examples and the genomic resolution are increasing, with many studies now providing nucleotide-level descriptions for a large proportion of the genome with multiple samples (examples in Table 1). The problem however is not to generate these descriptions but to interpret them; a difficult challenge because we know that the landscape depends on multiple factors. To identify barrier loci, the parameter of primary interest is the local effective rate of gene flow, me. This is determined by the actual migration rate and the local barrier effect, which comprises the direct barrier effect (if any) and the influence of other barrier loci, mediated by linkage disequilibrium. The influence of indirect barrier effects via physical linkage will depend on local recombination rate and gene density. Both direct and indirect effects, in turn, may be confounded by the impact of population history on the genome, itself dependent on local mutation rate, recombination rate, background selection or global and local selective sweeps not related to species specialization and speciation.
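
To illustrate how migration, barrier strength and recombination combine to determine me, the sketch below uses a widely quoted weak-selection approximation for a neutral site linked to a single barrier locus, me ≈ m·r/(s + r). This specific expression is an assumption introduced for illustration and is not given in the text above; it ignores multilocus barriers, background selection and demographic history, and all parameter values are hypothetical.

```python
def effective_migration(m, s, r):
    """Approximate effective migration rate at a neutral site linked to one barrier locus.
    m: actual migration rate
    s: selection against immigrant alleles at the barrier locus
    r: recombination rate between the neutral site and the barrier locus"""
    if s == 0:
        return m  # no barrier: effective and actual migration coincide
    return m * r / (s + r)

# Hypothetical barrier (s = 0.1) viewed at increasing recombination distance
for r in (0.0001, 0.01, 0.5):
    print(r, effective_migration(m=0.01, s=0.1, r=r))
```

Under this approximation, sites tightly linked to the barrier experience almost no effective gene flow, whereas loosely linked sites approach the actual migration rate, which is the pattern a genome scan attempts to detect.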

With so many modifying factors, interacting in complex ways, the prospects for disentangling the genomic landscape might seem bleak. We believe this conclusion premature; in this section, we outline a road map for future research in speciation genomics to overcome the issues faced by the field. Our road map will not be feasible in all study systems, but, together with other perspectives on how speciation should be studied (e.g. Wolf & Ellegren, 2016), it should represent a guideline for researchers to work with. Over the last 15 years, since the publication of Wu's (2001) ‘genic view’, a huge number of empirical studies have provided previously unimagined insight into how speciation has progressed, and this number is still increasing. We argue that, with a carefully considered approach, ongoing speciation research will provide us with an even greater understanding of the ‘mystery of mysteries’.

Step 1: Know the study system

Although perhaps obvious, a strong biological background for a study system cannot be overemphasized. Many of the most insightful recent speciation genomics studies have been on taxa with a rich literature on many aspects of their biology such as three-spined sticklebacks (McKinnon & Rundle, 2002; Jones et al., 2012), Rhagoletis flies (Egan et al., 2015) and African cichlids (Keller et al., 2013; Brawand et al., 2014). This background includes a solid understanding of the ecology, reproductive biology, life history strategies and geographical distribution with a special focus on phylogeography and evolutionary history. Crucially, genetic data should be supplemented with other evidence, from a variety of sources such as fossil and historical records or experimental data on movement between populations, to constrain the range of testable scenarios and to provide limits on parameter estimates. Information on the mechanisms of pre- and post-zygotic isolation and the contributions of different components to overall isolation will also aid in the interpretation of barrier loci. Knowledge of the biological background of a system should be used to inform sampling strategies. We additionally recommend broadening the geographical and taxonomic range of sampling where possible to account for unsuspected sources of introgression (e.g. Martin et al., 2015).

Step 2: Establish the extent of gene flow and understand the demographic history

Gene flow is clearly fundamental for studying the genomic basis of reproductive isolation. A study system should therefore be sampled where divergent populations or species meet (Marques et al., 2016; McGee et al., 2016). Testing for and quantifying the extent of gene flow is a crucial prerequisite for interpreting genomic analyses correctly; ideally both genomic and additional evidence of gene flow (e.g. individuals in natural populations showing evidence of introgression) should be identified. Admixture-based approaches that use reference populations of ‘pure’ individuals to infer the ancestry of chromosomal segments in admixed populations are extremely useful in this regard as they can provide locus-specific estimates of ancestry (Hoggart et al., 2004; Price et al., 2009). Recent implementations of these approaches make use of high-density genomic markers, incorporate linkage information, can infer haplotype phase in admixed individuals and are also able to estimate the timing of admixture events (Price et al., 2009; Churchhouse & Marchini, 2013). The quantification of gene flow can also be explicitly linked to an understanding of the demographic history of a pair of populations or species. Reconstructing the evolutionary history is desirable as it can have important effects on the genomic landscape (see Section 2: Demographic and evolutionary history). Care should be taken to distinguish between population level processes such as fluctuations in effective population size (Li & Durbin, 2011) and genomewide variation in demographic parameters (Roux et al., 2014, 2016). Fortunately, both can be incorporated into flexible hypothesis testing frameworks such as coalescent modelling and Approximate Bayesian Computation (Ewing & Jensen, 2016; Roux et al., 2016). Future focus for estimating and quantifying gene flow should also be placed on approaches that exploit the predictable erosion of genomic blocks introduced by admixture over time (Baird et al., 2003). 
Such tools, which require accurate phasing, use the distribution of migrant or ‘identity-by-state’ tracts in the genome to estimate the timing and extent of gene flow (Pool & Nielsen, 2009; Harris & Nielsen, 2013). Given the importance of this step to further understand the genomic landscape, Box 3 discusses methods that are useful to test for the presence of gene flow and to infer demographic history in more detail.
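
The logic of tract-based dating can be sketched with a deliberately simple approximation: after a single pulse of admixture with a small admixture proportion, the mean length of migrant tracts is roughly 1/t Morgans, where t is the number of generations since admixture. The code below is an illustration under these stated assumptions and is not the method implemented by the tools cited above; it further assumes a uniform recombination rate, error-free phasing and tract calling, and hypothetical tract lengths.

```python
def admixture_time_from_tracts(tract_lengths_bp, cm_per_mb):
    """Rough estimate of generations since a single admixture pulse,
    using mean tract length (in Morgans) ~ 1 / t."""
    if not tract_lengths_bp:
        raise ValueError("at least one tract length is required")
    mean_bp = sum(tract_lengths_bp) / len(tract_lengths_bp)
    # Convert physical length to genetic length assuming a uniform rate
    mean_morgans = (mean_bp / 1e6) * cm_per_mb / 100.0
    return 1.0 / mean_morgans

# Hypothetical migrant tracts from one admixed genome, at 1 cM/Mb
tracts = [2_000_000, 500_000, 1_500_000]
print(admixture_time_from_tracts(tracts, cm_per_mb=1.0))
```

Because the estimate depends directly on genetic rather than physical length, errors in the local recombination map translate into proportional errors in the inferred timing.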

Step 3: Capture the best possible picture of the genomic landscape

A wealth of next-generation sequencing approaches exists, nearly all of which have been used in a genome-scan context (Table 1). Relatively inexpensive and easy to apply to nonmodel organisms, reduced-representation techniques such as RAD-seq, RNA-seq and target capture sequencing have quickly gained ground as popular tools for population genomics (Davey et al., 2011; Andrews et al., 2016). These methods can clearly identify patterns of heterogeneity and outlier loci (examples in Table 1). They have also successfully been used to reconstruct population history (Shafer et al., 2015), estimate genomewide recombination rate variation (Roesti et al., 2013) and identify signatures of selection (Roesti et al., 2015). Although de novo assembly of reduced-representation markers can prove useful for identifying outlier loci (Le Moan et al., 2016; Ravinet et al., 2016; Rougemont et al., 2017), ideally a reference genome and genetic map are required to place markers in a genomic context. With such resources, it is possible to test whether divergent loci cluster in the genome (Renaut et al., 2013; Marques et al., 2016), to estimate the size of differentiated regions (Nadeau et al., 2012, 2013) and to ask whether higher differentiation is found predominantly in regions of low recombination (Roesti et al., 2013; Tine et al., 2014; Delmore et al., 2015; Marques et al., 2016). However, reduced-representation sequencing may not always be the ideal choice for identifying barrier loci because of its relatively low genome coverage (e.g. 0.45% of 0.4-Gb three-spined stickleback genome; Hohenlohe et al., 2010). Markers will rarely be the direct targets of selection. In low-recombination regions, physical distance between barrier loci and markers that are outliers is likely to be large; in high-recombination regions, barrier loci are less likely to be detected in the first place as the scale of LD is small. 
Furthermore, these methods may bias studies in favour of identifying barrier loci with single nucleotide substitutions, overlooking structural variants, rearrangements and changes in genome organization that can only be detected reliably using long-insert mate-pair libraries (Jones et al., 2012) or long-read technologies (English et al., 2012). Most importantly, users should be aware of the pitfalls and biases unique to each different reduced-representation method that may ultimately distort the picture of the genomic landscape, for example null alleles and sequence length bias in RAD-seq (Davey et al., 2013; Gautier et al., 2013; Ravinet et al., 2016) and bias towards conserved genic regions or overexpressed alleles in RNA-seq (Hoban et al., 2016).
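
The consequences of sparse marker sampling are easy to quantify before data collection. The sketch below computes the fraction of the genome sequenced and the mean spacing between markers for a hypothetical reduced-representation design; the numbers are invented for illustration and do not reproduce any study cited above.

```python
def rad_coverage(n_loci, locus_length_bp, genome_size_bp):
    """Return (fraction of genome sequenced, mean spacing between loci in bp)."""
    fraction = n_loci * locus_length_bp / genome_size_bp
    mean_spacing_bp = genome_size_bp / n_loci
    return fraction, mean_spacing_bp

# Hypothetical design: 50,000 loci of 100 bp in a 500-Mb genome
fraction, spacing = rad_coverage(50_000, 100, 500_000_000)
print(fraction, spacing)
```

Comparing the mean spacing with the expected scale of LD in the study system gives a rough indication of whether barrier loci in high-recombination regions are likely to be tagged by any marker at all.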

Whole-genome resequencing is becoming increasingly affordable as an alternative to reduced-representation approaches and has been used successfully in multiple taxa (see Table 1 for examples). Although it still requires a well-assembled reference, resequencing provides good genomewide coverage and is likely to cover barrier loci, unlike reduced-representation approaches. Furthermore, resequencing can help to identify structural variation, such as duplications, copy number variation, translocations and inversions that prove elusive with a reduced marker set. Hybrid assemblies combining both long- and short-read technologies have proven successful in producing high-quality assemblies incorporating structural variation (English et al., 2012; Wang et al., 2015). Nonetheless, difficult-to-assemble features such as highly repetitive regions are likely to be missed even with new approaches (Hoban et al., 2016). Long-read technologies can also facilitate accurate phasing, an important consideration if downstream analyses will require haplotype information. For those with fewer resources, resequencing might seem daunting. However, a feasible option is to sequence a small number of individuals (i.e. one or two) to high depth and many other individuals to much lower depth (Glazer et al., 2015). This mixed-depth design also allows the high-depth data to be used for other purposes such as demographic inference, genome annotation and assessing structural variation. Pool-seq, i.e. sequencing pooled population samples, with barcodes assigned to pools rather than to individuals, can also be used to estimate population allele frequencies and reduce sequencing costs (Schlötterer et al., 2014; Christe et al., 2017).
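As a minimal sketch of how pooled data yield allele frequencies, the example below takes reference and alternate read counts per SNP and returns the alternate-allele frequency for each pool. The read counts and the `min_depth` threshold are hypothetical, and the estimator ignores sequencing error and unequal contributions of individuals to the pool, both of which dedicated Pool-seq tools attempt to model.

```python
def pool_allele_frequency(ref_reads, alt_reads, min_depth=20):
    """Estimate the alternate-allele frequency in a pool from read counts.

    Returns None when total depth is below min_depth (a hypothetical cut-off),
    because frequencies from few reads are dominated by sampling noise.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None
    return alt_reads / depth


# Hypothetical (ref, alt) read counts at three SNPs in two population pools.
pool_a = [(45, 5), (30, 30), (8, 2)]
pool_b = [(10, 40), (28, 32), (50, 0)]

for (ref_a, alt_a), (ref_b, alt_b) in zip(pool_a, pool_b):
    freq_a = pool_allele_frequency(ref_a, alt_a)
    freq_b = pool_allele_frequency(ref_b, alt_b)
    # Only compare pools when both pass the depth filter.
    diff = None if freq_a is None or freq_b is None else abs(freq_a - freq_b)
    print(freq_a, freq_b, diff)
```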

Step 4: Measure genomic factors that contribute to the differentiation landscape

Measuring factors influencing the genomic landscape is difficult, but not insurmountable. Genomewide recombination rate variation can be documented by mapping in experimental crosses (Roesti et al., 2013) or pedigrees (Kong et al., 2002; Kawakami et al., 2014). LD-based methods using population genetic data are also able to estimate average realized recombination across the population and over time (Tine et al., 2014), which may be more relevant in the landscape context (Smukowski & Noor, 2011). Whichever approach is used, high-density genomic markers and large numbers of individuals are essential as it is clear that recombination rate can vary on a small genomic scale (Roesti et al., 2013; Kawakami et al., 2014). Furthermore, if possible, a comparative recombination mapping approach, that is using all taxa studied, should be taken to account for differences between closely related species (Renaut et al., 2013).
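For cross- or pedigree-based maps, local recombination rate is commonly summarized as genetic distance per unit physical distance (cM/Mb) between adjacent markers. The sketch below illustrates that calculation; the marker positions are hypothetical, and the function assumes markers are already ordered along a single chromosome.

```python
def local_recombination_rates(markers):
    """Compute cM/Mb between adjacent markers.

    markers: list of (physical_position_bp, genetic_position_cM) tuples,
             sorted by physical position on one chromosome.
    Returns a list of (start_bp, end_bp, cM_per_Mb) tuples.
    """
    rates = []
    for (bp1, cm1), (bp2, cm2) in zip(markers, markers[1:]):
        span_mb = (bp2 - bp1) / 1e6
        if span_mb <= 0:
            raise ValueError("markers must be sorted with distinct physical positions")
        rates.append((bp1, bp2, (cm2 - cm1) / span_mb))
    return rates


# Hypothetical map: (physical position in bp, genetic position in cM).
example_map = [(0, 0.0), (1_000_000, 2.0), (3_000_000, 3.0)]
for start, end, rate in local_recombination_rates(example_map):
    print(f"{start}-{end} bp: {rate:.2f} cM/Mb")
```

With sparse markers, each interval averages over any fine-scale variation within it, which is why high marker density matters.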

Directly measuring genomewide variation in mutation rate is likely to be more difficult, especially in nonmodel organisms with long generation times. Estimates at putatively neutral sites using phylogenetic methods remain valuable (Kondrashov & Kondrashov, 2010; Scally & Durbin, 2012). However, these estimates are prone to bias depending on the timescale over which they are estimated (Ho et al., 2005; Ho, 2014), and they do not incorporate deleterious or weakly deleterious mutations: that is, they are substitution rates, not mutation rates. If possible, whole-genome sequencing within families using parent–offspring trios provides a direct measurement of genomewide mutation rate heterogeneity and also allows classification of mutations as adaptive, deleterious or neutral (Francioli et al., 2015). Mutation accumulation lines offer an experimental approach in laboratory-based populations: natural selection is reduced and mutations are allowed to accumulate even if they would otherwise have negative fitness consequences (Ness et al., 2015).
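The trio-based estimate reduces to simple arithmetic: the number of de novo mutations observed in offspring divided by twice the number of callable sites, because each diploid offspring carries two transmitted haploid genomes. The following sketch pools counts across trios; all numbers are hypothetical, and it omits the corrections for false-negative and false-positive calls that real analyses apply.

```python
def trio_mutation_rate(de_novo_counts, callable_sites):
    """Per-site, per-generation mutation rate pooled across trios.

    de_novo_counts: de novo mutations detected in each offspring.
    callable_sites: callable haploid genome positions for each trio.
    The factor 2 accounts for the two haploid genomes inherited per offspring.
    """
    if len(de_novo_counts) != len(callable_sites):
        raise ValueError("one callable-site count is needed per trio")
    total_mutations = sum(de_novo_counts)
    total_sites = sum(callable_sites)
    return total_mutations / (2 * total_sites)


# Hypothetical data for two trios.
rate = trio_mutation_rate([60, 40], [2_500_000_000, 2_500_000_000])
print(f"{rate:.2e} mutations per site per generation")
```

Applying the same calculation within genomic windows would give a window-level picture of mutation rate heterogeneity, although very many trios are needed before per-window counts become informative.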

Precise genome annotation, aided with transcriptomic data, should also mean that measures of gene density are feasible for most organisms following genome assembly (Hurst et al., 2002; Al-Shahrour et al., 2010). However, greater effort needs to be made to better annotate regions that are not protein-coding but still play a functional role, for example regulatory regions. Importantly, measuring gene density via annotation may also provide insight into other confounding factors influencing the genomic landscape, potentially overcoming limitations for nonmodel organisms. For example, recombination hot spots may be predicted by identifying transposons and sequence motifs recognized by recombination modifier genes (Myers et al., 2010). Similarly, models using the spatial distribution of CpG dinucleotides, flanking sequence and other mutation rate modifiers could potentially be used to estimate mutation rate variation (Francioli et al., 2015; Ness et al., 2015).
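One simple way to express gene density is the fraction of each genomic window covered by annotated genes. The sketch below assumes gene coordinates are available as half-open `(start, end)` intervals on a single chromosome; the coordinates and window size are hypothetical, and overlapping annotations are merged so that no base is counted twice.

```python
def gene_density(genes, chrom_length, window):
    """Fraction of each non-overlapping window covered by annotated genes.

    genes: list of (start, end) half-open intervals in bp on one chromosome.
    Returns one density value per window, in chromosome order.
    """
    # Merge overlapping or adjacent gene intervals.
    merged = []
    for start, end in sorted(genes):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    densities = []
    for w_start in range(0, chrom_length, window):
        w_end = min(w_start + window, chrom_length)
        # Sum the overlap between the window and every merged gene interval.
        covered = sum(max(0, min(e, w_end) - max(s, w_start)) for s, e in merged)
        densities.append(covered / (w_end - w_start))
    return densities


# Hypothetical annotation on a 2-kb toy chromosome, 1-kb windows.
print(gene_density([(0, 500), (400, 1000), (1500, 1600)], 2000, 1000))
```

The same windowing could be applied to other annotated features, such as regulatory elements, where those annotations exist.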

Step 5: Identify selection at barriers, taking modifying factors into account

To identify the signature of divergent selection or barrier loci reliably, controlling for factors that modify or mimic such a signature is essential. Previous work has attempted to do this, at least in part, for example removing the effects of recombination rate variation by either correcting local estimates of differentiation for regional differentiation (Roesti et al., 2012), correlating differentiation with recombination rate (Renaut et al., 2013) or focusing on barrier loci in high-recombination regions (Marques et al., 2016). Clearly, much of the focus to date has been on recombination rate variation, although mutation rate has been tentatively linked to genomic differentiation using indirect measures such as synonymous divergence (dS; Renaut et al., 2014). Human–chimpanzee sequence divergence models incorporating both mutation and recombination rate variation also show promise in partitioning these effects (Francioli et al., 2015).
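A generic way to illustrate the idea of controlling for recombination rate is to regress windowed differentiation on windowed recombination rate and inspect the residuals. The sketch below uses ordinary least squares on hypothetical window values; it is not the specific procedure of any study cited above, and a linear fit is only an assumption made for simplicity.

```python
def regression_residuals(x, y):
    """Residuals of y after a least-squares linear regression on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical per-window values: recombination rate (cM/Mb) and FST.
recombination = [0.5, 1.0, 2.0, 4.0, 4.0]
fst = [0.30, 0.25, 0.15, 0.05, 0.35]

residuals = regression_residuals(recombination, fst)
# The window with the largest positive residual is more differentiated
# than its recombination rate alone would predict.
top_window = max(range(len(residuals)), key=lambda i: residuals[i])
print(top_window, round(residuals[top_window], 3))
```

In this toy data set the last window has high differentiation despite high recombination, so it stands out once recombination is accounted for, whereas the raw FST values alone would also flag the low-recombination windows.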

Ultimately, the aim should be to infer selection with models that account for variation in multiple confounding factors. It is now possible to detect hard selective sweeps in a single population by including fixed differences with an outgroup to account for mutation rate variation and by scaling the site frequency spectrum by estimates of background selection derived from mutation and recombination rate variation and genome annotation data (Huber et al., 2016). However, this has yet to be extended to cases of divergence with gene flow. Methods using genomewide measures of recombination rate variation and nucleotide diversity to estimate the intensity and timing of selection and gene flow are also now available and can be extended to include background selection (Aeschbacher et al., 2016). Examining the genomic landscape with local estimates of recombination rate, mutation rate and gene density allows us to ask whether we need to invoke divergent selection and gene flow to explain peaks of high differentiation (Cruickshank & Hahn, 2014).

However, such methods can only be used if independent measurements of these factors (see Step 4) are combined with genome-scan data. Long-range haplotype tests such as extended haplotype homozygosity (EHH) or integrated haplotype score (iHS) incorporating linkage disequilibrium among sites offer a model-free alternative for detecting selection and have higher power than site frequency spectrum or differentiation approaches (Sabeti et al., 2002; Voight et al., 2006) although this power is diminished when migration occurs (Vatsiou et al., 2016). Similar statistics such as the singleton density score (SDS) use haplotype information to measure the distribution of distances between singleton mutations, with the expectation that haplotypes undergoing a recent sweep will carry fewer singletons and thus have a greater average distance among them (Field et al., 2016).
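To show what EHH measures, the sketch below computes it from phased haplotypes coded as 0/1 lists: among chromosomes carrying a chosen core allele, EHH at a given site is the probability that two randomly drawn chromosomes are identical across the whole interval between the core and that site. The haplotypes are hypothetical, and the function is a bare-bones illustration that omits the integration and standardization steps used for iHS.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, core_allele=1):
    """EHH at every site for chromosomes carrying core_allele at index core.

    haplotypes: list of equal-length lists of 0/1 alleles (phased).
    Returns a dict mapping site index to EHH.
    """
    carriers = [h for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two chromosomes carrying the core allele")
    total_pairs = comb(n, 2)

    result = {}
    for site in range(len(haplotypes[0])):
        lo, hi = min(core, site), max(core, site)
        # Group carriers by their haplotype across the interval core..site.
        groups = Counter(tuple(h[lo:hi + 1]) for h in carriers)
        identical_pairs = sum(comb(count, 2) for count in groups.values())
        result[site] = identical_pairs / total_pairs
    return result


# Hypothetical phased haplotypes (rows) at four SNPs (columns).
haps = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
print(ehh(haps, core=2))
```

EHH equals 1 at the core by definition and decays with distance as recombination and mutation break up the shared haplotype; a slow decay around one allele is the pattern these tests look for.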

Systems of parallel divergence or speciation may also be helpful in separating the effects of various factors (Irwin et al., 2016). For example, when recombination rate variation is correlated among closely related taxa, high differentiation in low-recombination regions that appear in multiple species pairs is more likely to have arisen due to background selection (Burri et al., 2015). This is especially true if contrasts involve different types of barriers to gene flow, and if the same highly differentiated regions occur in comparisons with and without gene flow. However, as a caveat, differentiated regions shared among contrasts may sometimes still be due to loci under divergent selection. Disentangling these explanations is only possible with information on gene density, mutation rate, the types of barriers involved, and the history of gene flow. Nonetheless, even with these data we can still only identify candidate barrier regions: experimental and functional approaches are necessary to identify barrier loci unequivocally.

Step 6: Independent evidence for barrier loci

Crucially, genomic data alone cannot provide conclusive evidence of barrier loci. Disentangling effects is difficult precisely because some modifying factors (e.g. demographic history) are estimated from data used to measure the landscape of differentiation. Even with good genomic evidence of selection on a candidate region, other processes, such as local adaptation following or unrelated to speciation, can be invoked (Cruickshank & Hahn, 2014). For this reason, the search for evidence of selection should extend beyond the genome scan. In principle, there are two ways of obtaining independent evidence for selection; we can either directly test for signatures of selection on a given locus; or we can test for a link between the genotype and the phenotype (i.e. via genetic mapping), and separately test for selection on the phenotype (Table 2). The advantage of the former is that it provides a more direct test of selection; the advantage of the latter is that knowing the associated phenotypic change allows for a complete ‘story’ and a better understanding of the system.

Selection experiments in the field or laboratory, followed by genomewide or candidate locus sequencing, are an excellent example of the former approach (Soria-Carrasco et al., 2014; Egan et al., 2015). Although not possible in all organisms, such studies have already identified loci involved in reproductive isolation and adaptive divergence (Colosimo et al., 2005; Barrett et al., 2008; Arnegard et al., 2014). Genomic data beyond the binary sampling often used for outlier scans can also be very helpful to collect independent evidence of selection. For example, barrier loci are expected to show steep allele frequency clines in regions where gene flow is occurring (e.g. Trier et al., 2014; see also Box 2). Data from instances of parallel divergence may also be used to test whether the same genomic regions show differentiation repeatedly (although see caveats described in Step 5; Table 2).
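The notion of a "steep" cline can be illustrated with a logistic (sigmoid) cline function, a common descriptive model in which cline width is defined as the inverse of the maximum slope at the cline centre. The sketch below is only an illustration of that model: the transect positions, centre and widths are hypothetical, and fitting such a curve to real data would require a likelihood or least-squares procedure that is not shown here.

```python
from math import exp


def cline_frequency(x, centre, width):
    """Expected allele frequency at transect position x under a logistic cline.

    width is defined as 1 / (maximum slope), so narrower clines are steeper.
    """
    return 1.0 / (1.0 + exp(-4.0 * (x - centre) / width))


def frequency_change(start, end, centre, width):
    """Change in expected allele frequency between two transect positions."""
    return cline_frequency(end, centre, width) - cline_frequency(start, centre, width)


# Hypothetical transect (km) with the contact zone centred at 50 km.
narrow = frequency_change(45, 55, centre=50, width=5)    # candidate barrier locus
wide = frequency_change(45, 55, centre=50, width=40)     # background locus
print(f"narrow cline: {narrow:.2f}, wide cline: {wide:.2f}")
```

A locus whose frequency shifts much more sharply across the contact zone than the genomic background is a candidate for contributing to a barrier, although steepness alone does not identify the cause.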

Various approaches have been used to test associations of candidate loci with divergent phenotypes (or, ideally, phenotypes for which tests of divergent selection have been performed), including QTL crossing experiments, association and admixture mapping. Genetic mapping of phenotypic traits also has the added advantage of testing for genotype-by-genotype interactions (Wu et al., 2007), an important aspect of barrier loci which cannot be identified from genomic differentiation alone. Combining mapping with genome-scan data can help identify when QTL coincide with outlier loci and also provides further evidence that these loci are under selection in the wild (Via & West, 2008; Renaut et al., 2010; Berner et al., 2014). Differences in gene expression between populations at candidate genes under divergent selection might also be informative (Poelstra et al., 2014). In systems where decent genome annotation exists, this may identify associations between candidate loci and known divergent traits (Lamichhaney et al., 2015, 2016).

Nonetheless, the majority of these approaches stop short of directly demonstrating how a barrier allele alters the function to produce phenotypic consequences and ultimately results in reproductive isolation (Seehausen et al., 2014). In some cases, molecular assays of protein function are possible; but often conclusive evidence is only really possible using transgenic or gene interference methods which to date have largely been limited to model organisms such as Drosophila (Thomae et al., 2013; Satyaki et al., 2014; Phadnis et al., 2015). With the rapid adoption of CRISPR, a method applicable to a much wider range of organisms, transgenic experiments are likely to become an important part of speciation research (Bono et al., 2015). Gene insertion, knockouts and reciprocal transplant experiments, for example, will be able to provide direct evidence of barrier nucleotide function in nonmodel organisms (Bono et al., 2015).

Concluding remarks

The genomic landscape of speciation is, like the process itself, complex. A wide variety of processes and mechanisms can shape differentiation and divergence between species pairs, beyond divergent selection and gene flow. As with a true physical landscape, determining which processes have played an important role in its formation is difficult but not impossible. Accounting for modifying factors in genome-scan data will undoubtedly require sophisticated approaches but will also need additional evidence such as independent measures of recombination and mutation rate variation, and, perhaps most importantly, independent evidence for selection (e.g. from experiments). The field of speciation genomics is already progressing towards disentangling modifying factors and directly measuring selection on candidate loci in the field and the laboratory, with a greater emphasis on experimental design and new analysis methods. Furthermore, with new molecular tools and more advanced sequencing technologies on the horizon, conclusive evidence for barrier loci will likely become easier to achieve for those working outside the realm of model species. We look forward to further developing our understanding of how genomic heterogeneity evolves and to seeing how such knowledge can be used to identify loci involved in reproductive isolation with greater precision and reliability.

Acknowledgments

Many of the ideas for this review were first formulated from discussions between co-authors during our symposium on ‘The genomic landscape of speciation’ at ESEB 2015, Lausanne, Switzerland. We were kindly sponsored by Floragenex, Oregon, USA, and also by Stab Vida, Portugal. We are grateful to Mike Ritchie, Jeffrey Feder and an anonymous reviewer for their comments on an earlier draft of this manuscript. Mark Ravinet was funded by a JSPS Postdoctoral Fellowship for Foreign Researchers and by the Norwegian Research Council. RF was funded by FCT under the Programa Operacional Potencial Humano – Quadro de Referência Estratégico Nacional from the European Social Fund and the Portuguese Ministério da Educação e Ciência (SFRH/BPD/89313/2012, PTDC/BIA-EVF/113805/2009 and FCOMP-01-0124-FEDER-014272) as well as by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 706376. JG was funded by a postdoctoral fellowship from Xunta de Galicia (Modalidade B). AMW and RKB are funded by NERC. BM and MRaf are supported by the Centre for Marine Evolutionary Biology, University of Gothenburg, Sweden. MRaf is additionally supported by the Adlerbert Research Foundation. NB is funded by ANR (HYSEA project, ANR-12-BSV7-0011).
