A survey of genome-wide single nucleotide polymorphisms through genome resequencing in the Périgord black truffle (Tuber melanosporum Vittad.)
Abstract
The Périgord black truffle (Tuber melanosporum Vittad.), considered a gastronomic delicacy worldwide, is an ectomycorrhizal filamentous fungus that is ecologically important in Mediterranean French, Italian and Spanish woodlands. In this study, we developed a novel resource of single nucleotide polymorphisms (SNPs) for T. melanosporum using Illumina high-throughput resequencing. The genomes of six T. melanosporum geographical accessions were sequenced to a depth of approximately 20×. These geographical accessions were selected from different populations within the northern and southern regions of the geographical species distribution. Approximately 80% of the reads for each of the six resequenced geographical accessions mapped against the reference T. melanosporum genome assembly, from which the core genome size of this organism was estimated to be approximately 110 Mbp. A total of 442 326 SNPs, corresponding to 3540 SNPs/Mbp, were identified across the seven genomes. The SNPs occurred more frequently in repeated sequences (85%), although 4501 SNPs were also identified in the coding regions of 2587 genes. Using the ratio of nonsynonymous mutations per nonsynonymous site (pN) to synonymous mutations per synonymous site (pS) and Tajima's D index scanning the whole genome, we were able to identify genomic regions and genes potentially subjected to positive or purifying selection. The SNPs identified represent a valuable resource for future population genetics and genomics studies.
Introduction
Single nucleotide polymorphisms (SNPs) have attracted a great deal of interest in the scientific community (Ganal et al. 2009). Large-scale SNP identification is already widely used in human and plant genetics and, owing to technological developments and the resulting lower costs of high-throughput sequencing, is now also available for filamentous fungi (e.g. Fusarium graminearum, Cuomo et al. 2007; Coccidioides spp., Neafsey et al. 2010; Leptographium longiclavatum, Ojeda et al. 2014; Blumeria graminis, Wicker et al. 2013; Rhizoctonia solani, Hane et al. 2014). The value of SNPs in comparison with microsatellite markers lies in their abundance throughout the genome, their biallelic nature and their high potential for automation (Brumfield et al. 2003). In addition, they are easy to study, follow relatively robust mutation models and are easily genotyped in large panels. SNPs can be used in population analyses for studying demographic and historical patterns based on large numbers of samples. Compared with microsatellites, which are essentially neutral markers that do not permit a high density, genome-wide SNP surveys offer a higher probability of identifying adaptation signatures. For example, a large-scale SNP analysis uncovered an adaptation signature for temperature in Neurospora crassa (Ellison et al. 2011).
In Tuber melanosporum, the genetic diversity and phylogeography have been investigated with different molecular markers such as random amplified polymorphic DNA (RAPD), microsatellites, SNPs in the internal transcribed spacer (ITS) of ribosomal DNA and inter-simple sequence repeats (ISSR; Bertault et al. 1998; Murat et al. 2004; Riccioni et al. 2008; García-Cunchillos et al. 2014). These studies pointed to an important effect of the last glaciations (from 120 000 to 11 000 years ago; Van Andel & Tzedakis 1996) on the truffle population structure. For example, two putative post-glacial recolonization routes were hypothesized using 10 SNPs in ITSs (Murat et al. 2004). Through the use of microsatellites and ISSR fingerprinting, it has also been suggested that glacial refuges existed in Italy and Spain (Riccioni et al. 2008; García-Cunchillos et al. 2014).
Following the publication of the T. melanosporum genome sequencing project (Martin et al. 2010), this species became one of the model species for studying ectomycorrhizal ascomycetes (Kües & Martin 2011). Ectomycorrhizal fungi are an important group of fungi, because they promote the growth of trees in forests and woodlands by providing the trees with water and nutrients (Smith & Read 2010). Using the T. melanosporum genome, highly polymorphic microsatellite markers were developed and used to characterize small-scale spatial genetic diversity in two truffle orchards, identifying a pronounced spatial genetic structure with numerous small-sized genets (Murat et al. 2013b). These results suggested that T. melanosporum relies heavily on sexual reproduction. Microsatellite-based population genetic analyses have allowed for the investigation of a small proportion of the T. melanosporum genome, but with a lower probability of detecting the genomic regions involved in the species' adaptation. Using the whole T. melanosporum genome in combination with high-throughput sequencing technologies, large-scale SNP surveys now make it possible to perform an exhaustive investigation of the genomic variation.
The aim of this study was to assess the overall genetic diversity of T. melanosporum by identifying and mapping the SNPs in resequenced genomes. The genomes of six geographical accessions of T. melanosporum were sequenced using Illumina technology and compared to strain Mel28 as the reference genome (Martin et al. 2010). To improve the chances of finding genetic polymorphisms, the resequenced geographical accessions were from samples in different populations within the northern and southern geographical limits of the species distribution. This SNP resource will be useful for more in-depth investigations of T. melanosporum population structure, gene flow and putative ecotype identifications, as well as of selected genes and genomic regions.
Materials and Methods
Sampling and DNA extraction
Tuber melanosporum Vittad. (Ascomycota, Pezizomycotina, Pezizomycetes, Pezizales, Tuberaceae) is native to France, Italy and Spain. Our sampling strategy aimed to cover the natural geographical range of the species as well as the different climates (Mediterranean and continental) where this truffle is produced. The Mel28 isolate (referred to as France-Pro in this study) used for sequencing the reference genome was harvested in southern France (Saint Rémy de Provence, Bouches du Rhône, France; Martin et al. 2010). Three T. melanosporum ascocarps were harvested from France (Alps, Burgundy and Alsace), one from Italy (Umbria) and two from Spain (Castilla Leone; Fig. S1; Table 1). Within a few days of harvesting, each ascocarp was shipped to the laboratory and thoroughly washed, and the inner section (i.e. gleba) was conserved at −20 °C pending the DNA extraction.
Sample name | Code in manuscript | Locality | Region | Country | Climate | Sequence origin |
---|---|---|---|---|---|---|
091215-1 | Spain-1 | Sierra de Alcaraz | Castilla Leone | Spain | Mediterranean | This study |
091215-4 | Spain-2 | Sierra de Vianos | Castilla Leone | Spain | Mediterranean | This study |
100104-1 | France-Bur | Courban | Burgundy | France | Continental | This study |
100120-1 | France-Als | Rouffach | Alsace | France | Continental | This study |
100122-1 | Italy | Perugia | Umbria | Italy | Mediterranean | This study |
100303-4 | France-Alp | Chorges | Provence-Alpes-Côte d'Azur | France | Alps (> 1000 m altitude) | This study |
mel28 | France-Pro | St Rémy de Provence | Provence-Alpes-Côte d'Azur | France | Mediterranean | Martin et al. (2010) |
Total DNA was extracted from 500 mg of gleba using a modified CTAB (cetyl trimethyl ammonium bromide) protocol. After grinding the samples in liquid nitrogen, they were incubated for 30 min at 65 °C in 2.5 volumes of buffer A (0.35 M sorbitol; 0.1 M Tris-HCl, pH 9; and 5 mM EDTA, pH 8), 2.5 volumes of buffer B (0.2 M Tris-HCl, pH 9; 50 mM EDTA, pH 8; 2 M NaCl; and 2% CTAB) and 1 volume of buffer C (5% of Sarkosyl; N-lauroylsarcosine sodium salt) in 50-mL Falcon tubes. After the incubation, a 0.33 volume of potassium acetate (5 M) was added, and the tubes were incubated for 30 min on ice to precipitate the polysaccharides. After centrifugation at 5000 × g for 20 min, the supernatant was purified with 1/10 volume of ammonium acetate (3 M) and 1 volume of chloroform:isoamyl alcohol (24:1) in a Falcon tube and centrifuged at 4000 × g for 10 min. The aqueous phase was transferred to a Nalgene tube (Fisher Scientific, France) and incubated with 100 μL RNase A (10 mg/mL) for 30 min at 37 °C. The DNA was precipitated with a 1/10 volume of ammonium acetate (3 M) and 1 volume of isopropanol at room temperature for 5 min and centrifuged at 10 000 × g for 10 min. The pellet was suspended in 2 mL of QBT buffer (0.75 M NaCl; 50 mM MOPS, pH 7.0; 15% isopropanol; and 0.15% Triton X-100) and purified using Genomic-tip 100/G columns (Qiagen Cat# 10243) following the manufacturer's instructions with the exception of the QC buffer (1.35 M NaCl; 50 mM MOPS, pH 7.0; 15% isopropanol). The higher NaCl concentration (1.35 M instead of 1 M) in the QC buffer allowed for the exclusion of small DNA fragments from the column. The purified DNA was then concentrated by precipitation with 1/10 volume of ammonium acetate (3 M) and 1 volume of isopropanol at room temperature for 5 min and centrifuged at 10 000 × g for 10 min. After discarding the supernatant, the pellet was resuspended in 100 μL of TE buffer and stored at −20 °C.
Riccioni et al. (2008) showed that the gleba of T. melanosporum is formed by a haploid maternal mycelium. The DNA isolated from each ascocarp by the described protocol is therefore expected to correspond to a haploid mycelium, as no disrupted spores were observed when checked under a microscope (data not shown).
Whole-genome shotgun sequencing and mapping
Each of the six geographical accession DNAs was sequenced in one lane of an Illumina Genome Analyzer (GAII) at the Beckman Genomics facilities (Brea, CA, USA). Sequencing produced approximately 1.1 Gb of 76-bp single-end reads per sample, and the sequencing depth ranged from 21- to 24-fold (Table 2). The raw reads can be accessed in the Sequence Read Archive at the National Center for Biotechnology Information (NCBI) under Accession No SRP044130.
Samples | Read number | Total mapped reads after filtering (a): number of reads | Total mapped reads after filtering (a): % | Reads mapping to multiple locations (b): number of reads | Reads mapping to multiple locations (b): % | Genome reference coverage (c): number of bp | Genome reference coverage (c): % | Number of genes (d) |
---|---|---|---|---|---|---|---|---|
Spain-1 | 39 275 496 | 28 744 980 | 73.19 | 2 376 408 | 6.05 | 113 139 168 | 91.5 | 9816 |
Spain-2 | 38 003 850 | 29 780 727 | 73.62 | 2 419 174 | 6.37 | 113 282 964 | 91.7 | 9808 |
France-Bur | 38 921 450 | 31 145 093 | 80.02 | 2 427 651 | 6.24 | 114 066 611 | 92.3 | 9800 |
France-Als | 34 575 334 | 26 833 541 | 77.61 | 2 073 232 | 6.00 | 114 998 792 | 93.1 | 9831 |
Italy | 39 184 077 | 29 780 081 | 76.00 | 2 323 320 | 5.93 | 114 004 497 | 92.3 | 9807 |
France-Alp | 38 597 308 | 30 651 292 | 79.41 | 2 474 195 | 6.41 | 113 860 556 | 92.2 | 9792 |
- a Excluding low-quality reads and reads mapping to multiple locations.
- b These reads were eliminated for the SNP identification.
- c Number of base pairs and the percentage of the reference genome mapped by reads. Excluding the Ns, the reference genome is composed of 123 535 220 bp.
- d Number of gene models mapped for each genome. A gene model was considered present if at least 60% of its sequence was covered by reads. The total number of gene models in the reference genome is 9952 (9765 of which were covered in all six resequenced genomes).
The raw reads for each genome were aligned to the France-Pro reference genome available at the Institut National de la Recherche Agronomique (INRA) Tuber genome database (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=fast) using the Burrows-Wheeler Aligner (BWA) software, version 0.7.3a (Li & Durbin 2009). The aln/samse algorithm was used with default parameters, except that the number of mismatches allowed between a read and the reference genome was set to two. As BWA assigns each read a Phred-scaled mapping quality (MAPQ) that takes the read quality into account, the raw reads were not quality filtered before mapping. To avoid low-quality mapping, only reads mapped with an MAPQ above 25 were considered for analysis with SAMtools (v. 0.1.18; Li et al. 2009). This stringent parameter eliminated the reads of low sequencing quality and those mapped at several genomic locations, thereby avoiding problems in the SNP calling due to the high proportion of repeated sequences such as transposable elements (TE) in the T. melanosporum genome (Martin et al. 2010). The genomic regions without mapped reads were assigned to the different genomic compartments (i.e. genes, TE and intergenic regions) defined by Martin et al. (2010). We considered genes with more than 60% of their sequences without mapped reads as missing.
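The mapping-quality filter described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (the filtering was done with SAMtools); it only assumes the standard SAM layout, in which column 2 is the bitwise flag (0x4 marks an unmapped read) and column 5 is the MAPQ. The function name and the example records are hypothetical.

```python
def filter_sam_by_mapq(sam_lines, mapq_threshold=25):
    """Yield SAM header lines and mapped alignments with MAPQ above the threshold."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            # Flag bit 0x4: the read is unmapped.
            continue
        if mapq > mapq_threshold:
            yield line


# Hypothetical records, invented for illustration only.
example_sam = [
    "@SQ\tSN:scaffold_1\tLN:1000",
    "r1\t0\tscaffold_1\t10\t37\t76M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tscaffold_1\t50\t0\t76M\t*\t0\t0\tACGT\tIIII",   # MAPQ 0, e.g. ambiguous placement
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",               # unmapped read
    "r4\t16\tscaffold_1\t90\t25\t76M\t*\t0\t0\tACGT\tIIII",  # MAPQ 25 is not above 25
]

kept_reads = [
    line.split("\t")[0]
    for line in filter_sam_by_mapq(example_sam)
    if not line.startswith("@")
]
print(kept_reads)
```

Reads that an aligner places equally well at several locations typically receive a MAPQ of 0, so a single threshold removes both poorly aligned and ambiguously placed reads.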
In this study, we produced an updated gene model repertoire of the T. melanosporum genome. Indeed, transcriptomic analyses have suggested that the 7496 high-confidence protein-coding genes supported by either sequence similarity, the occurrence of Pfam or KOG domains, or oligoarray expression data (Martin et al. 2010; available at http://mycor.nancy.inra.fr/IMGC/TuberGenome/index.html) omitted several expressed genes (Tisserant et al. 2011; A. Kohler, E. Tisserant and F. Martin, unpublished data). Moreover, we could not exclude the possibility that some gene families had been classified as repeated sequences and therefore excluded from this high-confidence protein-coding gene repertoire. To update the gene model repertoire, we began with the initial 12 826 putative gene models identified by GAZE and discarded those gene models that (i) overlapped with known TEs (Martin et al. 2010), (ii) had more than 40% unknown bases (N), (iii) had homology with Repbase (Jurka et al. 2005) or (iv) were <20 amino acids in length. The 1315 genes that were manually curated (Martin et al. 2010) served as a validation set. Twelve per cent of the T. melanosporum genome was covered by uncategorized repeated sequences (so-called no cat) lacking homology with known TE families; these could code for either T. melanosporum-specific TEs or proteins belonging to orphan multigenic families. Gene models overlapping these no cat sequences were retained in the new set of 9952 gene models (Table S1). The expression levels on NimbleGen microarrays and RNAseq for each of the genes in the repertoire were determined for the ectomycorrhizae, free-living mycelium and ascocarps (Martin et al. 2010; Tisserant et al. 2011). Each gene model was searched for homologues in the NCBI nr database (September 2013) and UniProt (UniProtKB/Swiss-Prot of September 2013) using blastp (v2.2.28+) with an e-value threshold of 10⁻⁵ (Altschul 1990).
Each gene was also analysed for Pfam motifs using the hmmscan command of the HMMER package (Eddy 2011).
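The four exclusion criteria used to update the gene model repertoire can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `GeneModel` record, its field names and the example entries are hypothetical, and the actual scripts used in the study are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    # Hypothetical record; field names are assumptions made for this sketch.
    name: str
    protein_length: int       # length in amino acids
    fraction_unknown: float   # fraction of unknown bases (N) in the model
    overlaps_known_te: bool   # overlaps a known transposable element
    repbase_homology: bool    # has homology with a Repbase entry


def keep_gene_model(model, max_fraction_unknown=0.40, min_protein_length=20):
    """Return True if the model survives criteria (i) to (iv) from the text."""
    if model.overlaps_known_te:                        # (i)
        return False
    if model.fraction_unknown > max_fraction_unknown:  # (ii)
        return False
    if model.repbase_homology:                         # (iii)
        return False
    if model.protein_length < min_protein_length:      # (iv)
        return False
    return True


# Invented examples, one per rejection reason plus one retained model.
candidates = [
    GeneModel("model_a", 350, 0.00, False, False),
    GeneModel("model_b", 420, 0.00, True, False),
    GeneModel("model_c", 210, 0.55, False, False),
    GeneModel("model_d", 180, 0.10, False, True),
    GeneModel("model_e", 15, 0.00, False, False),
]
retained = [m.name for m in candidates if keep_gene_model(m)]
print(retained)
```

Models overlapping uncategorized (no cat) repeats are deliberately not rejected here, in line with the text.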
SNP calling and localization in the genome
SNP calling was performed with two different methods: (i) BWA for the alignment and SAMtools (Li et al. 2009) for the SNP calling (referred to as the BWA/SAMtools method) and (ii) the CLC Genomics Workbench version 6.6 (http://www.clcbio.com) for both the alignment and calling (referred to as the CLC method).
For the BWA/SAMtools method, a pile-up file (i.e. a file describing the mapping results at each chromosomal position) was created with the SAMtools mpileup command using the BAM alignment output generated by BWA (see above). The SNP calls were filtered with the vcfutils script (available with SAMtools); to be validated, each SNP was required to be supported by at least ten reads, and the root mean square (RMS) of the mapping quality at the SNP position had to be ≥25.
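As a rough illustration of the two validation thresholds (at least ten supporting reads and an RMS mapping quality of at least 25), the sketch below filters VCF records by their INFO field. It assumes that depth and RMS mapping quality are reported under the `DP` and `MQ` keys, which are the reserved names in the VCF specification; it is not a re-implementation of the vcfutils script, and the records are invented.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=18;MQ=37' into a dictionary."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        else:
            info[item] = True  # flag without a value, e.g. INDEL
    return info


def passes_snp_filters(vcf_line, min_depth=10, min_rms_mapq=25):
    """Return True for single-base substitutions meeting both thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    info = parse_info(fields[7])
    if "INDEL" in info or len(ref) != 1 or alt not in ("A", "C", "G", "T"):
        return False
    depth = int(info.get("DP", 0))
    rms_mapq = float(info.get("MQ", 0))
    return depth >= min_depth and rms_mapq >= min_rms_mapq


# Hypothetical VCF records, for illustration only.
example_vcf = [
    "scaffold_1\t120\t.\tA\tG\t60\t.\tDP=18;MQ=37",
    "scaffold_1\t450\t.\tC\tT\t40\t.\tDP=7;MQ=37",          # too few reads
    "scaffold_2\t88\t.\tG\tA\t55\t.\tDP=22;MQ=19",          # low mapping quality
    "scaffold_2\t300\t.\tAT\tA\t50\t.\tINDEL;DP=30;MQ=40",  # not a SNP
]
decisions = [passes_snp_filters(line) for line in example_vcf]
print(decisions)
```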
For the CLC method, the reads were mapped using a global alignment with the length fraction set to 1, the similarity fraction set to 0.97 and nonspecific reads ignored. All other parameters were left at their default values (http://www.clcbio.com/files/usermanuals/CLC_Genomics_Workbench_User_Manual.pdf). The SNPs were called by the quality-based variant detection module, ignoring variants in nonspecific regions and using default parameters (i.e. minimum coverage of 10 reads). The maximum expected variation (ploidy) was set to 1, because the sequenced genomes are haploid (see above).
For both pipelines (BWA/SAMtools and CLC), we created a file for each sequenced genome with the SNPs localized on the reference genome assembly (France-Pro). The two sets of SNPs identified for each genome were compared, and only the SNPs called by both software methods were considered for further analyses.
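Retaining only the SNPs called by both pipelines amounts to a set intersection on variant coordinates. The following minimal sketch assumes each call is keyed by scaffold, position, reference base and alternative base; the key layout and the example calls are hypothetical.

```python
def consensus_snps(calls_a, calls_b):
    """Return the SNPs present in both call sets."""
    return set(calls_a) & set(calls_b)


# Invented calls keyed as (scaffold, position, reference base, alternative base).
bwa_samtools_calls = {
    ("scaffold_1", 120, "A", "G"),
    ("scaffold_1", 450, "C", "T"),
    ("scaffold_2", 88, "G", "A"),
}
clc_calls = {
    ("scaffold_1", 120, "A", "G"),
    ("scaffold_2", 88, "G", "A"),
    ("scaffold_3", 5, "T", "C"),
}

consensus = consensus_snps(bwa_samtools_calls, clc_calls)
# Share of the first call set that is confirmed by the second one.
fraction_confirmed = len(consensus) / len(bwa_samtools_calls)
print(sorted(consensus), fraction_confirmed)
```

Including the alternative base in the key means that two callers reporting different alleles at the same position are not counted as agreeing, which is one possible design choice among several.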
All of the SNPs identified by aligning the six geographical accessions against the reference genome were compiled to generate a GFF-formatted file available in DRYAD (doi:10.5061/dryad.9gk52). The SNPs were localized according to the new protein-coding gene catalogue defined in this study (see above) and the repeated sequence library defined in Martin et al. (2010) using Python scripts available at the INRA Tuber genome portal (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=anno).
Polymorphism indices and detecting selection pressure
The level of polymorphism between genomes was assessed by calculating the π index (Nei & Li 1979) in a sliding window of 10 kb throughout the entire reference genome using EggLib version 2.1.6 (De Mita & Siol 2012). The π index corresponds to the average number of nucleotide differences per site between two DNA sequences from the sample population. In the same 10-kb sliding windows, we also computed Tajima's D (Tajima 1989) and Watterson's theta (Watterson 1975). The ratios of nonsynonymous mutations per nonsynonymous site (pN) and synonymous mutations per synonymous site (pS) in gene models were calculated to assess the mutations for deviation from neutral evolution. Positive Tajima's D values are typically attributed to diversifying, balancing or positive selection, whereas negative values are generally attributed to positive or purifying selection (Weedall & Conway 2010); under neutrality, approximately 95% of the values are expected to fall between −2 and +2 (Tajima 1989; Carlson et al. 2005). Therefore, in this study, we considered values greater than +2 or less than −2 as significant. The Tajima's D value was calculated (i) for all of the gene models and (ii) in a sliding window of 10 kb throughout the whole genome (including the coding and noncoding regions). Only gene models with at least five SNPs in their coding regions were considered for the pN/pS ratio calculation. A comparison between the two indices allowed us to identify candidate gene models subject to positive selection if pN/pS was >1 and purifying selection if pN/pS was <1.
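For readers unfamiliar with these statistics, the sketch below computes π, Watterson's theta and Tajima's D for one window of aligned haploid sequences, following the standard definitions of Nei & Li (1979), Watterson (1975) and Tajima (1989). It is a didactic re-implementation, not the EggLib code used in the study; the function name and the toy alignments are assumptions.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(sequences):
    """Return pi and Watterson's theta (both per site) and Tajima's D."""
    n = len(sequences)
    length = len(sequences[0])
    # Number of segregating sites S: alignment columns with more than one state.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # Mean number of pairwise differences k.
    pairs = list(combinations(sequences, 2))
    k = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = seg_sites / a1
    tajima_d = None  # undefined when there is no segregating site
    if seg_sites > 0:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
        tajima_d = (k - theta_w) / sqrt(variance)
    return {"pi": k / length, "theta_w": theta_w / length,
            "tajima_d": tajima_d, "segregating_sites": seg_sites}


# Toy alignments: only rare variants versus only intermediate-frequency variants.
rare_variants = ["TAAA", "ATAA", "AATA", "AAAT"]
balanced_variants = ["AA", "AA", "TT", "TT"]
print(diversity_stats(rare_variants)["tajima_d"] < 0)
print(diversity_stats(balanced_variants)["tajima_d"] > 0)
```

In a genome scan, the same function would be applied to each 10-kb window in turn; an excess of rare variants pulls D below zero, whereas an excess of intermediate-frequency variants pushes it above zero.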
Phylogenetic reconstruction and divergence time among geographical accessions
To construct a phylogeny of the seven T. melanosporum samples, the 60 507 SNPs present in the intergenic regions free of selective pressure (excluding repeated sequences and genomic regions with a Tajima's D index above 2) were selected. A maximum-likelihood phylogenetic tree was built using the default parameters of PhyML (Guindon et al. 2010) with 100 bootstrap replicates. To investigate the minimal number of SNPs suitable for a population genetic analysis, subsets of 10, 100, 1000, 5000, 10 000, 15 000, 20 000, 25 000, 30 000, 35 000, 40 000 and 50 000 SNPs were randomly selected 100 times among the 60 507 SNPs free of selection, and 100 maximum-likelihood phylogenetic trees were built using the PhyML default parameters. The Robinson–Foulds distance (Robinson & Foulds 1981) was used to measure the distance between each generated phylogenetic tree and the reference tree generated for the whole set of 60 507 SNPs using RF.dist in the phangorn R library (Schliep 2011). For each subset of SNPs, the number of trees identical to the reference tree was calculated as the number with a Robinson–Foulds distance equal to 0, meaning that the two trees are identical.
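The Robinson–Foulds comparison can be illustrated without any phylogenetics library. The sketch below represents trees as nested tuples of leaf names, extracts the non-trivial bipartitions of the unrooted tree and counts those found in only one of the two trees. It mirrors the unnormalized distance but is not the phangorn implementation, and the topologies shown are invented, not results of this study.

```python
def collect_clades(tree):
    """Return (leaf set of this subtree, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = collect_clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def bipartitions(tree):
    """Return the non-trivial splits of the leaf set induced by internal branches."""
    all_leaves, clades = collect_clades(tree)
    splits = set()
    for side in clades:
        other = all_leaves - side
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset([side, other]))
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Invented topologies using accession codes as leaf labels.
reference_tree = ((("Spain-1", "Spain-2"), "Italy"), ("France-Bur", "France-Als"))
rerooted_tree = (("Spain-1", "Spain-2"), ("Italy", ("France-Bur", "France-Als")))
different_tree = ((("Spain-1", "Italy"), "Spain-2"), ("France-Bur", "France-Als"))

replicates = [rerooted_tree, different_tree, reference_tree]
identical = sum(robinson_foulds(t, reference_tree) == 0 for t in replicates)
print(robinson_foulds(reference_tree, different_tree), identical)
```

Because splits are stored as unordered pairs of leaf sets, two trees that differ only in where the root is drawn have a distance of 0, which is the behaviour needed when counting replicate trees identical to the reference.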
The divergence time estimates were performed with the 60 507 SNPs present in the intergenic regions free of selective pressure as described by Wicker et al. (2013). For the calculation, we assumed that all of the SNPs were present in regions that had accumulated mutations at the same rate, and we used a rate of 1.3 E−8 (± 2.29 E−9) substitutions per site per year as originally proposed by Ma & Bennetzen (2004) for rice. This choice was justified by the common use of this mutation rate for fungi, as was done by Wicker et al. (2013) to estimate the divergence among isolates of B. graminis, another ascomycete fungus. It is also in the range of mutation rates (0.09 E−8 to 1.67 E−8 substitutions per site per year) proposed by Kasuga et al. (2002) for fungi. The number of SNPs in the genomic regions free of selection pressure was used to estimate the time since the most recent common ancestor (MRCA) for all geographical accessions. Using all of the SNPs in the genomic regions free of selective pressure, a Bayesian phylogenetic analysis was conducted with 5 000 000 generations, sampling every 1000 generations, and a burn-in value of 1250 with BEAST version 1.7.5 (Drummond & Rambaut 2007) using the Hasegawa–Kishino–Yano (HKY) DNA substitution model. The estimated age of the MRCA for the tree was used to estimate the different node ages using a relaxed clock model with an uncorrelated exponential prior distribution. The exponential relaxed clock model was chosen because it had been used previously for Tuberaceae by Bonito et al. (2013).
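The exact formula applied following Wicker et al. (2013) is not spelled out in the text. A common formulation, assumed here, divides the per-site divergence between two genomes by twice the substitution rate, because both lineages accumulate mutations after the split. The sketch below applies it with the rate quoted above; the SNP count and the number of compared sites are invented for illustration.

```python
def divergence_time_years(snp_count, sites_compared, rate_per_site_per_year=1.3e-8):
    """Estimate the time since two genomes diverged, assuming T = d / (2 * rate)."""
    divergence_per_site = snp_count / sites_compared
    return divergence_per_site / (2 * rate_per_site_per_year)


# Hypothetical input: 1000 SNPs between two genomes over 10 Mb of compared sites.
central = divergence_time_years(1000, 10_000_000)
# Bounds obtained from the quoted uncertainty on the rate (+/- 2.29e-9).
oldest = divergence_time_years(1000, 10_000_000, 1.3e-8 - 2.29e-9)
youngest = divergence_time_years(1000, 10_000_000, 1.3e-8 + 2.29e-9)
print(round(central), round(youngest), round(oldest))
```

A lower assumed rate gives an older estimate, so the uncertainty on the rate translates directly into an interval around the central value.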
Results and Discussion
Geographical accessions: resequencing and read mappings
Among the approximately 34–39 million reads generated from each genome, 73–80% mapped to a unique position against the reference genome using BWA (Table 2). The raw-read data were deposited in the NCBI sequence reads archive under Accession No SRP044130. The average sequencing quality was good (quality score ≥20), as illustrated in Fig. S2, and the low-quality reads were discarded from the mapping for future analyses. The read coverage throughout the genome was continuous with an average depth of approximately 20× and did not reveal any important differences between the protein-coding and repeated sequences (Fig. S3). This likely results from the stringent parameters used for the read mapping and the post-processing step in which the reads mapped at several genomic locations were eliminated. Indeed, when multiple mapping was possible, it was shown that an increased density of mapped reads correlates with the location of repeated sequences in Pyrenophora tritici-repentis (Manning et al. 2013).
Between 91% (Spain-1) and 93% (France-Als) of the France-Pro (Mel28) reference genome was covered by reads (Table 2), and the T. melanosporum core genome was estimated at approximately 110 Mbp. The small proportion of the reference genome not covered by reads corresponds in large part to repeated sequences (approximately 80%; Fig. S4). TEs, primarily gypsy retrotransposons, were over-represented in the genomic regions not covered (Fig. S4). A total of 187 of the 9952 protein-coding gene models identified in the reference genome were found in the uncovered genomic regions (Table 2), with a maximum of 160 genes found for the France-Alp geographical accession. Among these 187 genes, microarray and RNAseq expression data showed that 73 and 112, respectively, were expressed in at least one tissue (Table S1; Martin et al. 2010; Tisserant et al. 2011). The regions that were not covered by mapped reads may correspond to (i) regions absent in the resequenced genomes; (ii) regions present in the resequenced genomes, but highly polymorphic, thus preventing proper read mapping; or (iii) regions rich in repeated sequences that prevent nonambiguous mapping. Unfortunately, the sequencing strategy utilized (i.e. single-end sequencing) was not suitable to address these genomic regions in more detail, because the de novo assembly of new resequenced genomes was not possible.
The truffle is a heterothallic species harbouring one of two mating type idiomorphs (i.e. MAT1-1 or MAT1-2) in its haploid genome (Martin et al. 2010), and the MAT1-2 idiomorph is present in Scaffold 247 of the France-Pro reference genome (Rubini et al. 2011). The resequenced genomes from the geographical accessions Spain-1, France-Als and France-Alp contained reads matching the MAT1-2 idiomorph in the France-Pro reference genome (Fig. S3B), while the genomes of the geographical accessions Spain-2, France-Bur and Italy lack these sequences, suggesting they harbour the MAT1-1 idiomorph. This was confirmed by mapping the Illumina reads against the known sequence of the MAT1-1 idiomorph (data not shown). These results confirmed that either of the two mating type idiomorphs is present in the T. melanosporum haploid genome (Rubini et al. 2011).
SNP identification
The SNP calling performed with the BWA/SAMtools and CLC Genomics Workbench programs produced similar results (Fig. 1), with 93% (442 326 SNPs corresponding to 3540 SNPs/Mbp) of those called by BWA/SAMtools also being called by CLC. As proposed by Zhan et al. (2011), only SNPs called by both methods were retained. The GFF file with this SNP resource can be downloaded from DRYAD (doi:10.5061/dryad.9gk52) and from our institution's website (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=anno).

A comparison of each of the T. melanosporum resequenced genomes to the France-Pro reference genome identified between 108 112 and 198 788 SNPs, with the density ranging from 865 to 1591 SNPs/Mbp for the geographical accessions France-Als and Spain-2, respectively (Table 3). According to Fumagalli et al. (2013), SNPs identified using high-throughput sequencing technologies should be considered with caution, particularly when the sequencing depth is low (<10×). Here, the SNPs retained for analyses were identified using two different programs; together with the stringent mapping that limited the multiple mapping of reads and the 10× read depth required for SNP identification, this likely limited the number of spurious SNPs (Li et al. 2009; Fumagalli et al. 2013). However, additional whole-genome sequencing or the targeted sequencing of SNP-rich regions will be conducted to experimentally confirm the existence of these in silico SNPs.
Samples | Introns: number | Introns: SNP/Mbp | Exons: number | Exons: SNP/Mbp | Untranslated regions (UTRs): number | Untranslated regions (UTRs): SNP/Mbp | Repeated sequences (a): number | Repeated sequences (a): SNP/Mbp | Other genomic regions: number | Other genomic regions: SNP/Mbp | Total: number | Total: SNP/Mbp |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Spain-1 | 2727 | 354 | 1816 | 155 | 102 | 179 | 139 824 | 1961 | 22 550 | 670 | 167 019 | 1337 |
Spain-2 | 3144 | 409 | 2050 | 175 | 119 | 209 | 165 981 | 2327 | 27 494 | 817 | 198 788 | 1591 |
France-Bur | 2569 | 334 | 1529 | 131 | 84 | 148 | 110 806 | 1554 | 18 295 | 544 | 133 283 | 1067 |
France-Als | 1948 | 253 | 1282 | 110 | 69 | 121 | 89 413 | 1254 | 15 400 | 458 | 108 112 | 865 |
Italy | 2288 | 297 | 1478 | 126 | 91 | 160 | 106 570 | 1494 | 16 570 | 492 | 126 997 | 1016 |
France-Alp | 1952 | 254 | 1332 | 114 | 77 | 135 | 98 933 | 1387 | 15 363 | 456 | 117 657 | 942 |
Totalb | 6795 | 883 | 4501 | 385 | 252 | 443 | 374 268 | 5248 | 56 510 | 1679 | 442 326 | 3540 |
- a Repeated sequences comprising known transposable elements and uncategorized elements.
- b The total number of SNPs excluding redundancy.
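The compartment shares quoted in the text can be re-derived from the Total row of the table above. The short Python check below is only a worked arithmetic example; the dictionary keys are labels chosen for this sketch.

```python
# Non-redundant SNP counts per genomic compartment (Total row of the table).
total_row = {
    "introns": 6795,
    "exons": 4501,
    "utrs": 252,
    "repeated_sequences": 374268,
    "other_regions": 56510,
}

total_snps = sum(total_row.values())
share = {name: 100 * count / total_snps for name, count in total_row.items()}
# SNPs located in protein-coding genes: introns + exons + UTRs.
genic_share = share["introns"] + share["exons"] + share["utrs"]

print(total_snps)                              # 442326
print(round(share["repeated_sequences"], 1))   # 84.6
print(round(genic_share, 1))                   # 2.6
print(round(share["exons"], 1))                # 1.0
```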
The SNP density reported in filamentous fungi varies from 291 to 14 005 SNPs/Mbp for the Fusarium graminearum (Cuomo et al. 2007) and Rhizoctonia solani (Hane et al. 2014) geographical accessions, respectively (Table 4). The differences in the nucleotide polymorphism levels observed could reflect differences in the demographic history of these species (e.g. reduction of polymorphism due to population bottlenecks), evolutionary trends related to the lifestyles of these fungi (e.g. for pathogenic species) as well as their respective ratios between sexual and asexual reproduction. T. melanosporum, with 3540 SNPs/Mbp, has a genetic diversity level in the middle of the range reported for filamentous fungi (Table 4), although we are aware that the parameters used to call the SNPs and the number of samples differed between the studies.
Species | Phylum | Number of sequenced strains | SNPs/Mbp | Reference |
---|---|---|---|---|
Blumeria graminis | Ascomycota | 2 | 1000 | Hacquard et al. (2013) |
Coccidioides immitis | Ascomycota | 10 | 5251 | Neafsey et al. (2010) |
Coccidioides posadasii | Ascomycota | 10 | 9227 | Neafsey et al. (2010) |
Fusarium graminearum | Ascomycota | 2 | 291 | Cuomo et al. (2007) |
Leptographium longiclavatum | Ascomycota | 71 | 975 | Ojeda et al. (2014) |
Neurospora crassa | Ascomycota | 48 | 3375 | Ellison et al. (2011) |
Tuber melanosporum | Ascomycota | 7 | 3540 | This study |
Lentinula edodes | Basidiomycota | 2 | 4629 | Au et al. (2013) |
Melampsora larici-populina | Basidiomycota | 15 | 6051 | Persoons et al. (2014) |
Puccinia graminis | Basidiomycota | 1a | 1843 | Duplessis et al. (2011) |
Puccinia striiformis | Basidiomycota | 1a | 5980 | Cantu et al. (2013) |
Rhizoctonia solani | Basidiomycota | 2 | 14005 | Hane et al. (2014) |
Rhizophagus irregularis | Glomeromycota | 6 | 321 | Lin et al. (2014) |
- a For these species, the SNPs were identified in one dikaryotic strain.
SNPs are not distributed equally in the genome
The polymorphism index (π; Nei & Li 1979) was calculated along the genome in 10-kb sliding windows, which showed some genomic regions to be more polymorphic than others (Fig. 2). Indeed, the SNPs were not distributed equally in the genome, and as expected, they occurred more frequently in repeated sequences than in protein-coding genes (Table 3; Figs. 2 and S5). Most of the SNPs were found in repeated sequences (84.6%) that represented 57.7% of the T. melanosporum genome (Martin et al. 2010). The SNPs were more frequent in gypsy retrotransposons than in DNA transposons (Fig. S5). A bias in the SNP distribution was also observed in the F. graminearum genome, where 50% of the SNPs were present in 13% of the genome (Cuomo et al. 2007). Large blocks of regions rich in SNPs were also identified in the B. graminis genome (Hacquard et al. 2013; Wicker et al. 2013) and in the poplar leaf rust Melampsora larici-populina, in which a large portion of the variants were identified in coding sequences (Persoons et al. 2014).

Several mechanisms are known to inactivate transposons in filamentous fungi (Murat et al. 2013a), and some such as repeat-induced point mutations (RIPs) introduce mutations in these sequences (Selker et al. 1987). While genes involved in RIPs were not identified in the T. melanosporum genome (Martin et al. 2010), a strong preference for transitions in the CpG dinucleotide was observed by Clutterbuck (2011). Recently, Montanini et al. (2014) found that the methylation pattern in T. melanosporum selectively targets TEs rather than genes, and their results strongly favour methylation induced premeiotically (MIP) as the process responsible for TE silencing in T. melanosporum. Interestingly, MIP can increase the mutation rate of the methylated cytosines, as documented for mammalian DNA (Kricker et al. 1992). The SNPs were more frequently found in gypsy retrotransposons. These elements colonized the T. melanosporum genome several million years ago (Martin et al. 2010), and their SNP richness can be explained by their old age, as SNPs in these regions tend to accumulate due to DNA decay (Lisch & Bennetzen 2011). The mapping of reads in multiple locations was low (approximately 6%; Table 2), although almost 60% of the T. melanosporum genome corresponds to repeated sequences, suggesting that the different TE copies are not conserved.
SNPs in gene models
A total of 903 protein-coding genes presented with more than two SNPs in their untranslated regions (UTRs), introns and/or exons. Among these, 742 had SNPs in their coding regions, including 584 nonsynonymous mutations (Tables S1 and S2). The 20 gene models with the highest number of SNPs (>10) in their coding regions are shown in Table S3. Most have sequence similarities in the DNA databases, and four are paralogues coding for the same HET-E-1 protein. Five genes were not expressed in any tissue, although in comparison with the free-living mycelium, four were upregulated in fruiting bodies (coding for an alpha-/beta-glucosidase, an alpha-glucosidase 2, a methylene-tetrahydrofolate reductase 2 and an ankyrin repeat-containing protein), as were three in the ectomycorrhizal root tips (coding for a vegetative incompatibility protein HET-E-1, an alpha/beta-glucosidase and an alpha-glucosidase 2). The same putative alpha-/beta-glucosidase and alpha-glucosidase 2 were upregulated in both tissues (Table S3). For the genes with SNPs, no enrichment of specific metabolic pathways was detected (data not shown).
When compared with the gene contents of the 30 other ectomycorrhizal genomes sequenced under the framework of the Mycorrhizal Genome Initiative (Martin et al. 2011), T. melanosporum has a restricted gene content (9952 protein-coding genes). Indeed, the number of gene models per species ranged from 30 282 for Rhizophagus irregularis down to 9952 for T. melanosporum (http://genome.jgi-psf.org/Mycorrhizal_fungi/Mycorrhizal_fungi.info.html?core=genome&query=%22groups:Mycorrhizal_fungi%22&searchType=Keyword). Thus, T. melanosporum genes may have experienced purifying selection at a higher rate in comparison with the species with larger gene repertoires and a higher number of gene families (i.e. functional redundancy). In T. melanosporum, 2.6% of the SNPs were found in protein-coding genes and 1% in coding sequences, which was less than found in B. graminis and the Coccidioides spp. Wicker et al. (2013) found between 3.7% and 3.9% of the SNPs in the coding regions for B. graminis, and Neafsey et al. (2010) identified 33–36% of the SNPs in genes (UTRs, introns and exons) when they compared the genomes of C. immitis and C. posadasii. This suggests that the limited gene repertoire of T. melanosporum is associated with higher functional constraints and consequently presents a lower rate of genetic variation.
Detecting selection pressure
Two approaches were used to identify selection signatures. First, the rates of nonsynonymous mutations per nonsynonymous site (pN) and synonymous mutations per synonymous site (pS) were calculated for the 119 genes with five or more SNPs. Of those, 18 genes had a pN/pS ratio >1 and 9 had only nonsynonymous mutations (20 expressed in at least one tissue), suggesting they were under positive selection. On the other hand, 78 genes had a pN/pS ratio <1 (68 expressed in at least one tissue), suggesting they were under purifying selection (Tables S4 and S5). The second approach relies on the Tajima's D statistic computed on either the gene models or using a whole-genome sliding window. Here, we considered values >+2 and lower than −2 as significant. When calculated for the gene models, only four had a Tajima's D value >+2, suggesting the existence of a signature for balancing or positive selection (Table S1). These genes, coding for the vegetative incompatibility protein HET-E-1, an NADH-ubiquinone oxidoreductase, a vacuolar protein and a protein kinase, were expressed in the different tissues (Table S1). When the Tajima's D statistic was calculated by scanning the whole genome along a sliding window, 36 genomic regions with a Tajima's D value >+2 were identified (Fig. 2). Thirty-one gene models were present in these 36 genomic regions (Table S6), including the previously identified vegetative incompatibility protein HET-E-1.
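As an illustration of the first approach, the following minimal Python sketch classifies a gene from its pN and pS values. The function name, its arguments and the toy counts are assumptions made for illustration only; the numbers of synonymous and nonsynonymous sites are taken as precomputed inputs, and this is not the script used in the study.

```python
def classify_pn_ps(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Return (pN, pS, label) for one gene.

    n_nonsyn, n_syn: observed nonsynonymous / synonymous SNP counts.
    nonsyn_sites, syn_sites: numbers of nonsynonymous / synonymous sites
    (hypothetical inputs, assumed to have been counted beforehand).
    """
    pn = n_nonsyn / nonsyn_sites
    ps = n_syn / syn_sites
    if n_syn == 0:
        # The ratio is undefined when no synonymous SNP is observed.
        label = "only nonsynonymous" if n_nonsyn > 0 else "no SNPs"
    elif pn / ps > 1:
        label = "candidate positive selection"
    elif pn / ps < 1:
        label = "candidate purifying selection"
    else:
        label = "ratio equal to 1"
    return pn, ps, label


# Toy counts, not taken from the study.
gene_a = classify_pn_ps(6, 1, 900, 300)
gene_b = classify_pn_ps(2, 4, 900, 300)
gene_c = classify_pn_ps(5, 0, 900, 300)
```

Genes with no synonymous SNP are reported separately because their ratio cannot be computed, which mirrors the way the genes with only nonsynonymous mutations are counted apart in the text.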
The positive Tajima's D values can result not only from balancing selection, but also from population structure and moderately intense bottlenecks (i.e. a reduction in the size of the population; Biswas & Akey 2006). In addition to significant population structure effects (Murat et al. 2004; Riccioni et al. 2008; García-Cunchillos et al. 2014), a population bottleneck due to the last glaciation has also been proposed for T. melanosporum (Bertault et al. 1998). The six T. melanosporum geographical accessions were harvested from different populations, as demonstrated by the phylogeographical analysis (see below). Therefore, we cannot exclude that the high positive Tajima's D values observed resulted from a population bottleneck and/or population structure rather than from balancing selection. These results are preliminary and need to be confirmed by sequencing a larger number of genomes, but they open the way for future investigations of truffle adaptation to environmental stresses.
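For readers unfamiliar with the statistic, Tajima's D can be sketched in a few lines of Python using the standard formula of Tajima (1989). The sketch below assumes that each accession is represented by one haploid sequence of equal length; the function name and the toy alignments are hypothetical, and this is not the code used in the study.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length haploid sequences.

    Returns None when the statistic is undefined (fewer than four
    sequences, or no segregating site).
    """
    n = len(haplotypes)
    length = len(haplotypes[0])
    # Number of segregating sites (columns with more than one allele).
    seg_sites = sum(
        1 for i in range(length) if len({h[i] for h in haplotypes}) > 1
    )
    if n < 4 or seg_sites == 0:
        return None
    # Mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments of seven sequences (not real data).
intermediate = ["AAAA"] * 3 + ["TTTT"] * 4          # alleles at mid frequency
singletons = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA", "AAAA", "AAAA"]
d_intermediate = tajimas_d(intermediate)
d_singletons = tajimas_d(singletons)
```

Alleles at intermediate frequency give a positive value and an excess of rare alleles gives a negative one, which is why population structure or a bottleneck can mimic the signal of balancing selection.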
Phylogeography and divergence time among geographical accessions
The genomic regions putatively free of selection covered 36.6 Mbp and contained a total of 60 507 SNPs. This set of SNPs can be downloaded from DRYAD (doi:10.5061/dryad.9gk52) and from our institution's website at the following link (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=anno).
The unrooted maximum-likelihood phylogenetic tree clustered together samples according to their geographical origin with a cluster comprised of the northern France samples (France-Als and France-Bur), a cluster grouping samples from south-eastern France (France-Alp and France-Pro) and Italy, and another cluster with the Spanish samples (Fig. S1B). This ability to identify the geographical origin of truffles harvested in natural populations using SNPs is currently being used to design diagnostic SNP arrays for geographical certification. As highlighted by Davey et al. (2011), genotyping SNPs across targeted populations is now facilitated by the advent of high-throughput SNP arrays. Indeed, depending on the sample size and the number of SNPs to be analysed, medium- to high-throughput technologies are available such as the competitive allele-specific PCR (KASPar) assay from KBiosciences (Hertfordshire, UK; http://www.kbioscience.co.uk) or the Affymetrix Axiom SNP microarrays. The KASPar assay is commonly used for genotyping up to 1000–2000 SNPs, while the Axiom SNP microarrays allow genotyping from 1500 to several million SNPs. As the financial investment for genotyping with KASPar can be three times less expensive than for the Axiom SNP microarray (Charles Poncet, INRA Gentyane Plateform, personal communication), the minimum number of SNPs required for a population genetic analysis was investigated. We found that a minimum of 30 000 SNPs is required to generate all of the maximum-likelihood trees identical to the reference tree produced with the 60 507 SNPs free of selection (Fig. 3). We are thus developing an array based on the 60 507 SNPs for analysing the population genetic structure throughout the natural regions of T. melanosporum production.

Using the mutation rate of 1.3 E−8 (± 2.29 E−9) substitutions per site per year (Ma & Bennetzen 2004; Wicker et al. 2013), we estimated that the 60 507 mutations had accumulated between 107 703 and 154 763 years ago (131 128 ± 23 098 years). These times were used to set the estimated time of the MRCA for the Bayesian phylogenetic reconstruction generated with the 60 507 SNPs free of selection and a relaxed molecular clock. This Bayesian reconstruction clustered the French and the Italian samples together, while the Spanish samples separated earlier (Fig. 4). The Bayesian and maximum-likelihood phylogenies exhibit a single difference in their topology: the France-Pro, France-Alp and Italian samples form a monophyletic cluster in the maximum-likelihood phylogeny, but are paraphyletic in the Bayesian phylogeny (Fig. S6). This could be explained not only by the different methods used in the analyses (Bayesian versus maximum-likelihood), but also by the fact that one phylogeny is time-dependent (relaxed molecular clock in Bayesian), while the other is time-independent. That time-dependent and time-independent phylogenies are not always in agreement has been previously discussed (Drummond et al. 2006). The phylogenetic signal is also likely to be inconsistent across the genome due to the historical proximity of the samples, which increases the chances of finding different trees with different methods. For outbreeding species, such as T. melanosporum, the phylogenetic signal obtained with SNPs could be weakened by population genetic processes such as recombination and gene flow. However, both topologies are consistent with the geography: in the Bayesian phylogeny, the French south-eastern samples appear as intermediates between the northern French and the Italian and Spanish samples, while they appear as a separate group in the maximum-likelihood topology (Fig. S6). A further characterization of the overall structure of the T. melanosporum populations could be performed using population genetics methods, but these require a larger number of samples to be powerful.
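The order of magnitude of the divergence-time estimate given at the start of this discussion can be checked with a back-of-the-envelope calculation. The Python sketch below assumes the simplest possible model, in which the time equals the number of SNPs per site divided by the mutation rate, and uses the rounded region size of 36.6 Mbp reported earlier; both are assumptions made here, so the result is expected to be close to, but not identical with, the published values.

```python
def divergence_time(n_snps, n_sites, rate):
    """Years needed to accumulate n_snps over n_sites at the given
    rate (substitutions per site per year); simplified model."""
    return (n_snps / n_sites) / rate


N_SNPS = 60_507      # SNPs in regions putatively free of selection
N_SITES = 36.6e6     # rounded size of these regions (36.6 Mbp)
RATE = 1.3e-8        # substitutions per site per year
RATE_SD = 2.29e-9    # uncertainty on the rate

central = divergence_time(N_SNPS, N_SITES, RATE)
# A faster rate gives a younger estimate, a slower rate an older one.
youngest = divergence_time(N_SNPS, N_SITES, RATE + RATE_SD)
oldest = divergence_time(N_SNPS, N_SITES, RATE - RATE_SD)

print(f"{youngest:,.0f} to {oldest:,.0f} years (central {central:,.0f})")
```

Under these assumptions the interval falls in the range of roughly 10^5 years, in line with the estimate reported in the text.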

The Bayesian reconstruction suggested that the Spanish samples separated earlier than the French and Italian samples (Fig. 4). Time calibrations and date estimations should be considered with caution, especially in studies without fossil data and with incomplete taxon sampling. Thus, we are aware that the dates we obtained cannot be used as absolute dates, but only as relative estimates. However, it is highly probable that the MRCA predates the last glaciation (120 000 to 11 000 years ago; Van Andel & Tzedakis 1996). While our sampling alone is not sufficient to definitively describe the history of T. melanosporum following the last glaciation, this preliminary analysis paves the way for future analyses using the current SNP resource, as we have proposed.
Conclusions
Today, T. melanosporum is primarily harvested in truffle orchards in France using tree seedlings that have been inoculated with truffles in greenhouses (Olivier et al. 2012). Up to now, the selection of geographically defined sources of truffle inoculum has not been considered for plantations. Interestingly, the Aquitaine regional truffle growers' federation has initiated the production of inoculated plants using truffles sampled from specific natural populations that appear to be better adapted to drought or frost (P. Rejou, personal communication). To date, the selection of these truffle populations has been empirical and based only on field observations; this approach could now be validated by genotyping the truffle inocula. Moreover, it is now recognized that truffle aroma has, at least in part, a genetic origin (Martin et al. 2010). Investigating putative associations between SNP markers and phenotypic traits (such as a particular aroma or stress tolerance) can now be contemplated thanks to the current SNP resource, which is the first step towards marker-assisted selection of the fungal inocula used by truffle growers.
Acknowledgements
The UMR1136 is supported by a grant overseen by the French National Research Agency (ANR) as part of the Investments for the Future Programme (ANR-11-LABX-0002-01, Lab of Excellence ARBRE). This study benefited from ANR SYSTERRA SYSTRUF (ANR-09-STRA-10). Thibaut Payen's PhD scholarship is cofunded by the Lorraine Region and the European Commission through the EcoFINDERS project (FP7-264465). We would like to thank Francesco Paolocci, Bernard Vonfli, Mario Honrubia, Luc Bernard and Henri Frochot for providing the samples analysed in this study. We also would like to thank Sébastien Duplessis and François Le Tacon for their constructive advice and helpful discussions. Finally, the authors would like to thank the team of American Manuscript Editors for the language and style editing of the manuscript.
Author contributions
F.M. and C.M. designed the project. C.M. extracted the DNA, and C.M., T.P., A.G. and E.M. contributed to the bioinformatics analyses. T.P. and S.D.M. performed the selection analyses. C.M., T.P., S.D.M. and F.M. wrote the manuscript.
Data accessibility
Data sequences: the raw sequence data generated in this study were deposited in the NCBI short reads archive under Accession No SRP044130.
SNP data: the GFF file with all 442 326 SNPs and the 60 507 SNPs free of selection can be downloaded from DRYAD (doi:10.5061/dryad.9gk52) and from our institution's website at the following link (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=anno).
Phylogenetic trees: the nexus and xml files used as input in BEAST as well as the newick files corresponding to the Bayesian and the maximum-likelihood phylogenetic reconstruction are available in DRYAD (doi:10.5061/dryad.9gk52).
Bioinformatic scripts: the Python scripts used in this study are available at the INRA Tuber genome portal using the following link (http://mycor.nancy.inra.fr/IMGC/TuberGenome/download.php?select=anno).