Volume 21, Issue 3 pp. 880-896
RESOURCE ARTICLE
Open Access

Development of a highly efficient 50K single nucleotide polymorphism genotyping array for the large and complex genome of Norway spruce (Picea abies L. Karst) by whole genome resequencing and its transferability to other spruce species

Carolina Bernhardsson

Department of Ecology and Environmental Science, Umeå University, Umeå, Sweden

Department of Organismal Biology, Uppsala University, Uppsala, Sweden

Yanjun Zan

Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Science, Umeå, Sweden

Zhiqiang Chen

Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Science, Umeå, Sweden

Pär K. Ingvarsson

Linnean Centre for Plant Biology, Department of Plant Biology, Uppsala BioCenter, Swedish University of Agricultural Science, Uppsala, Sweden

Harry X. Wu

Corresponding Author

Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Science, Umeå, Sweden

Beijing Advanced Innovation Centre for Tree Breeding by Molecular Design, Beijing Forestry University, Beijing, China

Black Mountain Laboratory, CSIRO National Research Collection Australia, Canberra, ACT, Australia

Correspondence

Harry X. Wu, Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Science, Umeå, Sweden.

Email: [email protected]

First published: 11 November 2020
Citations: 26
Carolina Bernhardsson and Yanjun Zan contributed equally.

Abstract

Norway spruce (Picea abies L. Karst) is one of the most important forest tree species with significant economic and ecological impact in Europe. For decades, genomic and genetic studies on Norway spruce have been challenging due to the large and repetitive genome (19.6 Gb with more than 70% being repetitive). To accelerate genomic studies, including population genetics, genome-wide association studies (GWAS) and genomic selection (GS), in Norway spruce and related species, we here report on the design and performance of a 50K single nucleotide polymorphism (SNP) genotyping array for Norway spruce. The array is developed based on whole genome resequencing (WGS), making it the first WGS-based SNP array in any conifer species so far. After identifying SNPs using genome resequencing data from 29 trees collected in northern Europe, we adopted a two-step approach to design the array. First, we built a 450K screening array and used this to genotype a population of 480 trees sampled from both natural and breeding populations across the Norway spruce distribution range. These samples were then used to select high-confidence probes that were put on the final 50K array. The SNPs selected are distributed over 45,552 scaffolds from the P. abies version 1.0 genome assembly and target 19,954 unique gene models with an even coverage of the 12 linkage groups in Norway spruce. We show that the array has a 99.5% probe specificity, >98% Mendelian allelic inheritance concordance, an average sample call rate of 96.30% and an SNP call rate of 98.90% in family trios and haploid tissues. We also observed that 23,797 probes (50%) could be identified with high confidence in three other spruce species (white spruce [Picea glauca], black spruce [P. mariana] and Sitka spruce [P. sitchensis]). The high-quality genotyping array will be a valuable resource for genetic and genomic studies in Norway spruce as well as in other conifer species of the same genus.

1 INTRODUCTION

Forests occupy one-third of the global land mass, covering more than four billion hectares of the planet, and play key roles in water, oxygen and nutrient cycles as well as in carbon sequestration (FAO, 2016). The coniferous forest biome makes up one-third of the world's forests, representing 80% of the Earth's biomass (Neale & Kremer, 2011). Conifers also include some of the most important tree species used for plantation establishment, wood production and tree improvement programmes (FAO, 2015). Of the 264 million hectares covered by planted forests (6.6% of the total world forests), more than 50% consist of conifer species. The importance of conifers has motivated large investments into fundamental research on the basic and applied biology of trees (Plomion et al., 2016) and has driven the development of the most advanced tree breeding programmes in the world (Isik & McKeand, 2019; Wu et al., 2016). Projected climate changes in the 21st century are likely to have profound impacts on the functioning of Earth's ecosystems, including most conifer species (Garcia et al., 2014). Their commercial importance and the threats of climate change effects on conifers make it important to study biodiversity, the genetic basis of climate adaptation, and the genomic basis of productivity. Conifers are ideal species for such tasks due to their large geographical distribution and rich genetic diversity (Neale & Wheeler, 2019). To understand the genomic basis of climate adaptation and to accelerate tree breeding programmes in conifers, genetic markers have been used to dissect the genetic basis of adaptive and commercial traits and to explore marker-assisted selection. Traditionally, random DNA markers such as RFLPs, RAPDs, simple sequence repeats (SSRs) and single nucleotide polymorphism (SNP) markers derived from candidate gene approaches have been used for association studies (Thavamanikumar et al., 2013). 
Due to the limited number of markers available in such studies, the large number of quantitative trait loci (QTLs) underlying quantitative trait variation (Hall et al., 2016), and the rapid linkage disequilibrium (LD) decay in forest trees (Neale & Savolainen, 2004), the dissection of QTLs underlying quantitative trait variation has had limited success in conifers. Consequently, marker-assisted selection has not been implemented in tree breeding (Isik, 2014). However, the recent development of genomic selection (GS), which utilizes large numbers of genome-wide markers to predict complex phenotypes, has the potential to shorten breeding cycles, increase selection intensity and improve the accuracy of breeding values (Grattapaglia et al., 2018). One of the main limiting factors in implementing GS in conifers remains the lack of affordable, reliable and abundant genome-wide markers.

Several SNP arrays have recently been developed in conifers for use in genome-wide association studies (GWAS) and GS. These have mostly been based on candidate gene sequencing but have also utilized data from microarrays or RNA sequencing, and they generally range from a few thousand SNPs (Bartholome et al., 2016; Beaulieu et al., 2014; Resende et al., 2012; Zapata-Valenzuela et al., 2013) to several tens of thousands of SNPs (Howe et al., 2020; Perry et al., 2020). Two high-density SNP arrays relying on the Infinium iSelect technology were designed for the conifer species white spruce (Picea glauca), containing 7,338 and 9,559 SNPs, respectively, using in silico SNP prediction through the alignment of transcript sequences and candidate genes (Pavy et al., 2013). A 9K Illumina Infinium SNP array was developed for maritime pine (Pinus pinaster) by bundling markers from SNPs discovered in candidate gene sequencing and from 454 sequencing reads of RNA derived from multiple tissues from three provincial parents (Plomion et al., 2016). A similar Infinium SNP array was developed from in silico SNP resources and exome capture sequencing for black spruce (Picea mariana) (Pavy et al., 2016). Recently, an Axiom SNP genotyping array with 55K SNPs was developed for Douglas-fir (Pseudotsuga menziesii) from transcriptome sequencing (Perry et al., 2020). For Norway spruce, high-quality SNPs have been developed based on large-scale sequence capture and have been employed for both GWAS and GS (Azaiez et al., 2018; Baison et al., 2019; Chen et al., 2018; Vidalis et al., 2018). Various SNP arrays are also available for poplar and other broadleaved tree species and have been used in association genetics and GS studies (Geraldes et al., 2013). 
One of the most successful SNP arrays in hardwood tree species is the EUChip60K, which was based on resequencing of 240 trees from 12 species (Silva-Junior et al., 2015) and has been used to genotype many thousands of Eucalyptus trees for GS and GWAS (Grattapaglia et al., 2018).

Conifers, and particularly the commercially important pine and spruce species, have large genomes spanning 20 to 30 Gb. Developing genome-wide SNP arrays, covering both intragenic and intergenic regions, was until recently still a significant challenge due to the lack of high-quality reference genomes. The particular challenge with genotyping conifer genomes stems from their large and complex genomes that contain a high fraction of repetitive elements and abundant polymorphisms, which yields many opportunities for spurious binding of probes or primers. However, recent genome sequencing of several conifer species (Neale et al., 2014, 2017; Nystedt et al., 2013; Stevens et al., 2016; Warren et al., 2015) has made it possible to develop genome-wide marker panels using whole genome resequenced trees for GWAS, population genetics studies and GS. In this paper, we report the development, evaluation and transferability of a highly efficient Norway spruce 50K SNP array using whole genome resequencing, probably for the first time in conifers.

2 MATERIALS AND METHODS

2.1 Plant materials

We used three steps to design and validate the final genotyping array. First, we used whole genome resequencing data from 35 Norway spruce samples, previously described in Bernhardsson et al. (2020) and Wang et al. (2020), for the initial SNP selection. Second, we screened the selected SNPs in 480 Norway spruce samples collected from two field trials: 258 trees from a provenance trial of a species range-wide collection established in Hungary, and 222 trees from a Swedish breeding population trial established by Skogforsk (Table 1). All 480 samples were screened using a pilot screening array consisting of ~450K SNPs and these data formed the basis for the final SNP selection. Among the 480 trees, nine individuals were replicated twice each to serve as internal controls. Finally, to evaluate the final 50K array we genotyped three sets of samples. First, we genotyped a collection of 28 haploid megagametophytes collected from cones of the reference genome individual Z4006 (Nystedt et al., 2013). Second, a set of Norway spruce full-sibling trios collected from four families (48 trees in total) was genotyped to assess possible Mendelian segregation errors. Finally, we genotyped 49 white spruce (Picea glauca), 61 black spruce (Picea mariana) and 50 Sitka spruce (Picea sitchensis) samples planted in Sweden to assess the between-species transferability of the final array. Detailed information regarding sampling origins and sample metadata is available in Tables S1–S5.

Table 1. Sample origin of the 480 genotypes used for screening the pilot array; numbers in parentheses show the number of samples from each origin and trial that passed the QC thresholds
| Sample origin | Swedish breeding population trial | Hungarian provenance trial | Total |
|---|---|---|---|
| Russian-Baltic (Rus_Bal) [2] | 9 | 10 | 19 |
| Alpine (ALP) [3] | 63 | 86 (84) | 149 (147) |
| Central Europe (CEU) [4] | 9 | 115 (109) | 124 (118) |
| Northern Poland (NPL) | 8 | 13 | 21 |
| Carpathian (ROM) [5] | 1 | 16 | 17 |
| Fennoscandia (NFE) [1] | 41 (38) | 1 | 42 (39) |
| Southern/Central Scandinavia (C_Sc) | 87 | 17 (16) | 104 (103) |
| Unknown (U) | 4 | | 4 |
| Total | 222 (219) | 258 (249) | 480 (468) |

Note

  • Sample origin: 1. Fennoscandia contains samples from Finland and northern Sweden; Southern Scandinavia from Central/Southern Sweden and Central/Southern Norway; 2. Russian-Baltic from Russia, Belarus, Estonia, Latvia and Lithuania; 3. Alpine from Denmark, Germany, Switzerland, France and Italy; 4. Central Europe from Slovakia, Czech Republic, Southern Poland, Hungary and Austria; 5. Carpathian from Romania and Bulgaria.

For the haploid megagametophytes, seeds were soaked in 1% H2O2 for 16 hr and germinated in a Petri dish on top of moistened filter paper at room temperature (~21°C). When embryos reached ~5 mm in length, seed coats were removed and megagametophytes were separated from embryos using sterile razor blades and manually ground in liquid N2 in 1.5-ml tubes using plastic pestles. The diploid samples used for screening the pilot array, for validating genotyping rates and for assessing transferability were collected during early summer 2018, and DNA was extracted from either newly flushed needles or cambium samples. DNA was extracted using a NucleoSpin Plant II DNA Kit (Macherey-Nagel) following the default protocol. Based on NanoDrop 2000 (Thermo Fisher Scientific) measurements, the DNA yield was highly variable among samples, ranging from 303 to 1,116 ng (mean ± SD = 465 ± 201 ng). The extracted DNA samples were shipped to the Microarray Research Services Laboratory at Thermo Fisher Scientific on dry ice and were requantified using picogreen.

2.2 Construction of the pilot screening array

The 35 whole genome resequenced Norway spruce samples were originally collected from Russia (one), Romania (one), Poland (one), Belarus (one), Sweden (22), Norway (five) and Finland (four) (described in more detail in Bernhardsson et al., 2020 and Wang et al., 2020). The WGS samples were used to find and extract candidate genome sequences for probe design of the screening array. In short, the mapping and genotype calling of samples were performed as follows. The raw sequencing reads were mapped against the full version 1.0 assembly of Norway spruce (Nystedt et al., 2013) using bwa mem version 0.7.15 (Li & Durbin, 2009), with default parameters, and the BAM files were subsequently subset using samtools version 1.5 (Li et al., 2009) to only include scaffolds >1 kb. The reduced assembly and BAM files (containing 1,970,460 out of ~10 million scaffolds and 9.4 Gb out of 12.5 Gb of the full version 1.0 genome assembly) were then split into 20 subsets, each containing ~100,000 scaffolds. All subset BAM files were then marked for optical duplicates using picard version 2.0.1 (https://broadinstitute.github.io/picard/) and realigned around indels using gatk version 3.7 (McKenna et al., 2010). Per-individual variants were called using gatk haplotypecaller in g.vcf format (DePristo et al., 2011; Van der Auwera et al., 2013) before a joint genotype call over all 35 individuals was conducted separately on the 20 genomic subsets using gatk genotypegvcf (DePristo et al., 2011; Van der Auwera et al., 2013).

The combined raw VCF-file (containing more than 709 million SNPs and 43 million indels, Figure 1) across the 20 genomic subsets was filtered in several steps. First, only bi-allelic SNPs >5 bp away from an indel and that followed the filtering criteria based on gatk’s “best practice” (https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set) were kept (Bernhardsson et al., 2020). Since six of the WGS samples had a quite low sequence coverage (average coverage ~6×) and thereby also a lower confidence in SNP calls, the VCF-files were subset to only include 29 samples derived from Norway, Finland and Sweden (Fennoscandia), which all had high coverage (15–20× for called sites on average). Since the Norway spruce genome is highly repetitive (~70% of the 1K scaffold assembly contains repeat sequences; Nystedt et al., 2013), we filtered individual calls for depth, accepting a range between 6× and 30× per individual with a genotype quality (GQ) > 15. Only SNPs with an alternative allele frequency (AF) between 0.05 and 0.95 and with a maximum of 30% missing data were kept at this filtering step. To fulfil Affymetrix's filtering criteria (https://tools.thermofisher.com/content/sfs/brochures/snp_template_for_axiom_mydesign_custom_arrays_v2.zip), we then extracted 71-mer probe sequences for SNPs with >20 bp to the nearest SNP and where a maximum of five individuals showed missing data. If no gaps (Ns) were found in the probe sequences that we extracted from the assembly, the SNP was considered a good candidate for in silico probe evaluation. A final down-sampling was made of all candidate probes to fit the recommended number of probes used for testing (3,757,630 probe sequences). 
During this filtering, all SNPs positioned within gene models (hereafter called intragenic SNPs) were kept, while SNPs outside of gene models (hereafter called intergenic SNPs) were filtered to exclude A/T and C/G substitutions, as these require twice the number of probes per SNP in comparison to other SNP substitutions. The remaining intergenic SNPs were down-sampled so that every sixth SNP was kept. When ranking the proposed markers, all intragenic SNPs were considered “important” while all intergenic SNPs were assigned a “standard” importance. This resulted in a total of 3,757,630 SNPs, which were sent to ThermoFisher's bioinformatics service for in silico Axiom testing (Figure 1).
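
The per-call and per-site thresholds described above (depth 6× to 30×, GQ > 15, alternative allele frequency 0.05 to 0.95, at most 30% missing data) can be sketched as a small filter. This is a minimal illustration and not the pipeline used in the study: the `Call` tuple layout and the function names `mask_call` and `keep_site` are hypothetical, and real data would be read from a VCF file.

```python
from typing import Optional, Sequence, Tuple

# One genotype call per sample: (alt_allele_count, depth, GQ), or None if uncalled.
# This tuple layout is an assumption made for the sketch.
Call = Optional[Tuple[int, int, int]]


def mask_call(call: Call, min_dp: int = 6, max_dp: int = 30, min_gq: int = 15) -> Call:
    """Treat a call as missing unless min_dp <= depth <= max_dp and GQ > min_gq."""
    if call is None:
        return None
    _, depth, gq = call
    if min_dp <= depth <= max_dp and gq > min_gq:
        return call
    return None


def keep_site(calls: Sequence[Call], min_af: float = 0.05, max_af: float = 0.95,
              max_missing: float = 0.30) -> bool:
    """Keep a bi-allelic SNP if missingness and alternative allele frequency pass."""
    masked = [mask_call(c) for c in calls]
    called = [c for c in masked if c is not None]
    if not called:
        return False
    missing = 1.0 - len(called) / len(masked)
    if missing > max_missing:
        return False
    # Diploid samples: two alleles per called individual.
    alt_freq = sum(c[0] for c in called) / (2 * len(called))
    return min_af <= alt_freq <= max_af
```

Masking individual calls before computing missingness means that a site covered at very high depth in many samples, as expected for collapsed repeats, is dropped through the missing-data threshold.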

Figure 1. Schematic illustration of the variant filtering pipeline for extracting candidate probe sequences for the Axiom in silico testing at ThermoFisher. Each of the filtering steps described in the text is presented in a grey box with the number of surviving SNPs labelled beside it

For quality control of the array, 8,000 36-mer probe sequences (so-called DQC sequences, following ThermoFisher's guidelines) were extracted from monomorphic regions (based on the unfiltered VCF-file for all 35 samples) of a hard-masked version of the Norway spruce assembly. These DQC sequences were evenly distributed between the two strands (+/−) and also between A/T and C/G sites as the probe's ligation position (position 31 in the sequence). In total, 2,000 of these DQCs are incorporated into the array as controls for every run, to account for signal variation across the array at sites in the genome known not to vary among individuals.

To select 450K SNPs for the pilot screening array, in silico tests of 3,757,630 SNPs were conducted by Affymetrix. A pConvert score (ranging from 0 to 1) was produced for each SNP by the test. This score reflects the relative probability of probe success and is based on the thermodynamics of the probe sequence itself as well as the number of 16-nt hits found in the reference genome (Affymetrix used the Norway spruce reference genome version 1.0, Nystedt et al., 2013). The probes were first divided into two blocks, “not possible” and “buildable,” where the “not possible” probes are given a pConvert score of 0. For the “buildable” probes, the scores are subsequently translated into three recommendation levels, where a pConvert score of 0.6–1 is “recommended”, 0.4–0.6 “neutral” and 0–0.4 “not recommended.” Among the 3,757,623 SNPs (after removing seven duplicates), 761,311 markers were recommended and had no interfering polymorphisms located within 24 bases on either side of the marker. From these recommended markers we took all 259,994 intragenic markers plus the highest ranked intergenic SNPs (190,499), resulting in a total of 450,493 SNPs that were used for the design of the pilot screening array.
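
The translation from pConvert score to recommendation level can be written as a short function. This is only a sketch of the rule as described in the text, not Affymetrix's implementation; the function name is hypothetical, and because the stated ranges share their end points, the handling of scores of exactly 0.4 and 0.6 (assigned here to the higher level) is an assumption.

```python
def recommendation_level(p_convert: float, buildable: bool = True) -> str:
    """Map a pConvert score (0 to 1) to the recommendation level described in the text."""
    if not buildable:
        # "Not possible" probes are given a pConvert score of 0.
        return "not possible"
    if not 0.0 <= p_convert <= 1.0:
        raise ValueError("pConvert must lie between 0 and 1")
    if p_convert >= 0.6:  # boundary handling is an assumption
        return "recommended"
    if p_convert >= 0.4:  # boundary handling is an assumption
        return "neutral"
    return "not recommended"


# The pilot array combines the two selected marker classes reported in the text.
INTRAGENIC_SELECTED = 259_994
INTERGENIC_SELECTED = 190_499
PILOT_ARRAY_SIZE = INTRAGENIC_SELECTED + INTERGENIC_SELECTED  # 450,493 SNPs
```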

2.3 Genotype calling of Axiom screening array

In total, 480 Norway spruce samples from two trials (Table 1) were genotyped using the pilot screening array. Genotype calling of the 450K pilot Axiom screening array was performed using the Axiom analysis suite (version 4.0, available for download at https://www.thermofisher.com/se/en/home/life-science/microarray-analysis/microarray-analysis-instruments-software-services/microarray-analysis-software/axiom-analysis-suite.html), following best practice with default parameters (an SNP call rate cutoff [cr-cutoff] ≥ 0.97 and a sample call using a Dish-QC threshold [axiom_dishqc_DQC] ≥ 0.82) (Affymetrix, 2016). The sample call rate is defined as the average SNP call rate across all SNPs for a sample. The called genotypes were then used to classify the 450,493 SNPs into six categories of SNP performance (Table 2) (Affymetrix, 2016). A VCF file with allelic calls for all 450K SNPs, coded as A, T, C or G, was exported from the Axiom analysis suite and used for all downstream analyses.
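
To make the two call-rate definitions concrete, the sketch below computes per-sample and per-SNP call rates from a small genotype matrix. It is an illustration only and does not reproduce the Axiom analysis suite; the matrix layout (rows are samples, columns are SNPs, `None` for a no-call) and the function names are assumptions.

```python
from typing import List, Optional

Genotype = Optional[str]  # e.g. "AA", "AB", "BB", or None for a no-call


def sample_call_rates(matrix: List[List[Genotype]]) -> List[float]:
    """Per sample: fraction of SNPs that received a genotype call."""
    return [sum(g is not None for g in row) / len(row) for row in matrix]


def snp_call_rates(matrix: List[List[Genotype]]) -> List[float]:
    """Per SNP: fraction of samples that received a genotype call."""
    n_samples = len(matrix)
    n_snps = len(matrix[0])
    return [sum(row[j] is not None for row in matrix) / n_samples
            for j in range(n_snps)]


def passes(rate: float, cutoff: float = 0.97) -> bool:
    """Apply a call-rate cutoff such as the 0.97 default quoted in the text."""
    return rate >= cutoff
```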

Table 2. SNP metrics for the different cluster categories: full screening array, PolyHighResolution (PHR), NoMinorHom (NH), MonoHighResolution (MHR), CallRateBelowThreshold (CRBT), OffTargetVariant (OTV) and Other (O) markers

| Category [a] | Number of SNPs [b] | Average heterozygosity [c] | Average MAF [d] | Average missingness [e] |
|---|---|---|---|---|
| Full screening array | 450,493 (100%) | 0.17 (0.00–0.94) | 0.13 (0.00–0.50) | 0.04 (0.00–0.94) |
| PHR* SNPs | 176,800 (39.3%) | 0.24 (0.00–0.87) | 0.17 (0.00–0.50) | 0.01 (0.00–0.03) |
| NH* SNPs | 69,455 (15.4%) | 0.06 (0.00–0.50) | 0.03 (0.00–0.25) | 0.01 (0.00–0.03) |
| MHR* SNPs | 12,820 (2.9%) | 0.00 (—) | 0.00 (—) | 0.00 (0.00–0.03) |
| CRBT SNPs | 49,901 (11.1%) | 0.28 (0.00–0.85) | 0.22 (0.00–0.50) | 0.06 (0.03–0.94) |
| OTV SNPs | 3,404 (0.8%) | 0.16 (0.00–0.94) | 0.10 (0.00–0.50) | 0.03 (0.00–0.19) |
| O SNPs | 138,113 (30.7%) | 0.17 (0.00–0.89) | 0.12 (0.00–0.50) | 0.10 (0.00–0.94) |

  • a Clusters marked with * are recommended by ThermoFisher.
  • b Number of SNPs with the percentage of SNPs in parentheses.
  • c Average heterozygosity for SNPs with the range of heterozygosity in parentheses.
  • d Average minor allele frequency (MAF) for SNPs with the range of MAF in parentheses.
  • e Average missingness per SNP with the range of missingness in parentheses.

For the species transferability validation with white, black and Sitka spruce, genotype calling was performed using the best practice pipeline with a few modifications. It was not possible to use the default Dish-QC (axiom_dishqc_DQC ≥ 0.82) and sample call rate (qc_call_rate ≥ 0.97) thresholds, as a proportion of the probes were not expected to be transferable to these species. To obtain summary statistics for the probes and call genotypes to evaluate transferability in these spruce species, a modified sample Dish-QC threshold (0.75) and sample call rate threshold (0.75) were used, with the remaining setup being identical to the best practice pipeline.

2.4 Selection of the 50K SNP array from the pilot screening array

Although PolyHighResolution (PHR), NoMinorHom (NH) and MonoHighResolution (MHR) SNPs were all recommended by the Axiom analysis suite for consideration in downstream analyses, we selected the final 50K array only from the PHR SNPs for stringency. Three filtering steps were performed on the PHR SNPs to obtain the final 50K probes. First, SNPs with MAF lower than 0.05 in either of the two populations were excluded. Second, SNPs with pairwise LD ≥ 0.8 (linkage disequilibrium measured as r2) were pruned to reduce the number of nonindependent SNPs. To do this, all pairwise r2 values were calculated using vcftools (version 0.1.13) (Danecek et al., 2011). To minimize the computing time due to constant I/O operation, only SNP pairs with r2 values > 0.6 were output by using “vcftools --vcf INPUT.vcf --geno-r2 --min-r2 0.6 --out OUTPUT”. An “igraph” object was subsequently built from the vcftools output by connecting all SNP pairs with LD ≥ 0.8, and independent SNPs were then extracted from each cluster in four steps. (1) We built networks connecting all SNPs with LD ≥ 0.8, selected the hub SNPs and removed the radial SNPs in these networks, to minimize the number of selected SNPs while maximizing the information retained. (2) Because selecting hubs and removing radial loci from a network one at a time causes the old networks to collapse, we rebuilt the network from the remaining SNPs and repeated steps 1 and 2 until no networks with more than two SNPs were found. (3) We randomly selected one SNP from each of the remaining SNP pairs. (4) The hub SNPs from steps 1 and 2 and the SNPs from step 3 were kept for downstream analysis. All these analyses were performed using customized R scripts using the “igraph” package (available at https://github.com/yanjunzan/script/tree/master/umeaArray). 
Finally, SNPs with low average congruence scores (< 0.95, measured as the mean congruency across nine pairs of replicates), and SNPs with heterozygosity levels >0.6, were removed.
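
The hub-based pruning can be illustrated with a greedy sketch in plain Python. This is not the authors' R/igraph script: the function name, the input format (SNP identifiers plus `(snp_a, snp_b, r2)` triples, as could be parsed from the vcftools output) and the tie-breaking rule are assumptions, and a greedy pass of this kind does not guarantee the largest possible independent set.

```python
import random
from collections import defaultdict
from typing import Iterable, List, Tuple


def prune_ld(snps: Iterable[str],
             pairs: Iterable[Tuple[str, str, float]],
             r2_min: float = 0.8,
             seed: int = 0) -> List[str]:
    """Greedy hub-based LD pruning: keep hubs, drop their neighbours, repeat."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for a, b, r2 in pairs:
        if a != b and r2 >= r2_min:
            adj[a].add(b)
            adj[b].add(a)

    remaining = set(snps)
    kept = []

    # Steps 1 and 2: pick the most connected SNP, remove it and its neighbours,
    # then recompute degrees on what is left ("rebuilding the network").
    while True:
        degree = {s: len(adj[s] & remaining) for s in remaining}
        hubs = sorted(s for s in remaining if degree[s] >= 2)
        if not hubs:
            break  # only isolated SNPs and isolated pairs are left
        hub = max(hubs, key=lambda s: degree[s])
        kept.append(hub)
        remaining -= adj[hub] | {hub}

    # Step 3: keep one SNP at random from each remaining pair;
    # SNPs without any LD partner are kept as they are.
    done = set()
    for s in sorted(remaining):
        if s in done:
            continue
        partners = sorted(adj[s] & remaining)
        if partners:
            choice = rng.choice([s, partners[0]])
            done.update((s, partners[0]))
        else:
            choice = s
            done.add(s)
        kept.append(choice)
    return kept
```

Because every network with more than two SNPs contains at least one SNP with two or more partners, the loop stops exactly when only pairs and unlinked SNPs remain, which matches the stopping rule in the text.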

To select the final SNPs for the array, we attempted to cover as many genomic regions as possible by first selecting one SNP per scaffold. If an intragenic SNP within the scaffold was available, that SNP was prioritized; otherwise an intergenic SNP was randomly selected. G/C and A/T SNPs were avoided whenever possible. To tag as many unique gene models as possible, an additional 160 SNPs were selected to incorporate 160 gene models not yet covered under the preceding procedure. We also included an additional 125 SNPs that were flanking known associations from Baison et al. (2019), Elfstrand et al. (2020) or preliminary associations from GWAS on bud flush, bud set and wood quality traits (our unpublished data). Finally, an additional 1,608 SNPs were randomly selected to bring the total number of selected SNPs up to 47,445, which could fit on the 50K Axiom array together with ~2,000 control probes to account for background noise during imaging analysis. A final investigation, to confirm that the selected SNPs were evenly distributed across the Norway spruce genome, was performed by comparing the targeted scaffolds to available genetic maps (Bernhardsson et al., 2019 and our unpublished data) by counting the number of SNPs and scaffolds positioned on different linkage groups (LGs).
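
The one-SNP-per-scaffold rule could look like the sketch below. The `Snp` record, the function names and the exact order of the two preferences (intragenic first, then avoidance of A/T and C/G substitutions) are assumptions made for illustration; the text does not state how ties between the two criteria were resolved.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Snp(NamedTuple):
    snp_id: str
    scaffold: str
    intragenic: bool
    alleles: str  # e.g. "A/G"


# Substitutions that need twice as many probes on the array.
TWO_PROBE_SUBSTITUTIONS = {frozenset("AT"), frozenset("CG")}


def needs_two_probes(snp: Snp) -> bool:
    return frozenset(snp.alleles.split("/")) in TWO_PROBE_SUBSTITUTIONS


def pick_per_scaffold(snps: Iterable[Snp], seed: int = 0) -> Dict[str, Snp]:
    """Pick one SNP per scaffold, preferring intragenic and single-probe SNPs."""
    rng = random.Random(seed)
    by_scaffold = defaultdict(list)
    for snp in snps:
        by_scaffold[snp.scaffold].append(snp)

    chosen = {}
    for scaffold, candidates in by_scaffold.items():
        # Lower rank is better: (False, False) means intragenic and single-probe.
        def rank(s: Snp):
            return (not s.intragenic, needs_two_probes(s))
        best = min(rank(s) for s in candidates)
        pool = [s for s in candidates if rank(s) == best]
        chosen[scaffold] = rng.choice(pool)  # random among equally good SNPs
    return chosen
```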

2.5 Evaluation and validation of the 50K genotyping array

To evaluate the performance of the 50K genotyping array, we selected and genotyped another three sets of samples. First, four full-sib Norway spruce families consisting of two parents and between 12 and 14 offspring were genotyped to estimate the Mendelian inheritance (MI) error rate. The MI error rate was calculated as the proportion of family trios that violate the Mendelian inheritance rule. For example, under Mendelian inheritance only AB genotypes should be observed in the offspring when the parents are homozygous AA and BB, respectively. Similarly, when parents are homozygous AA and heterozygous AB their offspring should contain the two genotypes AA and AB. Second, 28 haploid megagametophytes were genotyped to evaluate the probe specificity and examine whether probes were binding to different paralogues. For a 100% probe specificity, all genotyped megagametophytes should be homozygous. Therefore, a specificity error rate was calculated for each probe as the proportion of megagametophytes showing a heterozygous call. Third, 160 samples from three other spruce species (white spruce, black spruce and Sitka spruce) were genotyped to evaluate the transferability of our spruce array to other spruce species.
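
Both error rates reduce to simple counting, as the following sketch shows. It assumes genotypes are coded as two-letter strings such as "AA", "AB" and "BB" and that missing calls (`None`) are left out of the denominator; both choices, like the function names, are assumptions rather than details given in the text.

```python
from typing import Iterable, Optional, Set, Tuple


def possible_offspring(parent1: str, parent2: str) -> Set[str]:
    """All offspring genotypes obtainable by taking one allele from each parent."""
    return {"".join(sorted(a + b)) for a in parent1 for b in parent2}


def is_mendelian(parent1: str, parent2: str, child: str) -> bool:
    return "".join(sorted(child)) in possible_offspring(parent1, parent2)


def mi_error_rate(trios: Iterable[Tuple[str, str, str]]) -> float:
    """Proportion of (parent1, parent2, child) trios violating Mendelian inheritance."""
    trios = list(trios)
    errors = sum(not is_mendelian(p1, p2, c) for p1, p2, c in trios)
    return errors / len(trios)


def specificity_error_rate(calls: Iterable[Optional[str]]) -> float:
    """Proportion of haploid megagametophytes with a heterozygous call at one probe."""
    called = [c for c in calls if c is not None]
    heterozygous = sum(c[0] != c[1] for c in called)
    return heterozygous / len(called)
```

For the two examples given in the text, `possible_offspring("AA", "BB")` contains only "AB", and `possible_offspring("AA", "AB")` contains "AA" and "AB".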

2.6 Principal component analysis and population structure

The population structure of the screening array samples was visualized using a principal component analysis (PCA). First, the realized additive relationship matrix (Figure 2) was constructed using the “A.mat” function from the rrBLUP R package (Endelman, 2011) and then a scaled and centred PCA was performed using the 459 nonreplicated samples with the “prcomp” function in R (R Core Team, 2015). This was done using either all PHR SNPs or the final 50K selected SNPs (Figure 5). The goal was to assess whether the estimated population structure was similar between the 50K SNP set and the full PHR (177K) SNP set.

Figure 2. Visualization of the additive relatedness matrix estimated across all 468 samples. The relatedness matrix was calculated with the A.mat function in the R package “rrBLUP” using all PolyHighResolution SNPs (176,800). Inset: zoom of the nine replicated samples

2.7 Further assessment on ascertainment bias, population structure and genetic diversity

Allele frequency distributions for the selected ~50K array SNPs and the ~177K PHR SNPs were compared to evaluate the selection bias in terms of MAF. In addition, we compared the MAF and heterozygosity between the range-wide provenance trial collection and Skogforsk's breeding population samples to determine how well the Swedish breeding population captured range-wide genetic diversity. These parameters were calculated for the 50K selected SNPs within each population. Using the estimates above, we also assessed the difference in diversity and how population structure was captured using intergenic or intragenic SNPs. All analyses were implemented with customized R/python scripts that are available on GitHub (https://github.com/yanjunzan/script/tree/master/umeaArray).

3 RESULTS

3.1 Construction of the 450K pilot screening array

A total of 3,757,630 SNPs, comprising all intragenic SNPs (692,845) and every sixth intergenic SNP that was not an A/T or C/G substitution (3,064,785), were selected from the original >709 million SNPs through the multiple filtering steps (Figure 1). These SNPs were sent to ThermoFisher for in silico probe evaluation and selection. After evaluation, all recommended intragenic SNPs (259,994) and the best ranked intergenic SNPs (190,499) were chosen for construction of the 450K pilot screening array.

3.2 Screening of the 450K pilot array and selection of the final 50K Axiom array

A total of 468 samples (97.5% of the total 480) passed the quality control for genotype calling and were considered successfully genotyped by the 450K screening array (Table 1). Based on the pairwise additive relationship, the nine replicated samples could be fully identified (Figure 2), which gave an average estimated genotype reproducibility of 99.8% over all 450K pilot array SNPs.
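
Genotype reproducibility between two replicates of the same tree can be summarized as the share of identical calls among SNPs called in both replicates. The sketch below shows one way to compute such a concordance; whether the reported 99.8% was obtained with exactly this definition is not stated in the text, so the handling of missing calls and the function name are assumptions.

```python
from typing import Optional, Sequence


def concordance(rep1: Sequence[Optional[str]],
                rep2: Sequence[Optional[str]]) -> Optional[float]:
    """Fraction of identical genotype calls among SNPs called in both replicates."""
    both_called = [(a, b) for a, b in zip(rep1, rep2)
                   if a is not None and b is not None]
    if not both_called:
        return None  # nothing to compare
    return sum(a == b for a, b in both_called) / len(both_called)
```

Averaging this value over the nine replicate pairs would give one overall reproducibility estimate.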

Based on hybridization performance and called genotypes, the SNPs were grouped into six categories. The pilot screening array SNPs were composed of all six categories (Table 2), with the largest number of SNPs (39.3%) belonging to the PHR SNPs. Average heterozygosity for all 450K SNPs was 0.17, with MAF of 0.13 and missingness of 0.04. The PHR SNPs displayed higher levels of both heterozygosity (0.24) and MAF (0.17), and showed a lower level of missingness (0.01) compared to the remaining SNPs. The other two recommended SNP categories, MHR and NH, showed very low levels of genetic variation among the 468 samples (Table 2). PHR SNPs were therefore the only category considered for the final 50K array.

In order to select the final ~50K SNPs, the ~177K PHR SNPs were filtered to only keep independent SNPs while tagging as many unique contigs and gene models as possible. This resulted in a final selection of 47,445 SNPs, covering 45,552 scaffolds and 19,794 gene models (Figure 3). To evaluate the genomic distribution of the selected ~50K SNPs, targeted scaffolds were compared to available genetic linkage maps (Bernhardsson et al., 2019 and our unpublished data), and the number of scaffolds positioned on the genetic maps, as well as the number of selected SNPs on that scaffold, were recorded for each linkage group. In total, 16,659 (35.2%) of the SNPs and 15,103 (33.3%) of the scaffolds could be positioned on the 12 LGs (Table 3), showing that the SNPs selected for the array have a genome-wide distribution. In total, 345 of these scaffolds, harbouring 482 SNPs, appear to be split across several LGs, indicating potential assembly errors (Table 3) (Bernhardsson et al., 2019).

Figure 3. Schematic illustration of the probe selection pipeline from the 450K screening array to the final 50K array

Table 3. Distribution of the ~50,000 final array markers positioned on scaffolds previously mapped to genetic linkage groups (LGs) (Bernhardsson et al., 2019 and our unpublished data)
LG (a) | Number of markers (scaffolds) (b) | Percentage of mapped markers (scaffolds) (c) | Percentage of total number of markers (scaffolds) (d)
LG 1 | 1,539 (1,403) | 9.2% (9.3%) | 3.2% (3.1%)
LG 2 | 1,342 (1,212) | 8.1% (8.0%) | 2.8% (2.7%)
LG 3 | 1,392 (1,271) | 8.4% (8.4%) | 2.9% (2.8%)
LG 4 | 1,306 (1,187) | 7.8% (7.9%) | 2.8% (2.6%)
LG 5 | 1,360 (1,221) | 8.2% (8.1%) | 2.9% (2.7%)
LG 6 | 1,260 (1,148) | 7.6% (7.6%) | 2.7% (2.5%)
LG 7 | 1,450 (1,327) | 8.7% (8.8%) | 3.1% (2.9%)
LG 8 | 1,364 (1,260) | 8.2% (8.3%) | 2.9% (2.8%)
LG 9 | 1,312 (1,187) | 7.9% (7.9%) | 2.8% (2.6%)
LG 10 | 1,303 (1,198) | 7.8% (7.9%) | 2.7% (2.6%)
LG 11 | 1,186 (1,089) | 7.1% (7.2%) | 2.5% (2.4%)
LG 12 | 1,363 (1,255) | 8.2% (8.3%) | 2.9% (2.8%)
Scaffold split over several LGs | 482 (345) | 2.9% (2.3%) | 1.0% (0.8%)
Total | 16,659 (15,103) | 100% (100%) | 35.2% (33.3%)
  • a The linkage group (LG) that the marker scaffolds were mapped to in the genetic maps. Markers positioned on scaffolds shown to be split over several LGs in the genetic maps are presented as a separate category.
  • b Number of markers positioned on scaffolds mapped to a certain LG. Number of unique scaffolds that are mapped to a certain LG is presented in parentheses.
  • c Percentage of mapped markers (16,659 in total) that are positioned on scaffolds mapped to a certain LG. Percentage of unique scaffolds (15,103 in total) is presented in parentheses.
  • d Percentage of markers (47,445 in total) that are mapped to a certain LG. Percentage of unique scaffolds (45,552 in total) is presented in parentheses.

Highly fragmented genome assemblies that are lacking large fractions of the genome due to high genomic repetitiveness can suffer from collapsed read mappings, which in turn may result in spurious SNP calls. Such false SNPs will show strong deviations from Hardy–Weinberg equilibrium (HWE) because they will have an excess of heterozygous calls due to the misalignment of reads from multiple genomic regions (Bernhardsson et al., 2020; McKinney et al., 2017). To analyse how the selected ~50K SNPs behave in comparison to the whole ~450K screening array and the ~177K PHR SNPs in terms of HWE, the MAF of each SNP was plotted against its observed heterozygosity (Figure 4). While the full ~450K screening array contains numerous SNPs with either too low or too high heterozygosity relative to their MAF, the majority of PHR SNPs and the selected ~50K SNPs follow the expected pattern under HWE. The selected SNPs also spanned the entire range of MAFs of the PHR SNPs, except at MAF < 0.05 because these were deliberately filtered out due to low polymorphism rates.

Figure 4. Scatter plot of the minor allele frequency and heterozygosity for the final SNP selection (50K, bright red) in comparison to all screened SNPs (450K, grey) and all PolyHigh resolution SNPs (177K, dark red)
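
The expectation behind this comparison can be written out explicitly. For a biallelic SNP with minor allele frequency p, Hardy–Weinberg equilibrium predicts a heterozygote frequency of 2p(1 − p), so SNPs arising from collapsed paralogous regions are expected to sit well above that curve. The Python sketch below is illustrative only: the function names, the 0/1/2 genotype coding and the tolerance of 0.1 are assumptions made for this example and are not taken from the study.

```python
def expected_heterozygosity(maf):
    """Expected heterozygote frequency under HWE for a biallelic SNP."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    return 2.0 * maf * (1.0 - maf)


def summarize_snp(genotypes):
    """Return (maf, observed_het) from genotypes coded as 0, 1 or 2 copies
    of one allele; None marks a missing call and is ignored."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    allele_freq = sum(called) / (2.0 * len(called))
    maf = min(allele_freq, 1.0 - allele_freq)
    observed_het = sum(1 for g in called if g == 1) / len(called)
    return maf, observed_het


def has_excess_heterozygosity(genotypes, tolerance=0.1):
    """Flag a SNP whose observed heterozygosity exceeds the HWE expectation
    by more than `tolerance` (an arbitrary, illustrative threshold)."""
    maf, observed_het = summarize_snp(genotypes)
    return observed_het - expected_heterozygosity(maf) > tolerance


# Toy genotypes, not data from the study.
well_behaved = [0, 0, 0, 0, 1, 1, 1, 1, 2, 0]  # close to the HWE curve
collapsed = [1] * 9 + [0]                      # almost every call heterozygous
print(has_excess_heterozygosity(well_behaved))
print(has_excess_heterozygosity(collapsed))
```

A formal analysis would normally replace the fixed tolerance with a per-SNP HWE test; the threshold here only serves to make the geometric idea of Figure 4 concrete.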

PCA indicates that the final 50K SNP set captures the same population structure as the PHR 177K SNP set for both the trees from the range-wide provenance trial and the trees from the Swedish breeding population. The two clusters representing the two trials form a classical “horseshoe shape” (Figure 5) that is characteristic of samples where genetic similarity decays with geographical distance (Novembre & Stephens, 2008). The trees from the two trials (Skogforsk and Hungary) showed a partly overlapping population structure, even though the majority of the Skogforsk breeding population, which contains more samples with a Northern origin, occupies the right cluster while the Hungarian trial, which contains more samples with an Alpine or a central European origin, occupies the left cluster (left panel in Figure 5; Table 1). The patterns were clearer when samples were grouped by origin rather than by the trial to which they belonged. Samples in the right cluster had a northwest–northeast origin (with samples from Fennoscandia [FNE], Southern/Central Scandinavia [C_Sc], Russian-Baltic [Rus_Bal] and Northern Poland [NPL]) while the left cluster had a more southwest–southeast origin (with samples from the Alpine region [ALP], central Europe [CEU] and Carpathians [ROM]). The four samples with unknown origin grouped in the middle of the FNE samples (right panel in Figure 5). Four of the documented ALP samples were positioned in between the two clusters, which might indicate a hybrid origin, and a small proportion of the samples did not group according to their documented origin, which might indicate sample mix-ups when the population trials were established and the sample origins were documented.

Figure 5. Population structure estimated using a principal component analysis on the relatedness matrix calculated based on all 177K PolyHigh resolution SNPs (top row) and from the final 50K SNP selection (bottom row). Left-hand panels are coloured based on which provenance trial the samples originate from, while the right-hand panels are coloured based on documented sample origin. Replicated samples have been removed from the analysis. NFE: Fennoscandia (samples from Finland and northern Sweden); C-sc: Southern Scandinavia (Central/Southern Sweden and Central/Southern Norway); Rus_Bal: Russian-Baltic (Russia, Belarus, Estonia, Latvia and Lithuania); NPL: Northern Poland; ROM: Carpathian (Romania and Bulgaria); CEU: Central Europe (Slovakia, Czech Republic, Southern Poland, Hungary and Austria); ALP: Alpine (Denmark, Germany, Switzerland, France and Italy); U: unknown

3.3 Evaluation and validation of the 50K array

Twenty-eight Norway spruce haploid megagametophytes (Table S3), 48 samples from four full sib families consisting of the two parents and between 12 and 14 offspring and 160 samples from white, black and Sitka spruce (Table S4) were used for validation of the final 50K SNP array. Because this array was specifically designed for Norway spruce, joint genotype calling for all samples/species using the Axiom best practice was not possible due to the variable probe performance in the three other species. Therefore, two independent genotype calling runs were performed, one for all Norway spruce samples following the best practice in the Axiom analysis suite and a second for the other spruce species, which employed slightly lower sample QC values. A few samples, including four offspring, four haploid megagametophytes and one black spruce, were removed from the downstream analyses because they failed the sample QC. The overall performance of this array was then evaluated using sample and probe (SNP) call rate, probe specificities and MI error rates estimated from the remaining samples.

3.3.1 Sample and SNP call rate and probe specificity

The average sample call rate was 98.90% (minimum 97.67% and maximum 99.43%, Figure 6a). Out of the 47,445 probes, 45,541 (96%) were classified in the three high-confidence categories (PHR, MHR, NH) with an average call rate of 99.11% (minimum 85.77% and maximum 100.00%, Figure 6b). The remaining 1,904 SNPs, classified as OTV or Other, were not recommended for reasons described above (Table S1). The average probe specificity, calculated as the proportion of samples with homozygous calls among 24 haploid megagametophytes, was 99.5% (Figure 6c; Table S5). The high specificity and call rate illustrate that the designed array is of high quality.
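
The three metrics in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Axiom pipeline: genotype calls are represented as hypothetical two-letter strings ("AA", "AB", "BB") with None for a failed call, and missing calls are excluded from the denominator of the specificity, a choice the text does not specify.

```python
def call_rate(calls):
    """Fraction of non-missing calls; None marks a failed call."""
    if not calls:
        raise ValueError("empty call list")
    return sum(1 for c in calls if c is not None) / len(calls)


def probe_specificity(haploid_calls):
    """Proportion of homozygous calls among called haploid samples.

    A haploid megagametophyte carries a single allele, so a heterozygous
    call suggests the probe hybridizes to more than one genomic region.
    """
    called = [c for c in haploid_calls if c is not None]
    if not called:
        raise ValueError("no called genotypes")
    homozygous = sum(1 for c in called if c[0] == c[1])
    return homozygous / len(called)


# Toy matrix (hypothetical): rows are probes, columns are haploid samples.
matrix = {
    "probe_1": ["AA", "AA", "BB", "AA"],
    "probe_2": ["AA", "AB", "BB", None],
}
probe_call_rates = {p: call_rate(c) for p, c in matrix.items()}
specificities = {p: probe_specificity(c) for p, c in matrix.items()}
# Sample call rates are obtained by reading the same matrix column-wise.
sample_call_rates = [call_rate(list(col)) for col in zip(*matrix.values())]
```

Averaging `specificities` over all probes corresponds to the array-wide probe specificity reported above, and averaging `sample_call_rates` corresponds to the average sample call rate.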

Figure 6. Summary of the array evaluation metrics. (a) Histogram of the sample call rate for Norway spruce. The dashed red line indicates the averaged call rate. (b) Histogram of the probe call rate for Norway spruce. The dashed red line indicates the averaged call rate. (c) Histogram of the proportion of homozygous calls for 45,541 probes estimated using 24 haploid tissues. The dashed red line indicates the averaged proportion of homozygous calls. (d) Histogram of the Mendelian inheritance (MI) error rate for 36,256 probes estimated using 48 family trios. (e) Principal component analysis for all four spruce species. (f) Principal component analysis for the three non-Norway spruce species

3.3.2 Mendelian inheritance (MI) error rate

Among the 45,541 high-confidence probes, 6,438 were fixed for alternative alleles (P1 = AA, P2 = aa) in at least one family and 36,256 were fixed for the same allele (P1 = AA, P2 = AA) in at least one family. Unfortunately, the first set of probes is entirely contained within the second, leaving 36,256 probes that could be evaluated for Mendelian segregation errors (see Materials and Methods). Overall, there were very low rates of Mendelian segregation errors, with 97.8% of the probes having MI error rates of <5% (Figure 6d).
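
The logic of the MI check can be illustrated as follows. When both parents are homozygous, every offspring genotype is fully determined (AA × aa gives only Aa; AA × AA gives only AA), so any other call counts as an error. The Python sketch below implements a generic version of this rule; the function names and the string coding of genotypes are assumptions for this example, and the exact procedure used in the study is the one described in its Materials and Methods.

```python
from itertools import product


def possible_offspring(parent1, parent2):
    """All genotypes obtainable by drawing one allele from each parent.
    Genotypes are two-character strings; allele order is ignored."""
    return {"".join(sorted(a + b)) for a, b in product(parent1, parent2)}


def mi_error_rate(parent1, parent2, offspring):
    """Fraction of called offspring whose genotype is impossible given
    the parental genotypes. None marks a missing call and is skipped."""
    allowed = possible_offspring(parent1, parent2)
    called = ["".join(sorted(g)) for g in offspring if g is not None]
    if not called:
        raise ValueError("no called offspring")
    errors = sum(1 for g in called if g not in allowed)
    return errors / len(called)


# Toy family (hypothetical): parents fixed for alternative alleles and
# twelve offspring, one of which carries an impossible AA call.
toy_offspring = ["Aa"] * 11 + ["AA"]
toy_rate = mi_error_rate("AA", "aa", toy_offspring)
```

In this toy family one of twelve offspring violates the expectation, so the probe would exceed the 5% threshold used in Figure 6d.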

After QC of the family trio and haploid megagametophyte samples, 1,645, 1,298 and 797 probes may not meet the quality standards for probe call rate, specificity and MI error rate, respectively, yielding at least 42,598 (90%) high-quality probes on the array that are available for genotyping analyses with high confidence (Table S5).

3.3.3 Array ascertainment bias

The MAF values of SNPs were divided into 25 bins (2% intervals) and the frequency distributions were compared between the 50K array and the full MAF distribution of the ~177K PHR SNPs. The results show that the final array captured on average 2.7% of the SNPs from each MAF bin with relatively even coverage from 2.2% to 2.9%, except for SNPs with MAF < 5%, which were excluded intentionally when selecting SNPs from the ~177K PHR SNPs (Figure S1a). This indicates that there was no obvious bias in the selection of SNPs based on MAF.
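
The per-bin comparison can be sketched as below. This is a minimal Python illustration, assuming two plain lists of MAF values (one for the candidate PHR SNPs, one for the selected subset); the function names are hypothetical and the toy values are not data from the study.

```python
def bin_index(maf, n_bins=25, max_maf=0.5):
    """Index of the equal-width MAF bin (2% wide when n_bins=25)."""
    if not 0.0 <= maf <= max_maf:
        raise ValueError("MAF must lie in [0, 0.5]")
    # A MAF of exactly 0.5 is placed in the last bin rather than a 26th one.
    return min(int(maf / max_maf * n_bins), n_bins - 1)


def fraction_captured_per_bin(candidate_mafs, selected_mafs, n_bins=25):
    """For each bin, the share of candidate SNPs that were selected.
    Bins without any candidate SNP are reported as None."""
    candidates = [0] * n_bins
    selected = [0] * n_bins
    for maf in candidate_mafs:
        candidates[bin_index(maf, n_bins)] += 1
    for maf in selected_mafs:
        selected[bin_index(maf, n_bins)] += 1
    return [s / c if c else None for s, c in zip(selected, candidates)]


# Toy input: nothing is selected from the lowest bin, mimicking a MAF filter.
toy_candidates = [0.01, 0.01, 0.03, 0.03, 0.03, 0.03, 0.25]
toy_selected = [0.03, 0.25]
toy_fractions = fraction_captured_per_bin(toy_candidates, toy_selected)
```

A roughly constant fraction across bins indicates that the selection did not favour particular allele frequencies, while bins removed by a MAF filter show up with a fraction of zero.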

3.3.4 Comparison of genetic diversity between range-wide collection and breeding populations

When comparing the distribution of MAF and heterozygosity between the range-wide provenance trial and the Skogforsk breeding population, we noticed a slight enrichment of low-frequency alleles in the provenance trial (mean MAF is 0.16 and 0.18 for the provenance trial and Skogforsk population, respectively; Figure S1b,c) and a slightly lower heterozygosity (0.23 for the provenance trial and 0.27 for the Skogforsk population; Figure S1d,e). In addition, there were 66 SNPs that were fixed in the provenance trial but which were all segregating in the breeding population. The array was designed based on variants segregating in a resequencing panel consisting of trees sampled from the Nordic countries, and the 66 nonvariable SNPs observed in the range-wide provenance population could therefore indicate a slight ascertainment bias in the SNPs included on the array.

3.3.5 SNPs from intragenic and intergenic regions

We observed a minor, but statistically significant difference in both MAF (mean MAF is 0.169 and 0.176 for intergenic and intragenic SNPs, respectively; p = 1.0 × 10⁻⁷ from t test) and heterozygosity (mean heterozygosity is 0.250 and 0.256 for intergenic and intragenic SNPs, respectively; p = 8.5 × 10⁻⁹ from t test) in the screening data. However, these differences are only significant due to the large number of SNPs assessed and do not represent biologically significant differences. In line with this, the two sets of SNPs differ very little in the population structure they capture (Figure S1f–i).
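
The distinction between statistical and biological significance can be made concrete by separating the test statistic, which grows with sample size, from the standardized effect size, which does not. The Python sketch below uses the two mean MAF values reported above; the standard deviation of 0.13 and the group sizes are assumptions chosen purely for illustration and are not values from the study.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic computed from group summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


# Means are the MAF values reported in the text. The SD and the group
# sizes are ASSUMED for illustration and are not taken from the study.
ASSUMED_SD = 0.13
for n in (200, 20_000):
    t = welch_t(0.176, ASSUMED_SD, n, 0.169, ASSUMED_SD, n)
    d = cohens_d(0.176, ASSUMED_SD, n, 0.169, ASSUMED_SD, n)
    print(f"n per group = {n}: t = {t:.2f}, Cohen's d = {d:.3f}")
```

Under these assumptions the effect size stays the same regardless of n, while the t statistic increases with the square root of the group size, which is why a very small difference in means can yield a very small p value when tens of thousands of SNPs are compared.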

3.3.6 Transferability to other spruce species

Although the array was designed to target Norway spruce, half of the probes (23,797) were called with high confidence in three other spruce species (white, black and Sitka spruce). A PCA on all the samples clearly separated the four species into two major clusters (Figure 6e). As expected, the other three spruce species, which all belong to the North American clade of Picea (Clade II in Lockwood et al., 2013), were more genetically similar to each other than to Norway spruce. To evaluate whether these markers could be used to further distinguish the three North American species, a subsequent PCA with only the North American species was performed (Figure 6f). In this analysis, the three species were clearly separated into three major clusters with black spruce being closer to Sitka spruce than to white spruce, as expected, based on a published phylogeny for the genus Picea based on plastid, mitochondrial and nuclear sequences (Lockwood et al., 2013). These results demonstrate a potentially broader application of this array for more species within the same genus.

4 DISCUSSION

Development of efficient genotyping resources for identifying alleles underlying local adaptation, trait variation and GS in conifers is a significant challenge due to their large and complex genomes (Neale & Wheeler, 2019). Dissection of the molecular basis of trait variation in forest trees began in the 1990s with the introduction of QTL mapping in controlled-cross pedigrees using random DNA markers (Neale & Kremer, 2011; Neale 2004; Strauss et al., 1992). Later, SNP markers from candidate genes were used to exploit population-wide LD to perform association mapping (AM). The AM approach was initially applied in Eucalyptus (Thumma et al., 2005) and has subsequently been used in many conifer tree species (Beaulieu et al., 2011; Dillon et al., 2010; Gonzalez-Martinez et al., 2007). However, neither QTL analysis using limited family pedigrees nor the candidate gene approach for AM resulted in the identification of useful markers for forest breeding. This is because QTLs were mapped with very large confidence intervals on chromosomes due to the limited number of markers used (Grattapaglia et al., 2018).

To increase the marker density for AM in conifer trees, access to a genome-wide SNP array would enable high-throughput and relatively cost-efficient genotyping. SNP arrays have already been developed for a number of spruce species and in other conifers based on transcriptome data (Howe et al., 2020; Perry et al., 2020; Plomion et al., 2016). However, transcriptome-based approaches, such as RNA sequencing, have thus far yielded relatively small arrays, covering <10,000 SNPs in most cases, and due to the nature of transcriptome data they also generally lack genomic information from intergenic regions (Bartholome et al., 2016; Pavy et al., 2013, 2016).

The Axiom 50K Norway spruce SNP genotyping array is a novel and efficient resource for population and quantitative genetics and for GS studies. The array contains known intragenic and intergenic SNPs that are evenly distributed across the Norway spruce genome. The three-step strategy we used, with probe development based on WGS samples, screening of a large number of preliminary SNPs using two large trials, a breeding population and a species-wide range collection, and final array evaluation using both haploid and within-family segregation analyses to assess SNP specificity and Mendelian segregation of SNPs proves that this array is highly efficient and robust.

In comparison to other genotyping techniques, such as WGS, genotyping-by-sequencing (GBS) and sequence capture, which are computationally and bioinformatically demanding and/or expensive to perform (Baison et al., 2019; Pan et al., 2015; Wang et al., 2020), SNP arrays are less computationally demanding to analyse because the majority of the bioinformatics analyses were made when the chip was developed. GBS data often also include a large fraction of missing data which requires imputation and computational interpretation prior to subsequent analysis (Hussain et al., 2017). This makes our array very valuable for scientists and breeders with limited bioinformatic knowledge. The spruce genome, which is both very large (~19.6 Gb) and highly repetitive (~70% repeat content in scaffolds >1,000 bp), has made it difficult to develop a reference genome assembly of high quality. With only ~66% of the genome present in the currently available assembly (Nystedt et al., 2013), a large proportion of resequencing reads are redundant because they cannot be mapped to the assembly, which in practical terms increases the cost of sequencing per mapped base. However, there is also a risk that a proportion of the reads mapping to the reference would be misaligned if repetitive regions are collapsed in the assembly. This would increase the number of false variants in downstream analysis (Bernhardsson et al., 2020). This is another advantage of our Axiom 50K SNP genotyping array, as these risks were minimized by carefully selecting the probes to avoid such problematic genomic regions and subsequently evaluating the probe performance by specifically assessing probe specificity using haploid samples.

4.1 Screening array design and performance

Resequencing data have not been employed for selection of SNPs for a genotyping array in any conifer species to date, but this practice has been commonly used in many fruit trees and crops (Basil et al., 2015; Bianco et al., 2016; Singh et al., 2015; Marrano et al., 2019; Pandey et al., 2017; Roorkiwal et al., 2018; Wang et al., 2016) and is often combined with a large screening array (Montanari et al., 2019; Unterseer et al., 2014). Our screening array results indicate that prescreening of SNPs aids in the design of a high-quality genotyping array in conifers. Although large parts of the current assembly suffer from collapsed genomic regions (Bernhardsson et al., 2020), we were able to select all 450K probes from the highest ThermoFisher pConvert score category (0.6 ≤ pConvert ≤ 1.0) among the ~3.76 million candidate SNPs, which were obtained through filtering of the original >709 million SNPs. After using the screening array to genotype 480 trees, ~58% of the 450K screening probes yielded high-confidence SNPs that were recommended for inclusion on the final genotyping array by ThermoFisher (Table 2). In total, 39% of screening probes were also classified as PHR SNPs, making them high-quality candidates for the final array. With such a large number of PHR SNPs available to us, we were able to include only PHR SNPs on the final array.

4.2 Genotyping array performance

We evaluated the 50K genotyping array for probe specificity (uniqueness on the genome), Mendelian segregation error and population structure between the genotyping array data and the full set of ~177K PHR SNPs.

Probe specificity is particularly important for conifer genomes which are known to harbour abundant paralogues, pseudogenes and repeats. The specificity of the 50K SNP array is 99.5%, indicating that the SNPs selected are highly reliable and that they target unique regions in the Norway spruce genome. The probability that a probe hybridizes to more than one region of the genome is thus very low, being about 0.5%. Benchmarking probe specificity with SNP arrays developed for conifers or forest trees is not possible as probe specificities have not been reported for other arrays.

The probes on the 50K array were evenly distributed throughout the Norway spruce genome and also evenly distributed between intra- and intergenic regions, offering a truly genome-wide coverage that will be highly valuable for several downstream applications. The final array validation also showed that the selected SNPs have low Mendelian inheritance (segregation) error rates, with 98% of the probes having MI error rates < 5%, similar to what was observed for the EUChip60K (Mendelian allelic inheritance concordance > 95%; Silva-Junior et al., 2015).

The final Axiom 50K array was as efficient as the full 177K PHR set in identifying true population structure in the 468 screening samples and it had a high precision in identifying the origin of four unknown samples (Figure 5). The Swedish breeding population was sampled from a total of 5,056 breeding trees. The population structure of these trees has previously been studied using 134,605 SNPs derived from ~40,000 sequence capture probes (Chen et al., unpublished data). The 50K genotyping array identified the same population origin and structure for the 222 Swedish samples (e.g., seven geographical populations) as was obtained using either the 177K PHR SNP set or the large sequence capture SNP data set.

Across all 76 Norway spruce samples (28 megagametophytes and 48 family trio samples) genotyped using the 50K array for performance evaluation, as many as 45,000 SNPs (96%) were shown to belong to the three highest confidence categories (PHR, NH and MHR) with an average sample call rate of 98.9% and an SNP call rate of 99.11%. This is very high in comparison with results in Douglas-fir (88.2% sample call rate and 50.4% SNP call rate, Howe et al., 2020) and other tree SNP arrays which generally have failure rates on the order of 20% (Plomion et al., 2016). The sample and SNP call rates using our 50K array are comparable to or even higher than those of the EUChip60K array (average SNP call rate > 90% and sample call rates across all SNPs > 97%) even though our genome is about 30 times larger and substantially more complex than the Eucalyptus genome (Silva-Junior et al., 2015).

The reproducibility of a replicated sample is an important quality benchmark of array performance. The white spruce Infinium assays (PgAS1 of 13K SNPs and PgLM3 of 14K SNPs) showed 99.5% and 99.9% reproducibility (Pavy et al., 2013) and the genotyping accuracy for duplicated trees in Douglas-fir was 99.3% (Howe et al., 2020). Our screening array of 450K SNPs of Norway spruce had a reproducibility of 99.8% for replicated samples across all SNPs and the selected 50K SNPs had 100% reproducibility among the replicated samples, similar to what was observed for the EUChip60K array (Silva-Junior et al., 2015).

4.3 Array ascertainment bias

When designing an SNP array, the ascertainment procedures of the SNPs selected for inclusion on the array need to be carefully evaluated with future applications, such as population genetics and GWA studies, in mind. SNPs included on the array were selected to fulfil specific criteria, such as MAF, and therefore represent a biased subset compared to a random sample of SNPs. Such ascertainment bias causes systematic deviations of population genetic statistics from theoretical expectations and will inevitably be present when SNP array data are used for estimating population genetic parameters, such as genetic diversity, or when inferring population structure or the demographic history of a sample (Lachance & Tishkoff, 2013).

There are two kinds of ascertainment bias that need to be considered for SNP array data: depth and width. Ascertainment depth refers to the fact that only SNPs occurring at a sufficient frequency in a sample population (e.g., above a minimum MAF) are included on the final array. Ascertainment width, on the other hand, is affected because markers are generally first identified in a small panel of individuals from part of the species’ range. A comparison of MAF distributions between the 50K array and the full ~177K PHR SNPs revealed no significant bias in SNP ascertainment depth. When comparing the distribution of MAF and heterozygosity between our range-wide provenance trial and Skogforsk breeding population samples, we noted a slight enrichment of low-frequency alleles and consequently a slightly lower heterozygosity in the provenance trial. As the 29 trees used to design the array all had a Nordic origin (Central/Southern Sweden and Fennoscandia), this probably reflects a slight bias in ascertainment width, as more alleles with a Northern origin were captured in the resequenced samples. This small bias may reflect the possible influence of hybridization between Picea abies and P. obovata in Fennoscandia (e.g., Tsuda et al., 2016). Hybridization between the two species is known to have influenced genetic diversity in Fennoscandian populations, with a gradient of increasing effects of hybridization closer to the Ural Mountains (Tsuda et al., 2016). The range of distribution of P. abies in Fennoscandia represents the most recent expansion of this species following the last glaciation. There is also evidence that central Fennoscandia could have slightly higher levels of genetic composition and diversity due to the meeting of the two expansion routes that colonized this region since the Last Glacial Maximum (LGM) (Lagercrantz & Ryman, 1990).

4.4 SNPs from intragenic and intergenic regions

SNPs from intergenic regions are important for detecting associations in GWAS and inclusion of a large number of SNPs from intergenic regions is expected to increase both GWAS power and the efficiency of GS. For evolutionary population genetic analyses, markers in intragenic and intergenic regions may generally differ in patterns of variation, selection signature and their effects on trait variation. Thousands of trait-associated SNPs have been identified in intergenic regions in humans, and half of the disease-associated SNPs in humans that thus far have been identified reside within intergenic regions (Li et al., 2016). SNPs in genic regions are also more likely to display signatures of both positive and negative selection than SNPs in nongenic regions, and intergenic SNPs are key components of the spatial and regulatory network for human growth (Coop et al. 2009; Helyar et al., 2011; Schierding et al., 2016).

It has also been shown that intergenic and intragenic regions behave differently in terms of population genetic summary statistics in Norway spruce (Wang et al., 2020) and intergenic regions appear to have a higher impact on adaptation in species with larger genomes (Mei et al., 2018). SNP arrays developed thus far in conifers have largely been based on candidate gene and/or transcriptome sequencing. Because markers on those SNP arrays are mainly situated in or close to genes, they may not provide a representative view of genome-wide variation. In our array, we noted minor but statistically significant differences in both MAF and heterozygosity between intergenic and intragenic SNPs. This could indicate historical differences in the action of natural selection or the demographic history for different genomic regions in our screening populations. However, the two SNP sets differ very little in the pattern of population structure that they capture, suggesting that such effects may be small. By combining both intergenic and intragenic SNPs on our genotyping array, we can therefore give a much clearer picture of the genomic landscape of variation in terms of population genetic variation, adaptation and possibly also phenotype associations.

4.5 Array transferability to other spruce species

The genus Picea consists of a total of 35 species (Farjón, 2001). We tested the transferability of the array to three other spruce species that are important in commercial plantation, breeding and production in the northern hemisphere: white spruce, black spruce and Sitka spruce. We found that about 50% of the SNPs (23,797) can be reliably transferred to the three species and genotyped with high confidence. This transferability is high and similar to the 57% transfer rate observed between white spruce and Norway spruce (e.g., 0.5 million probes derived from 23,684 genes of white spruce were mapped to 13,543 Norway spruce genes) by Azaiez et al. (2018). The transferability of our SNP array is higher than what was observed for a white spruce SNP array used to genotype Sitka spruce (22.4%), black spruce (17.6%) or Norway spruce (12.5%) (Pavy et al., 2013). Our array is also able to clearly separate Norway spruce (Clade I) from the more distantly related species from the North American clade (white, black and Sitka spruce from Clade II, Lockwood et al., 2013). Picea obovata and P. omorika are two species that are more closely related to P. abies (all in Clade I) than the three North American spruce species. Although these two species are not of great commercial importance, the latter species has been the focus of conservation efforts and the SNP array could therefore potentially be applied to perform more basic research in this species. We have not tested the conversion rates of the array for these two closely related species, but given the close relationship among these three species, we expect the array to have a high success rate when genotyping P. obovata and P. omorika samples.

The 50K SNP genomic resources presented and evaluated for Norway spruce in this study represent an unprecedented effort to deploy high-throughput SNP genotyping in conifers. The 50K SNP array is the largest genotyping chip ever produced for any spruce species and includes SNPs from both intragenic and intergenic regions. We envisage that this array will make significant contributions to questions related to population genetics, comparative genomics, association genetics, genomic prediction and linkage mapping in Norway spruce as well as providing a template for designing future genotyping arrays in other spruce and conifer species.

ACKNOWLEDGEMENTS

This project was supported by the Swedish Foundation for Strategic Research (SSF) to H.X.W. (RBP14-0040) and Horizon2020 B4EST. The computation and data handling provided by the Swedish National Infrastructure for computing (SNIC) at Uppmax was partially funded by the Swedish Research Council (2016-07213). We would like to thank Tomas Funda, Lu Wang, Zhou Wei, Zuzana Binova and Linghua Zhou for help with DNA extraction, and Bo Karlsson, Anders Fries, Éva Ujvari Jarmay, László Nagy, David Hall, Jingxiang Meng and Ruiqi Pian for their assistance in field sample collections. We also thank Fikret Isik for organizing and coordinating the Conifer SNP array Consortium.

    AUTHOR CONTRIBUTIONS

    H.X.W. designed and planned the project; C.B. conducted analyses for resequencing, pilot array and population structure. Y.Z. conducted SNP calling for pilot and genotyping array. Z.C. designed and sampled trees. P.K.I. designed the resequencing experiment. C.B., Y.Z. and H.X.W. wrote the manuscript.

    Research interest: C.B. interested in bioinformatics; Y.Z. interested in bioinformatics; Z.C. interested in quantitative genetics; P.K.I. interested in population genetics; and H.X.W. interested in quantitative genetics and breeding.

    DATA ACCESSIBILITY STATEMENT

    Data from this project are archived in Figshare and accessible as: 1. Axiom 50K array for ~300 Norway spruce raw data and annotation file at https://doi.org/10.6084/m9.figshare.12631358.v1. 2. Axiom 450K SNP array for 480 Norway spruce raw data and array annotation file at https://doi.org/10.6084/m9.figshare.12630938.v1.
