Volume 21, Issue 1 pp. 316-326
RESOURCE ARTICLE
Open Access

A chromosome-level genome assembly of the woolly apple aphid, Eriosoma lanigerum Hausmann (Hemiptera: Aphididae)

Roberto Biello
Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, UK

Archana Singh
Earlham Institute, John Innes Centre, Norwich Research Park, Norwich, UK

Cindayniah J. Godfrey
NIAB EMR, Kent, UK

Felicidad Fernández Fernández
NIAB EMR, Kent, UK

Sam T. Mugford
Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, UK

Glen Powell
NIAB EMR, Kent, UK

Saskia A. Hogenhout
Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, UK

Thomas C. Mathers (Corresponding Author)
Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, UK

Correspondence
Thomas C. Mathers, Department of Crop Genetics, John Innes Centre, Norwich Research Park, Norwich, UK.
Email: [email protected]

First published: 28 September 2020
Citations: 28

Abstract

Woolly apple aphid (WAA, Eriosoma lanigerum Hausmann) (Hemiptera: Aphididae) is a major pest of apple trees (Malus domestica, order Rosales) and is critical to the economics of the apple industry in most parts of the world. Here, we generated a chromosome-level genome assembly of WAA, representing the first genome sequence from the aphid subfamily Eriosomatinae, using a combination of 10X Genomics linked-reads and in vivo Hi-C data. The final genome assembly is 327 Mb, with 91% of the assembled sequences anchored into six chromosomes. The contig and scaffold N50 values are 158 kb and 71 Mb, respectively, and we predicted a total of 28,186 protein-coding genes. The assembly is highly complete, including 97% of conserved arthropod single-copy orthologues based on Benchmarking Universal Single-Copy Orthologs (busco) analysis. Phylogenomic analysis of WAA and nine previously published aphid genomes, spanning four aphid tribes and three subfamilies, reveals that the tribe Eriosomatini (represented by WAA) is recovered as a sister group to Aphidini + Macrosiphini (subfamily Aphidinae). We identified syntenic blocks of genes between our WAA assembly and the genomes of other aphid species and find that two WAA chromosomes (EL5 and EL6) map to the conserved Macrosiphini and Aphidini X chromosome. Our high-quality WAA genome assembly and annotation provide a valuable resource for research in a broad range of areas such as comparative and population genomics, insect–plant interactions and pest resistance management.

1 INTRODUCTION

There are ~5,000 known species of aphid (Hemiptera: Aphididae), and ~100 of these are of significant agricultural economic importance (Blackman & Eastop, 2017). While aphid genomics research has mostly focused on the subfamily Aphidinae (International Aphid Genomics Consortium, 2010; Li et al., 2019; Mathers, 2020; Mathers et al., 2017; Mathers, Mugford, et al., 2020; Mathers, Wouters, et al., 2020; Nicholson et al., 2015; Thorpe et al., 2018; Wenger et al., 2016), a large and widespread group including many important pests (Blackman & Eastop, 2000), investigation of genome evolution across the wider diversity of aphids has been limited (but see Julca et al., 2020). This report represents the first complete genome sequence for a species from the subfamily Eriosomatinae, which includes three potentially polyphyletic tribes (Eriosomatini, Fordini and Pemphigini), all of which are distantly related to members of Aphidinae (Li et al., 2014; Nováková et al., 2013; Ortiz-Rivas & Martínez-Torres, 2010).

Woolly apple aphid (WAA, Eriosoma lanigerum Hausmann, tribe Eriosomatini) is probably North American in origin. It was probably introduced to Britain in 1796 or 1797 with infested apple trees imported from America (Theobald, 1920) and has subsequently become a cosmopolitan and highly damaging pest of apple worldwide (Barbagallo et al., 1997). Up to 20 generations per year on apple have been reported. First-instar nymphs (crawlers) are dispersive and walk to new feeding sites, but later developmental stages tend to be more sedentary, forming colonies distinguishable based on their fluffy white protective wax coating (Barbagallo et al., 1997). WAA is able to feed on apple roots, trunks, branches and shoots. Saliva secreted whilst feeding causes cambium cells to divide rapidly, forming a gall which collapses and becomes pulpy under pressure from proliferating cells, making the area vulnerable to fungal infections (Barbagallo et al., 1997; Childs, 1929; Staniland, 1924). Edaphic (soil-dwelling) WAA has a significant negative effect on plant growth, especially of young, nonbearing apple trees where feeding significantly reduces trunk diameter (Brown et al., 1995) that is strongly correlated with fruit production (Waring, 1920). Such widespread and significant damage has made WAA resistance a key objective for rootstock breeding since the early 20th century (Cummins & Aldwinckle, 1983).

Sexual and asexual reproduction in aphid life cycles has significant impacts on population structure and allelic diversity (Delmotte et al., 2002; Dixon, 1977). Species in the subfamily Eriosomatinae typically go through phases of both sexual and parthenogenetic reproduction during the life cycle, but (unlike in other aphid subfamilies) sexual males and egg-laying females lack mouthparts and are therefore unable to feed. The life cycle and mode of reproduction of WAA is somewhat ambiguous. In North America, where WAA populations have been reported to induce leaf galls on American elm (Ulmus americana L.) (Baker, 1915), it has been suggested that WAA has a heteroecious life cycle, alternating between apple (parthenogenetic reproduction) and elm (sexual reproduction). However, it is not clear whether genotypes found on elm are also capable of feeding on apple (Blackman & Eastop, 2000). In other parts of the world, WAA is assumed to be entirely anholocyclic (asexual) on apple (e.g., Dumbleton & Jeffreys, 1938; Eastop, 1966). Sexual males and females have sometimes been observed on apple, but the deposited eggs usually do not hatch, and such populations are assumed to be functionally asexual (Blackman & Eastop, 2000).

The genetic structure of WAA populations also has important practical relevance in the context of host-plant resistance. Four genes associated with WAA resistance in apple (Er1–4) have been identified from various sources (Bus et al., 2010, 2008, 2000; King et al., 1991). Some genotypes of WAA have been observed to feed on rootstocks with Er1-, Er2- and Er3-mediated resistance (Cummins & Aldwinckle, 1983; Rock & Zeiger, 1974; Sandanayaka et al., 2003). The prevalence and spread of such resistance-breaking genotypes within WAA populations has not yet been investigated, and the availability of a full genome sequence will benefit such studies considerably.

Here, we generated a high-quality chromosome-level genome assembly of WAA using a combination of 10X Genomics linked-reads and in vivo chromatin conformation capture (Hi-C) sequencing. Subsequently, gene prediction, functional annotation and phylogenetic analysis were carried out to determine the relationship of WAA within the superfamily Aphidoidea. Our reference genome can provide information about genome organization in the subfamily Eriosomatinae and allows comparative genomic studies for a better understanding of the evolution of aphids.

2 MATERIALS AND METHODS

2.1 Sampling

Aphids were collected from a population feeding on apple trees grown under glass at NIAB EMR. The glasshouse-grown plants were deliberately infested using WAA collected from local orchards. A single colony infesting a glasshouse-grown potted apple plant (a clonally propagated aphid-susceptible breeding line in the rootstock breeding programme at NIAB EMR) was sampled for aphids in November 2018. All sampled aphids were collected from one (12-cm) section of infested woody stem. Wax filaments covering sampled insects were removed using a fine paint brush and the aphids placed in Eppendorf tubes. For genome analysis, we collected individual adults (20 in total) into separate tubes and also generated a pooled sample containing mixed life stages (25 individuals) in a single tube. An additional two samples of grouped aphids (25 in total) were pooled into individual tubes for RNA sequencing (RNA-seq) analysis. These groups consisted of: apterous (wingless) adults; and mid-instar nymphs (mix of second, third and fourth instars). Collected aphids were snap frozen by immersing tubes in liquid nitrogen.

2.2 DNA extraction and sequencing

Total genomic DNA was extracted from a single aphid using an Illustra Nucleon Phytopure kit, with a modification of the manufacturer's protocol (GE Healthcare). The lysis step was performed by adding 10 µl of Proteinase K and incubating the sample in a water bath for 2 hr at 55°C. The DNA precipitation was performed using NaAc (3 M) together with isopropanol to increase the DNA yield. DNA quality and quantity were assessed using a Nanodrop spectrophotometer (Thermo Scientific), a Qubit double-stranded DNA BR Assay Kit (Invitrogen, Thermo Fisher Scientific) and Femto fragment analyser (Agilent). 10X Genomics library preparation and Illumina genome sequencing (HiSeq X, 150 bp paired-end) were performed by Novogene Bioinformatics Technology in accordance with standard protocols.

A pool of mixed-stage samples was used to construct a Hi-C chromatin contact map to enable a chromosome-level assembly. Dovetail Genomics created the Hi-C library with the DpnII restriction enzyme following a similar protocol to Lieberman-Aiden et al. (2009). The Hi-C library was sequenced on an Illumina HiSeq X sequencer and 150-bp paired-end reads were generated.

2.3 Genome assembly

To create the de novo assembly, the 10X Genomics linked-read data were assembled using supernova 2.1.1 (Weisenfeld et al., 2017) with the default parameters and the recommended number of reads (--maxreads=199222871) to produce the pseudohaplotype assembly output (--style=pseudohap). We improved the initial supernova assembly by performing iterative scaffolding using all of the 10X Genomics raw data (364 million reads; Table S1) following the procedure set out in Mathers, Wouters, et al. (2020). Briefly, we performed two rounds of Scaff10x (https://github.com/wtsi-hpag/Scaff10X) with the parameters “-longread 1 -edge 50000 -block 50000”, followed by misassembly detection and correction with tigmint (Jackman et al., 2018). These steps were followed by a final round of scaffolding with Assembly Round-up by Chromium Scaffolding (arcs) (Yeo et al., 2018). We aligned the Hi-C reads to the 10X Genomics assembly using the juicer pipeline (Durand et al., 2016). The assembly was then scaffolded with Hi-C data (Table S1) into chromosome-level organization using the 3d-dna pipeline (Dudchenko et al., 2017), followed by manual correction using Juicebox Assembly Tools (jbat) (Dudchenko et al., 2018). The assembly was polished after jbat review using the 3d-dna seal module to reintegrate genomic content removed from super-scaffolds through false-positive manual editing to create a final scaffolded assembly.

We checked the Hi-C assembly for contamination using the blobtools pipeline version 0.9.19 (Kumar et al., 2013; Laetsch & Blaxter, 2017) by generating taxon annotated GC content-coverage plots (known as “BlobPlots”). Each scaffold was annotated with taxonomy information based on blastn (Basic Local Alignment Search Tool) version 2.2.31 (Camacho et al., 2009) searches against the National Center for Biotechnology Information (NCBI) nucleotide database (nt, downloaded October 13, 2017) with the options “-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' -culling_limit 5 -evalue 1e-25”. To calculate average coverage per scaffold, we mapped the 10X Genomics raw reads, after barcode removal using proc10xg (https://github.com/ucdavis-bioinformatics/proc10xG), to the assembly using bwa-mem version 0.7.7 (Burrows-Wheeler Aligner) (Li, 2013) with default parameters. The resulting BAM file was sorted with samtools version 1.3 (Li et al., 2009) and passed to blobtools along with the table of blastn results. The mitochondrial genome was identified and removed based on alignment to the WAA mitochondrial genome (NCBI accession no. NC_033352.1) with nucmer version 4.0.0.beta2 (Marçais et al., 2018), and patterns of coverage and GC content obtained from blobtools. A frozen release was generated for the final assembly with scaffolds renamed and ordered by size with seqkit version 0.9.1 (Shen et al., 2016). We assessed the quality of the genome assembly by searching for conserved, single copy, arthropod genes (n = 1,066) with Benchmarking Universal Single-Copy Orthologs (busco) version 3.0 (Waterhouse et al., 2018) and by analysis of k-mer spectra with kat version 2.3.2 (K-mer Analysis Toolkit) (Mapleson et al., 2017) using the default k-mer size (k = 27). To generate a k-mer spectrum we compared k-mer content of the raw sequencing reads to the k-mer content of the assembly using kat comp. Using this spectrum we also estimated the WAA genome size, heterozygosity level and genome assembly completeness.
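The read-versus-assembly k-mer comparison described above can be illustrated with a small sketch. This is not the kat implementation; it is a minimal pure-Python illustration of the idea, assuming short in-memory sequences, and the function names (`canonical`, `count_kmers`, `compare_spectra`) are hypothetical. For each distinct k-mer in the reads it records the read multiplicity and the number of copies in the assembly, which is the information summarised in a kat comp style plot.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k: int) -> Counter:
    """Count canonical k-mers over an iterable of DNA strings.

    k-mers containing characters other than A, C, G or T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def compare_spectra(reads, assembly, k: int) -> Counter:
    """Map (read multiplicity, assembly copy number) to the number of distinct k-mers."""
    read_counts = count_kmers(reads, k)
    assembly_counts = count_kmers(assembly, k)
    matrix = Counter()
    for kmer, multiplicity in read_counts.items():
        matrix[(multiplicity, assembly_counts.get(kmer, 0))] += 1
    return matrix
```

In such a table, k-mers with an assembly copy number of 0 correspond to content present in the reads but absent from the assembly, and copy numbers above 1 at the single-copy read multiplicity point to duplicated content.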

2.4 RNA extraction and sequencing

Total RNA was extracted from two pools of ~25 aphids, one containing adults and one containing mid-instar nymphs, collected from the same colony. Each sample was ground under liquid nitrogen in a 1.5-ml Eppendorf tube using a plastic pestle. RNA was extracted using Trizol (Sigma) according to the manufacturer's protocol. RNA was further purified using RNeasy with on-column DNAse treatment (Qiagen) according to the manufacturer's protocol and eluted in 100 µl of nuclease-free water. RNA quality was assessed by electrophoresis of 5 μl denatured in formamide on a 1% agarose gel. Purity was assessed using a Nanodrop spectrophotometer (ThermoFisher) to measure the A260/A280 and A260/A230 ratios. Concentration of RNA was measured, and the presence of contaminating DNA was assessed using a Qubit (Lifetech).

Quality control and trimming for adapters and low-quality bases (quality score <30) of the RNA-seq raw reads were performed using fastqc version 0.11.8 (Andrews, 2010) and trim_galore version 0.5.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore) respectively.

2.5 Gene prediction

Before running the gene prediction, we identified repeats and transposable elements (TEs) with repeatmasker version 4.0.7 (Tarailo-Graovac & Chen, 2009) using the RepBase Insecta repeat library (Bao et al., 2015) with the parameters “-e ncbi -species insecta -a -xsmall -gff” (Jurka et al., 2005).

We mapped the quality- and adapter-trimmed RNA-seq reads from the two pools of adults and mid-instar nymphs (Table S2) to the soft-masked assembly with hisat2 version 2.0.5 (Kim et al., 2015) with the following parameters: “--max-intronlen 25000 --dta-cufflinks” followed by sorting and indexing with samtools version 1.3 (Li et al., 2009). Strand-specific RNA-seq alignments were split by forward and reverse strands and passed to braker2 as separate BAM files. We then ran braker2 version 2.1.2 (Hoff et al., 2016, 2019) to train augustus (Lomsadze et al., 2014; Stanke et al., 2008) and predict protein-coding genes, incorporating evidence from the RNA-seq alignments and alignment of busco genes with the following parameters “--softmasking --gff3 --prg=gth --gth2traingenes”. After gene prediction, completeness of the gene set was checked with busco using the longest transcript of each gene as the representative transcript.
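Splitting paired-end, strand-specific alignments by strand comes down to reading two bits of the SAM FLAG field: which mate the read is, and whether it aligned to the reverse strand. The sketch below shows that logic only. The library orientation is an assumption, since the text does not state it: `"RF"` denotes a protocol in which read 1 is antisense to the transcript (as in dUTP-based kits) and `"FR"` the opposite. The function name `transcript_strand` is hypothetical.

```python
# SAM FLAG bits used below (from the SAM format specification)
UNMAPPED = 0x4
REVERSE = 0x10
READ1 = 0x40
READ2 = 0x80


def transcript_strand(flag: int, library: str = "RF") -> str:
    """Infer the transcript strand ('+', '-' or '.') from a paired-end SAM flag.

    library="FR": read 1 has the same orientation as the transcript.
    library="RF": read 1 is antisense to the transcript (assumed default here).
    """
    if library not in ("FR", "RF"):
        raise ValueError("library must be 'FR' or 'RF'")
    if flag & UNMAPPED or not flag & (READ1 | READ2):
        return "."  # unmapped, or not flagged as either mate
    is_read1 = bool(flag & READ1)
    is_reverse = bool(flag & REVERSE)
    # Under FR, forward read 1 or reverse read 2 implies a '+' strand transcript.
    plus_under_fr = is_read1 != is_reverse
    plus = plus_under_fr if library == "FR" else not plus_under_fr
    return "+" if plus else "-"
```

Alignments classified as `+` and `-` would then be written to two separate files, one per strand, before being supplied as evidence to the gene predictor.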

2.6 Functional annotation

All the unique transcripts were converted to peptide sequence using cufflinks version 2.2.1 (Trapnell et al., 2010). Sequences were searched against the nonredundant NCBI protein database using blastp version 2.6.0 with an E-value cut-off of ≤1 × 10⁻⁵. blast2go version 5.0 (Conesa et al., 2005) and interproscan version 2.5.0 (Quevillon et al., 2005) were used to assign Gene Ontology (GO) terms. Protein domains were annotated by searching against the InterPro version 32.0 (Hunter et al., 2012) and Pfam version 27.0 (Punta et al., 2012) databases, using interproscan version 2.5.0 (Quevillon et al., 2005) and hmmer version 3.1 (Finn et al., 2011), respectively. The pathways in which the genes might be involved were assigned by protein–protein blast against the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (release 53), with an E-value cut-off of 1 × 10⁻⁵.

2.7 Phylogeny and comparative genomics

Orthologous groups in Aphididae genomes were identified from the predicted protein sequences of WAA and nine other aphid genomes already published (Table S3): Myzus cerasi (Mathers, Mugford, et al., 2020; Thorpe et al., 2018), Myzus persicae (Mathers, Wouters, et al., 2020), Diuraphis noxia (Nicholson et al., 2015), Acyrthosiphon pisum (Mathers, Wouters, et al., 2020), Pentalonia nigronervosa (Mathers, Mugford, et al., 2020), Aphis glycines (Mathers, 2020), Rhopalosiphum maidis (Chen et al., 2019), Rhopalosiphum padi (Thorpe et al., 2018) and Cinara cedri (Julca et al., 2020). As an outgroup, we included the genome of the silverleaf whitefly Bemisia tabaci (Chen et al., 2016). We used the longest transcript to represent the gene model when several transcripts of a gene were annotated. orthofinder version 2.2.3 (Emms & Kelly, 2015, 2019) with diamond version 0.9.14 (Buchfink et al., 2015), Multiple Alignment using Fast Fourier Transform (mafft) version 7.305 (Katoh & Standley, 2013) and fasttree version 2.1.7 (Price, Dehal, & Arkin, 2009, 2010) were used to cluster proteins into orthogroups, reconstruct gene trees and estimate the species tree. The orthofinder species tree was automatically rooted by orthofinder based on informative gene duplications with stride (Emms & Kelly, 2017).
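Selecting the longest transcript per gene is a simple filtering step, sketched here under stated assumptions: protein sequences are held in a dictionary keyed by transcript ID, and a second dictionary maps each transcript to its gene. Both input structures and the function name `longest_isoforms` are hypothetical; ties are broken by transcript ID so that the result is deterministic.

```python
def longest_isoforms(proteins: dict, transcript_to_gene: dict) -> dict:
    """Return {gene ID: transcript ID} keeping the longest protein per gene.

    proteins: transcript ID -> amino acid sequence
    transcript_to_gene: transcript ID -> gene ID
    Ties in length are resolved by choosing the smaller transcript ID.
    """
    best = {}
    for transcript_id, sequence in proteins.items():
        gene_id = transcript_to_gene[transcript_id]
        current = best.get(gene_id)
        if current is None:
            best[gene_id] = transcript_id
            continue
        current_length = len(proteins[current])
        if len(sequence) > current_length or (
            len(sequence) == current_length and transcript_id < current
        ):
            best[gene_id] = transcript_id
    return best
```

The selected transcript IDs can then be used to write one representative protein per gene to the FASTA file passed to the orthology clustering step.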

GO analysis of lineage-specific gene families and genes that have undergone lineage-specific duplication was carried out using the bioconductor (Gentleman et al., 2004) package topGO (Alexa & Rahnenführer, 2009). We used a Fisher exact test to identify overrepresented GO terms.
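For a single GO term, a one-sided Fisher exact test for overrepresentation is equivalent to the upper tail of a hypergeometric distribution. The following sketch computes that tail directly; it illustrates the test itself and does not reproduce topGO, whose algorithms can additionally account for the GO graph structure. The function and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided P value for overrepresentation of a GO term.

    N: number of annotated genes in the background
    K: background genes carrying the term
    n: genes in the test set (e.g. lineage-specific duplicates)
    k: test-set genes carrying the term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

When many GO terms are tested, the resulting P values would normally be interpreted with multiple testing in mind.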

2.8 Synteny analysis

Syntenic blocks of genes were identified between the chromosome-level genome assemblies of WAA, M. persicae, A. pisum and R. maidis (see Table S3 for details of assembly and annotation versions used) using mcscanx version 1.1 (Wang et al., 2012). For each comparison, we carried out an all versus all blast search of annotated protein sequences using blastall version 2.2.22 (Altschul et al., 1990) with the options “-p blastp -e 1e-10 -b 5 -v 5 -m 8” and ran mcscanx with the parameters “-s 5 -b 2”, requiring synteny blocks to contain at least five consecutive genes and to have a gap of no more than 25 genes. mcscanx results were visualized with synvisio (https://synvisio.github.io/#/).
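The two block criteria given above (at least five collinear genes, gaps of no more than 25 genes) can be made concrete with a deliberately simplified sketch. This is not the mcscanx algorithm, which scores chains by dynamic programming; it is a greedy illustration that handles only same-orientation blocks on one chromosome pair. Anchors are assumed to be pairs of gene indices (position in genome A, position in genome B), and the function name `chain_anchors` is hypothetical.

```python
def chain_anchors(anchors, min_size: int = 5, max_gap: int = 25):
    """Greedily group homologous gene pairs into collinear blocks.

    anchors: iterable of (index_a, index_b) gene positions on one chromosome pair
    min_size: minimum number of anchors for a block to be reported
    max_gap: maximum number of intervening genes allowed between
             neighbouring anchors, in either genome
    """
    blocks, current = [], []
    for a, b in sorted(anchors):
        if current:
            prev_a, prev_b = current[-1]
            collinear = a > prev_a and b > prev_b
            close = (a - prev_a - 1) <= max_gap and (b - prev_b - 1) <= max_gap
            if not (collinear and close):
                # Close the current chain and start a new one at this anchor
                if len(current) >= min_size:
                    blocks.append(current)
                current = []
        current.append((a, b))
    if len(current) >= min_size:
        blocks.append(current)
    return blocks
```

Inverted blocks could be handled in the same way by negating the genome B indices before chaining.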

3 RESULTS AND DISCUSSION

3.1 Genome sequencing and assembly

We generated a high-quality chromosome-level genome assembly of WAA using a combination of 10X Genomics linked-reads and in vivo Hi-C data (Figure 1a). In total, we generated 54.72 Gb of 10X Genomics linked-reads and 71.95 Gb of Hi-C reads, corresponding to 167× and 220× coverage of the WAA genome, respectively. Initial de novo assembly of the 10X Genomics linked-reads with supernova produced a contiguous assembly totalling 330 Mb (Table 1; scaffold N50 = 4.16 Mb). We further improved the supernova assembly by iterative scaffolding and misjoin correction with scaff10x (two rounds) and tigmint (one round), increasing the scaffold N50 of the assembly to 4.22 Mb and reducing the number of scaffolds from 8,967 to 7,929, with the longest scaffold spanning 12.58 Mb (Table 1). To generate chromosome length super-scaffolds, we scaffolded the 10X assembly using in vivo Hi-C data. After manual curation, the final assembly comprised 327 Mb, with 91% of the assembly anchored into six chromosomes (Figure 1a), consistent with the WAA 2n karyotype being 12 chromosomes (Gautam & Verma, 1982, 1983; Kulkarni, 1984; Robinson & Chen, 1969). The lengths of the six chromosomes ranged from 29.68 to 71.23 Mb.

Figure 1. Chromosome-level genome assembly of the woolly apple aphid (WAA). (a) Heatmap showing the frequency of Hi-C contacts along the WAA genome assembly. Blue lines indicate super scaffolds and green lines show contigs. Genome scaffolds are ordered from longest to shortest with the x- and y-axis showing cumulative length in millions of base pairs (Mb). (b) KAT k-mer plots comparing k-mer content of 10X Genomics raw reads (barcodes removed) with Hi-C assembly. The black area of the graphs represents the distribution of k-mers present in the reads but not in the assembly and the red area represents the distribution of k-mers present in the reads and once in the assembly. Other colours show k-mers found multiple times in the genome assembly. (c) Genome assembly completeness assessed by the recovery of universal single-copy genes (BUSCOs) using the Arthropoda gene set (n = 1,066). busco assessment result for Eriosoma lanigerum (in bold) compared with aphid genomes available from the National Center for Biotechnology Information (NCBI). The species are coloured by aphid tribe (see Figure 2).
Table 1. Genome summary statistics for each step of the assembly of WAA

| Assembly | SP | SP + SC + TG | SP + SC + TG + HC |
|---|---|---|---|
| Base pairs (Mb) | 330 | 330 | 327 |
| Number of contigs | 12,566 | 12,703 | 12,065 |
| Contig N50 (Mb) | 0.158 | 0.158 | 0.158 |
| Number of scaffolds | 8,967 | 7,929 | 7,146 |
| Scaffold N50 (Mb) | 4.164 | 4.222 | 62.861 |
| Longest scaffold (Mb) | 16.081 | 12.584 | 71.231 |
| Percentage of assembly in chromosome length scaffolds | 0 | 0 | 91 |

Abbreviations: HC, Hi-C; SC, scaff10x; SP, supernova; TG, tigmint.
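The contig and scaffold N50 values reported in Table 1 follow the standard definition: the length of the sequence at which the cumulative total, counted from the longest sequence downwards, first reaches half of the assembly size. A minimal sketch of that calculation is given below; the function name `n50` is hypothetical and the input is assumed to be a plain list of sequence lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")
```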

The WAA genome assembly is accurate, complete and free from contamination. Our k-mer analysis comparing genomic content of the 10X reads (after barcode removal) with the WAA genome assembly reveals little missing single-copy genome content and very low levels of duplicated content caused by the assembly of haplotigs (Figure 1b). Furthermore, our 327-Mb WAA genome assembly is close to the genome size estimate based on k-mer analysis with kat (363 Mb) (Figure 1b; Table S4) and the genome size of Eriosoma americanum (330 Mb) (Finston et al., 1995). This analysis is further supported by high representation of conserved arthropod genes (n = 1,066) in the assembly, with 97% (n = 1,032) found as complete single copies. Indeed, the WAA assembly contains the highest number of conserved single-copy Arthropoda genes of any published aphid genome (Figure 1c). A taxon-annotated GC content-coverage plot (known as a “BlobPlot”; Kumar et al., 2013) revealed the co-assembly of the obligate aphid bacterial symbiont Buchnera aphidicola (Baumann et al., 1995; Douglas, 1998; Hansen & Moran, 2011; Shigenobu & Wilson, 2011) and a secondary symbiont, Serratia symbiotica (Burke & Moran, 2011; Manzano-Marín & Latorre, 2016; Moran et al., 2005) (Figure S1). These bacterial scaffolds were filtered from the final assembly, along with scaffolds showing atypical GC content and read coverage, leaving the final assembly free from obvious contamination (Figure S2). The B. aphidicola genome was assembled into 99 scaffolds 444 kb in length (Table S5). The S. symbiotica genome was fragmented and incomplete (165 kb total length, 30 scaffolds, N50 = 9 kb).

3.2 Genome annotation

We generated 44 Gb of strand-specific RNA-seq data to aid genome annotation. A total of 28,186 protein-coding genes (28,297 transcripts) were predicted in the WAA genome assembly, of which 83.8% (23,627 genes) were located on the six chromosomes (Table S6). Mapping of our RNA-seq data to the annotation revealed a relatively low number of expressed genes: 8,220 (29.1%) gene models were supported by an estimated count of at least 10 reads and 4,576 (16.2%) had estimated expression of at least 1 transcript per million (TPM) (Table S7). This is probably due to degradation of our RNA samples. Nonetheless, our annotation contains 97.3% of the busco Arthropoda gene set as complete copies (95.6% complete and single copy; Figure S3) and 18,835 genes (66.6%) have an orthologue in at least one other sequenced aphid species (Figure 2). Additionally, 55.7% (15,477) of the predicted transcripts were functionally annotated with at least one GO term and/or protein domain. Taken together, these analyses indicate that our gene set is complete and accurate. In the future, WAA gene models will be further refined with additional RNA-seq data sets and community-led manual curation.
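The TPM threshold used above refers to a length-normalised expression measure. As a reminder of how TPM is defined, the sketch below converts read counts to TPM using transcript lengths. It is an illustration only: the text does not state which quantification tool was used, such tools typically use effective rather than raw lengths, and the function name `tpm` and its inputs are hypothetical.

```python
def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Convert read counts to transcripts per million (TPM).

    counts: transcript ID -> estimated read count
    lengths_bp: transcript ID -> transcript length in base pairs
    Each count is divided by length, then rescaled so all values sum to 1e6.
    """
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}
```

Because TPM values always sum to one million within a sample, a threshold of 1 TPM identifies transcripts contributing at least one millionth of the length-normalised signal.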

Figure 2. Maximum likelihood phylogeny of Eriosoma lanigerum and nine other aphid species based on a concatenated alignment of 3,079 conserved one-to-one orthologues. The tree is rooted with the whitefly Bemisia tabaci MEAM1 (not shown). Clades are coloured by aphid tribe. Branch lengths are in amino acid substitutions per site. The bar chart shows the gene count for each species with orthology relationships among aphids.

3.3 Phylogeny and comparative genomics

WAA is the first member of the aphid subfamily Eriosomatinae to have its genome sequenced. To place this new genome assembly in a phylogenetic context and to investigate gene family evolution across aphids, we compared the WAA proteome (the complete set of annotated protein-coding genes) to the proteomes of nine other aphid species that have fully sequenced genomes (Table S3) and to the whitefly, Bemisia tabaci MEAM1 (Chen et al., 2016). In total, we clustered 240,702 proteins into 23,294 orthogroups (gene families) and 26,969 singleton genes (Table S8). Maximum likelihood phylogenetic analysis based on a concatenated alignment of 3,079 conserved single-copy genes produced a fully resolved species tree with 100% support at all nodes (Figure 2). Eriosomatini (represented by WAA) is recovered as a sister group to Aphidini + Macrosiphini (Aphidinae), with Lachnini (represented by C. cedri) placed as an outgroup to all other sequenced aphid species (Figure 2).

The number of predicted genes in WAA (28,186) is within the range of other aphid genomes (16,992–31,001) but among the highest (Figure 2). However, predicted gene numbers can vary depending on both the quality of the genome assembly and the different pipelines used for predicting genes (Denton et al., 2014; Yandell & Ence, 2012). Of the 28,186 predicted genes in the WAA genome, 18,738 (66%) have an orthologue in at least one other aphid species and 13,278 (47%) are conserved in the majority (at least 9/10) of aphid species (Figure 2). The high number of genes with orthologues in other aphid species despite the absence of close WAA relatives in our analysis probably reflects the early emergence of many aphid gene families (Julca et al., 2020).

Aphid genomes are also subject to high levels of ongoing gene duplication (Fernández et al., 2020; International Aphid Genomics Consortium, 2010; Julca et al., 2020; Mathers et al., 2017; Thorpe et al., 2018). WAA is no exception and we detect a large number of lineage-specific gene families (689 orthogroups corresponding to 3,954 genes; Figure 2) and identify 9,936 genes that have undergone lineage-specific duplication. These genes are enriched for a diverse set of functions including sensory perception and metabolic process (Tables S9 and S10). As additional Eriosomatini genomes become available, the diversification of these genes and gene families will be investigated in greater detail.

3.4 X chromosome fragmentation in WAA

It has previously been shown that the autosomes of aphids within the tribes Macrosiphini and Aphidini have undergone extensive rearrangement over the last ~30 million years while the aphid sex (X) chromosome (which is haploid in males) has been conserved (Li et al., 2020; Mathers, Wouters, et al., 2020). Given that we now have a chromosome-level genome assembly of an aphid from a third, more divergent tribe, we used our WAA assembly to investigate the evolution of aphid genome structure. We identified syntenic blocks of genes between our WAA assembly and the genomes of Myzus persicae (Figure 3a) and Acyrthosiphon pisum (Figure 3b) from Macrosiphini (Mathers, Wouters, et al., 2020), and Rhopalosiphum maidis (Figure 3c) from Aphidini (Chen et al., 2019). All comparisons reveal high levels of genome rearrangement between WAA chromosomes 1–4 (EL1–EL4) and the autosomes of M. persicae, A. pisum and R. maidis. Surprisingly, however, we find that two WAA chromosomes (EL5 and EL6) map to the conserved Macrosiphini and Aphidini X chromosome, suggesting either fragmentation of the X chromosome in WAA or that the large Aphidinae (Macrosiphini + Aphidini) X chromosome was the result of an ancient fusion event. Additional chromosome-level assemblies of diverse aphid species will be required to test these two competing hypotheses. However, the lack of rearrangements between either EL5 or EL6 and the autosomes suggests that recent X chromosome fragmentation in WAA is more likely. Additionally, it remains to be determined if either EL5 or EL6 (or both) behave as an X chromosome in WAA (i.e., they are found in a haploid state in males). Although populations such as the UK one used here for genome sequencing are thought to be entirely anholocyclic (asexual) on apple (Dumbleton & Jeffreys, 1938; Eastop, 1966), sexual males and females have been observed (Blackman & Eastop, 2000), suggesting the presence of a functional X chromosome. In the future, whole genome resequencing of these rare males may be used to confirm the identity of the WAA X chromosome based on patterns of sequence coverage, as has been carried out for M. persicae and A. pisum (Li et al., 2019; Mathers, Wouters, et al., 2020).
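The coverage-based test proposed above rests on simple arithmetic: a chromosome that is haploid in males and diploid in females should show roughly half the normalised sequencing depth in males. The sketch below illustrates that reasoning. It is not an analysis performed in this study; the function names, the input format (mean depth per chromosome for one male and one female sample) and the 0.75 threshold are all assumptions chosen for illustration.

```python
from statistics import median


def male_to_female_ratio(male_cov: dict, female_cov: dict) -> dict:
    """Per-chromosome male:female depth ratio, normalised to the median ratio.

    Normalising by the median corrects for differences in sequencing effort
    and assumes that most chromosomes are autosomes.
    """
    raw = {chrom: male_cov[chrom] / female_cov[chrom] for chrom in male_cov}
    centre = median(raw.values())
    return {chrom: ratio / centre for chrom, ratio in raw.items()}


def candidate_x(ratios: dict, threshold: float = 0.75) -> list:
    """Return chromosomes whose normalised ratio falls below the threshold.

    Autosomes are expected near 1.0 and an X chromosome near 0.5; the
    threshold of 0.75 is an arbitrary midpoint, not a published value.
    """
    return sorted(chrom for chrom, ratio in ratios.items() if ratio < threshold)
```

Under this scheme, if both EL5 and EL6 were haploid in males, both would be expected to fall below the threshold; if only one were, the other would group with EL1–EL4.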

Figure 3. Genome reorganization across aphids and fragmentation of the sex (X) chromosome in woolly apple aphid (WAA). Pairwise synteny relationships are shown between WAA and the chromosome-scale genome assemblies of Myzus persicae (a), Acyrthosiphon pisum (b) and Rhopalosiphum maidis (c) (see Figure 2 for phylogenetic relationships between compared species). Links indicate the boundaries of syntenic blocks of genes identified by mcscanx and are colour coded by WAA chromosome ID. Black arrows along chromosomes indicate reverse complement orientation relative to WAA.

4 CONCLUSIONS

WAA is a widespread pest of apple trees that is particularly critical to the economics of the apple industry in most parts of the world. This study provides the first chromosome-level genome of WAA and the genome sequencing of the first representative of the whole subfamily Eriosomatinae. The WAA genome will be useful as a reference for investigation of genetic differences among wild WAA populations. The high quality of the genome will also allow molecular marker development and detection using large-scale genome resequencing. Population resequencing data can be used to investigate genomic regions and genes that show genetic variability and to analyse demographic history events in WAA populations.

Finally, as the only genome available for the subfamily Eriosomatinae and as an additional outgroup to other sequenced aphids from the subfamily Aphidinae, the WAA genome will allow more extensive comparative genomics analysis of aphids.

ACKNOWLEDGEMENTS

T.C.M. is funded by a BBSRC Future Leader Fellowship (BB/R01227X/1). Additional support was received from the BBSRC Institute Strategy Programme (BB/P012574/1) and the John Innes Foundation. C.J.G. acknowledges receipt of a studentship funded by the BBSRC, AHDB and an industry consortium (http://www.ctp-fcr.org/industry-partners/).

    AUTHOR CONTRIBUTIONS

    G.P., S.A.H. and T.C.M. conceived the study. G.P., F.F.F. and C.J.G. provided samples. R.B. and S.T.M. extracted DNA and RNA. R.B., A.S. and T.C.M. performed the genome assembly, gene model prediction, gene annotation and comparative analyses. R.B., C.J.G., G.P., S.A.H. and T.C.M. wrote the manuscript with input from all authors. All authors reviewed the manuscript.

    DATA AVAILABILITY STATEMENT

    The genome raw reads, the RNA sequencing data and the genome assembly are available at the National Center for Biotechnology Information (NCBI) with the BioProject no. PRJNA623270. The genome assembly and annotation and orthogroup clustering results are available for download from Zenodo (https://doi.org/10.5281/zenodo.3797131). The WAA genome assembly and annotation are also available from AphidBase (https://bipaa.genouest.org/sp/eriosoma_lanigerum/).
