Volume 21, Issue 3 pp. 834-848
RESOURCE ARTICLE

Chromosome-level genome of the peach fruit moth Carposina sasakii (Lepidoptera: Carposinidae) provides a resource for evolutionary studies on moths

Li-Jun Cao

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Wei Song

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Lei Yue

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Shao-Kun Guo

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Jin-Cui Chen

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Ya-Jun Gong

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Ary Anthony Hoffmann

School of BioSciences, Bio21 Institute, University of Melbourne, Parkville, Vic, Australia

Shu-Jun Wei

Corresponding Author

Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China

Correspondence

Shu-Jun Wei, Institute of Plant and Environmental Protection, Beijing Academy of Agriculture and Forestry Sciences, 9 Shuguanghuayuan Middle Road, Haidian District, Beijing 100097, China.

Email: [email protected]

First published: 23 October 2020

Abstract

The peach fruit moth (PFM), Carposina sasakii Matsumura, is a major phytophagous orchard pest widely distributed across Northeast Asia. Here, we report the chromosome-level genome for the PFM, representing the first genome for the family Carposinidae, from the lepidopteran superfamily Copromorphoidea. The genome was assembled into 404.83 Mb of sequence using PacBio long reads and Illumina short reads, comprising 275 contigs with a contig N50 length of 2.62 Mb. All contigs were assembled into 31 linkage groups with the assistance of the Hi-C technique, including 30 autosomes and a Z chromosome. BUSCO analysis showed that 98.3% of genes were complete and 0.4% of genes were fragmented, while 1.3% of genes were missing in the assembled genome. In total, 21,697 protein-coding genes were predicted, of which 84.80% were functionally annotated. Because of the importance of diapause triggered by photoperiod in PFM, five circadian genes in the PFM as well as in other related species were annotated, and potential genes related to diapause and photoperiodic reaction were also identified from transcriptome sequencing. In addition, manual annotation of detoxification gene families showed a higher number of glutathione S-transferase (GST) genes in PFM than in most other lepidopterans, in contrast to lower numbers of uridine diphosphate (UDP)-glycosyltransferase (UGT), carboxyl/cholinesterase (CCE) and cytochrome P450 monooxygenase (P450) genes, suggesting different detoxification pathways in this moth. The high-quality genome provides a resource for comparative evolutionary studies of this moth and its relatives within the context of radiations across Lepidoptera.

1 INTRODUCTION

The peach fruit moth (PFM), Carposina sasakii Matsumura (Lepidoptera: Carposinidae, superfamily Copromorphoidea), is a major phytophagous orchard pest of fruit such as apple, pear, peach, apricot and jujube from the families of Rosaceae and Rhamnaceae (Figure 1). The hatched larvae directly bore into fruit to feed, causing losses in fruit production. PFM is one of the most severe borers on deciduous fruit in northeast Asia. It is also considered a potential risk to fruit production in most parts of the world, although PFM is currently restricted to northeast Asia and far east Russia (Kwon et al., 2018; Wang et al., 2017).

FIGURE 1. Eggs (a), larva (b), cocoons (c) and adult (d) of the peach fruit moth Carposina sasakii (a–d) and the damage symptoms to apple (e–g). The hatched larva bores into apple usually near the calyx with white secreta near the boring hole (e); the damaged apple showing shrinkage (f); damage from larvae boring and developing in the apple (g) [Colour figure can be viewed at wileyonlinelibrary.com]

One possible reason for the currently restricted distribution of PFM is its sensitivity to environmental factors. PFM has evolved diapause to cope with cold winter conditions and to synchronize its phenology with host plants (Toshima et al., 1961). Both long-day and short-day photoperiods induce diapause in the last instar of PFM larvae, resulting in a diapausing cocoon (Hua et al., 1998; Huang et al., 1976). The life cycle of PFM can be univoltine or bivoltine, depending on the photoperiods encountered and on environmental factors such as humidity (Chiba & Kobayashi, 1985; Kim et al., 2000; Sato & Ishitani, 1976). Temperature also affects the occurrence of PFM through effects on the development rate and the emergence of the overwintering generation from diapause (Kim et al., 2001; Zhang et al., 2016).

The effects of environmental factors as well as photoperiod on the life history of PFM provide an opportunity to investigate the genomic basis of adaptation to temperate environments in the Copromorphoidea superfamily and across Lepidoptera more generally. Candidate genes involved in climatic adaptation could then also be investigated at the geographic level, given that a combination of mtDNA and microsatellite variation indicates strong genetic differentiation among geographical populations of the PFM across its native range in China (Wang et al., 2017). Although adaptation to climate change has been found in lepidopteran species (van Asch et al., 2013; Fält-Nardman et al., 2016; Quezada García et al., 2015; Yamanaka et al., 2008), genes underlying climate adaptation remain to be identified (Hoffmann, 2017). Analysis of genomic variation among populations may help to identify putative loci and processes under selection and to predict evolutionary adaptation under climate change (Hoffmann & Sgro, 2011).

The highly variable life history of PFM on different host plants may reflect different host-associated biotypes, as supported by an analysis of esterase isozyme patterns (Hua & Hua, 1995) and random amplified polymorphic DNA (RAPD; Xu & Hua, 2004), although this has not yet been confirmed by direct studies on population differentiation in PFM (Kwon et al., 2017; Wang et al., 2015). Previous studies have shown that detoxification and chemosensory genes are associated with host adaptation of insects (Heckel, 2018; Rane et al., 2019; Wan et al., 2019). Understanding the genomic features of PFM may provide useful starting points to investigate host plant adaptation and reveal the genetic basis of the potential invasiveness of PFM as well as other fruit-boring pests (Kirk et al., 2013).

In addition to genetic variation, insects can adapt to variable environments through plasticity, which often involves changes in the expression of related genes (Sgro et al., 2016). Transcriptome analysis provides an efficient way to identify differentially expressed genes. Although transcripts can be assembled de novo, a reference genome-guided analysis may make the analysis more precise.

Well-assembled genomes are increasingly becoming available as resources for tracing evolutionary adaptation across the Lepidoptera. Already there are substantial genomic resources for many moths (Chen et al., 2019; Cheng et al., 2017; Kanost et al., 2016; Lange et al., 2018; Ma et al., 2020; Pearce et al., 2017; Wan et al., 2019; Xia et al., 2004; Xiang et al., 2018; Xiao et al., 2020; You et al., 2013; Zhang et al., 2020) and butterflies (Ahola et al., 2014; Cong et al., 2015; Dasmahapatra et al., 2012; Lu et al., 2019; Nishikawa et al., 2015; Zhan et al., 2011), which are being used in comparative analyses to link genomic changes to phenotypes such as the detoxification of compounds encountered in hosts (Rane et al., 2019). Available genomes provide abundant reference points for investigating evolution across the Lepidoptera, although most species sequenced so far are from the Papilionoidea, Noctuoidea, Bombycoidea and Pyraloidea, with less genomic information available for the Carposinidae (Copromorphoidea) despite the importance of this group as agricultural pests.

In the present study, we report a chromosome-level genome of PFM, which was assembled de novo from sequences obtained on the PacBio and Illumina platforms and scaffolded to the chromosome level with the Hi-C technique. We compare features of the PFM genome with those of eleven other lepidopterans, focusing particularly on detoxification and chemosensory gene families, which are important in host adaptation and pesticide resistance and contribute to the ecological niches occupied by species (Rane et al., 2019). As an initial study using the newly assembled genome, we investigate transcriptomic changes induced by the long-day and short-day photoperiods that induce diapause in PFM larvae, and we identify genes involved in these responses, which are critical to climatic adaptation by PFM.

2 MATERIALS AND METHODS

2.1 Sample collection and rearing

We established a laboratory strain of PFM from 30 larvae collected from an apple orchard in the Beijing area of China in July 2018. This strain was maintained for five generations on apple (Malus pumila Mill) in the laboratory under 25 ± 1°C, a relative humidity of 75 ± 5%, and a photoperiod of 15L: 9D (ND, normal-day condition). Eggs were laid on filter paper and moved to ripe apples before hatching. Larvae developed in the apples and the last (fifth) instar larvae left the fruit to pupate on prepared sawdust. Samples used in genome sequencing and RNA-seq were from this strain.

To induce diapause in larvae, we moved batches of newly hatched larvae to long-day and short-day photoperiodic conditions before they bored into apple for feeding. The long-day condition (LD) was set to a photoperiod of 22 L: 2 D, while the short-day condition (SD) was set to a photoperiod of 8 L: 16 D (Hua et al., 1998; Huang et al., 1976). Both treatments were conducted at 25 ± 1°C and a relative humidity of 75 ± 5%. The last instar larvae leaving the fruit were collected and stored in RNAlater (Sigma-Aldrich, St. Louis, USA) at −80°C for subsequent RNA-seq library construction.

2.2 Genome sequencing

We extracted genomic DNA from 12 male pupae using the MagAttract HMW DNA kit (Qiagen, Hilden, Germany) for the Illumina and PacBio libraries. The paired-end Illumina library, with insert sizes of about 500 bp, was constructed using the VAHTS Universal DNA Library Prep Kit for Illumina V2 (Vazyme, Nanjing, China) and sequenced on an Illumina Novaseq platform to obtain 150 bp paired-end reads. The raw reads were filtered with Trimmomatic v0.38 (Bolger et al., 2014). After filtering, we obtained 31.02 Gb of clean short reads (coverage: 77.24X). These data were used to estimate the genome size, heterozygosity and rate of duplication, and to polish the de novo assemblies.

For long-read sequencing, SMRTbell libraries were constructed with Sequel Sequencing Kit 3.0 (Pacific Biosciences, Menlo Park, CA, USA). Long DNA fragments of approximately 20 kb were sequenced on a PacBio Sequel sequencer (Pacific Biosciences, Menlo Park, CA, USA). Four SMRT cells were processed and 55.52 Gb subreads (mean subread length: 18.13 kb, subread N50 length: 32.84 kb, coverage: 138.2X) were obtained for contig-level genome assembly.

To assist the chromosome-level assembly, we used the Hi-C (high-throughput chromosome conformation capture) technique to capture genome-wide chromatin interactions (Belaghzal et al., 2017). Twenty fifth-instar larvae were ground in 2% formaldehyde to cross-link cellular proteins. Chromatin was digested with the restriction enzyme MboI overnight. The DNA ends were then blunted, marked with biotin-14-dCTP and ligated with a bridge linker. The samples were digested with proteinase K and purified by phenol-chloroform extraction. Biotin on the ends of unligated DNA fragments was removed with T4 DNA polymerase. Fragments were sheared to 200–600 base pairs using an S220 Focused-ultrasonicator (Covaris, USA). Biotin-marked DNA fragments were enriched using streptavidin C1 magnetic beads. An Illumina library was constructed from the enriched fragments using the VAHTS Universal DNA Library Prep Kit for Illumina V2 (Vazyme, Nanjing, China) and sequenced on an Illumina Novaseq platform to obtain 150 bp paired-end reads. After removing low-quality reads, 1,509 million clean reads were retained (coverage: 559.3X).

2.3 Genome survey

We used the k-mer method to survey the genome features of the PFM. The k-mer count histogram was obtained from Illumina paired-end sequencing data using Jellyfish v2.99 (Marçais & Kingsford, 2011) with 17, 21, 25 and 35 mers. Genome size, heterozygosity and rate of duplication were estimated by GenomeScope version 1.0 (Vurture et al., 2017).
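As an illustration of what a k-mer count histogram contains, the following minimal Python sketch counts k-mers in a set of reads and tabulates how many distinct k-mers occur at each depth. It is not the Jellyfish implementation: the function name and toy reads are hypothetical, and, unlike typical survey runs, the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return a mapping: depth -> number of distinct k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    # Collapse per-k-mer counts into the histogram used for genome surveys
    return Counter(counts.values())


# Toy reads (hypothetical); real input would be the Illumina paired-end reads
toy_reads = ["ACGTACGT", "CGTACGTA"]
hist = kmer_histogram(toy_reads, k=4)
```

In a real survey, a histogram of this kind (depth against number of distinct k-mers) is the input from which genome size and heterozygosity are modelled.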

2.4 Genome assembly and evaluation

Long reads generated from PacBio sequencing were corrected and assembled using canu version 1.8 (Koren et al., 2017) with default parameters. The initial assembly was polished three times using Pilon version 1.22 (Walker et al., 2014) with short reads from Illumina paired-end sequencing. Because of the high degree of heterozygosity, the two haplotypes in some parts of the genome might be assembled as separate primary contigs (Roach et al., 2018). To correct these possible allelic contigs, we processed the polished assembly with the Purge Haplotigs pipeline, which identifies pairs of syntenic contigs and removes one of them (Roach et al., 2018), resulting in a contig-level genome.

Clean reads sequenced from the Hi-C library were aligned to the contig-level genome with an end-to-end algorithm implemented in Bowtie 2 version 2.3.5 (Langmead & Salzberg, 2012) according to the HiC-Pro strategy (Servant et al., 2015). The Juicer version 1.5 and 3D de novo assembly (3D-DNA) pipelines were used to assemble the contigs into a chromosome-level genome (Dudchenko et al., 2017; Durand et al., 2016). The completeness of the genome was evaluated through the analysis of single-copy orthologues implemented in Benchmarking Universal Single-Copy Orthologues (busco) version 3.0.2 (Simao et al., 2015), based on the insecta_odb9 database (1,658 genes). Synteny between the PFM genome and the genomes of Cydia pomonella (Lepidoptera: Tortricidae) (Assembly accession: GCA_003425675.2; Wan et al., 2019) and Spodoptera litura (Assembly accession: GCF_002706865.1; Cheng et al., 2017) was analysed using TBtools version 0.58 (Chen et al., 2020).

2.5 Transcriptome sequencing and assembly

To provide transcript evidence for structural annotation of the genome, we conducted RNA-seq for four developmental stages (egg, larva, pupa and adult, with males and females sequenced separately) reared under normal conditions as described above. To identify genes differentially expressed between normal (ND) and diapausing larvae, we constructed another two RNA-seq libraries for fifth instar larvae reared under the long-day (LD) and short-day (SD) conditions. In total, seven RNA-seq libraries were constructed: one for eggs, three for larvae, one for pupae, one for male adults and one for female adults of PFM. One individual from each stage or treatment was used for RNA-seq (except for eggs, for which about 100 eggs were used), and one library was constructed for each stage or treatment. All libraries were prepared using the VAHTS mRNA-seq V2 Library Prep Kit for Illumina according to the manufacturer's instructions (Vazyme, Nanjing, China) and sequenced on an Illumina Novaseq platform to obtain 150 bp paired-end reads. After removing low-quality reads with Trimmomatic version 0.38 (Bolger et al., 2014), the reads were mapped to the chromosome-level genome using Hisat version 2.2.0 (Kim et al., 2019) and assembled with StringTie version 2.1.2 (Pertea et al., 2015). Fragments per kilobase per million (FPKM) values of each annotated gene in each RNA-seq library were estimated with cufflinks version 2.2.1 (Kim et al., 2013).

We analysed differentially expressed genes (DEGs) among fifth instar larvae reared under different photoperiods (SD, ND and LD). ND was used as the control, and SD and LD were each compared to ND. Because biological replicates were lacking, we assessed DEGs with cufflinks version 2.2.1 (Kim et al., 2013), which can analyse samples with a single replicate. Because we analysed DEGs in fifth instar larvae, the last developmental stage before diapause, we assumed that the fold change of DEGs related to diapause should be large. Thus, we used a fold change ≥ 2 and a q-value ≤ 0.05 as the cutoffs for significant DEGs between samples. Among the significant DEGs, genes that were up- or downregulated in both comparisons (ND vs. LD and ND vs. SD) were considered to be related to diapause, while genes with FPKM values specifically high in the SD or LD condition were considered to be light-induced genes. Gene expression of DEGs was visualized with the Pheatmap R package.
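The cutoffs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cufflinks workflow itself: the gene names and values are hypothetical, the input is assumed to be a log2 fold change and a q-value per gene for each comparison, and "in both comparisons" is interpreted here as a significant change in the same direction in both.

```python
import math


def significant(log2_fc, q_value, min_fold=2.0, max_q=0.05):
    """True when the absolute fold change is >= min_fold and q-value <= max_q."""
    return abs(log2_fc) >= math.log2(min_fold) and q_value <= max_q


def diapause_candidates(nd_vs_ld, nd_vs_sd):
    """Genes significant in both comparisons and changed in the same direction."""
    shared = []
    for gene, (fc_ld, q_ld) in nd_vs_ld.items():
        if gene not in nd_vs_sd:
            continue
        fc_sd, q_sd = nd_vs_sd[gene]
        same_direction = (fc_ld > 0) == (fc_sd > 0)  # assumption: direction must agree
        if significant(fc_ld, q_ld) and significant(fc_sd, q_sd) and same_direction:
            shared.append(gene)
    return sorted(shared)


# Hypothetical toy data: gene -> (log2 fold change, q-value)
nd_vs_ld = {"geneA": (2.3, 0.001), "geneB": (1.4, 0.20), "geneC": (-1.8, 0.01)}
nd_vs_sd = {"geneA": (1.1, 0.030), "geneB": (3.0, 0.001), "geneC": (1.5, 0.02)}
candidates = diapause_candidates(nd_vs_ld, nd_vs_sd)
```

In this toy example only geneA passes: geneB fails the q-value cutoff in one comparison, and geneC changes in opposite directions.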

2.6 Repeat element and non-coding RNA annotation

Repeats and transposable element families in the PFM genome were first detected by RepeatMasker version 4.0.7 pipeline (Tarailo-Graovac & Chen, 2009) against the Insecta repeats within RepBase Update (http://www.girinst.org) and Dfam database (20170127), with RMBlast version 2.10.0 as a search engine. The noncoding RNAs (ncRNA) were annotated by aligning the genomic sequence against rfam version 14.2 (http://rfam.xfam.org/) with blastn. The tRNAs and rRNAs were predicted by tRNAscan-SE and RNAmmer (Lagesen et al., 2007; Lowe & Eddy, 1997).

2.7 Protein-coding gene annotation and filtering

We annotated protein-coding genes using ab initio, RNA-seq-based, and homologue-based methods in the maker version 2.31.10 genome annotation pipeline (Cantarel et al., 2008). Augustus version 3.2.3 (Stanke & Waack, 2003) and snap version 2013-02-16 (Korf, 2004) were used for the ab initio gene prediction. For Augustus, we used the retrained parameters obtained in the above busco analysis of genome assembly by invoking the Augustus retraining option. In the first round of annotation, we ran maker by providing transcriptome assemblies of PFM, protein sequences from eight lepidopteran species (Bombyx mori, Trichoplusia ni, Ostrinia furnacalis, Bombyx mandarina, Galleria mellonella, Spodoptera litura, Helicoverpa armigera, Plutella xylostella) and the Augustus model as evidence. The GFF3 file of first round annotation was used to train parameters of SNAP. In the next three rounds of annotation, GFF3 from the last round, Augustus and snap models were used as evidence.

The annotation results from the maker pipeline were filtered using gene expression evidence, functional annotation results and annotation edit distance (AED) values. Genes that had an FPKM value greater than 0 in any RNA-seq library were considered real genes and retained for further analysis. Functional domains of proteins were identified using InterproScan 5.34–74.0 (Jones et al., 2014) against Pfam database version 32.0 (El-Gebali et al., 2019). The gene models were filtered based on domain content and evidence support following Campbell et al. (2014). Finally, annotations with AED > 0.75 were removed, because higher AED values indicate weaker agreement between a gene model and its supporting evidence (Campbell et al., 2014).

Functions of the protein-coding genes and their gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) items were annotated using the software eggNOG-Mapper version 1.0.3 (Huerta-Cepas et al., 2017), a tool for fast functional annotation of novel sequences using precomputed eggNOG-based orthology assignments, against the database EggNOG version 5.0 (Huerta-Cepas et al., 2019).

2.8 Orthology identification and phylogenetic inference

Protein-coding genes from another 11 species of Lepidoptera as well as two species of Coleoptera and two species of Diptera were obtained from the NCBI genomes database for comparative analysis (Table 1). Orthologues were identified using OrthoFinder version 2.2.7 (Emms & Kelly, 2015) under default parameters. The phylogenetic tree was inferred in the OrthoFinder pipeline with an approximately-maximum-likelihood method implemented in FastTree version 2.1.10 (Price et al., 2009) based on a concatenated multiple sequence alignment (MSA) of single-copy genes. The most likely category for each site was set using a Bayesian approach with a gamma prior. Amino acid sequences were aligned in mafft version 7.450 (Katoh & Standley, 2013) with the G-INS-I algorithm.

TABLE 1. Features of chromosome-level genomes in the Lepidoptera
Features | Csas | Cpom | Tni | Bmor | Slit | Mcin | Hmel
Genome size (Mb) | 399.04 | 772.89 | 368.2 | 431.7 | 438.32 | 393 | 269
Karyotype | 2n = 64 | 2n = 56 | 2n = 54 | 2n = 56 | 2n = 62 | 2n = 62 | 2n = 42
No. contigs | 275 | 2,221 | 26,605 | 15,018 | 13,636 | 49,851 | NA
No. scaffolds | NA | 1,717 | 6,181 | 7,397 | 3,597 | 8,262 | 3,807
No. CHR* | 31A + Z + W | 27A + Z + W | 26A + Z + W | 27A + Z | 30A + Z | 30A + Z | 20A + Z
Contig N50 (kb) | 2,620 | 862.49 | 621.9 | 15.5 | 68.35 | 13 | 51
Scaffold N50 (Mb) | NA | 8.92 | 14.2 | 3.7 | 0.915 | 0.119 | 0.277
BUSCO genes (%) | 98.20 | 98.5 | 97.8 | 97.7 | 98.3 | 91.5 | 97.4
Repeat (%) | 11.33 | 42.87 | 20.5 | 43.6 | 31.83 | 28 | 24.94
G + C (%) | 36.96 | 37.43 | 35.6 | 37.3 | 36.5 | 33.0 | NA
No. genes | 23,227 | 17,184 | 14,043 | 14,623 | 15,317 | 16,667 | 12,669

Note

  • Csas, Carposina sasakii; Cpom, Cydia pomonella; Tni, Trichoplusia ni; Bmor, Bombyx mori; Slit, Spodoptera litura; Mcin, Melitaea cinxia; Hmel, Heliconius melpomene; *A represents autosomes; Z and W represent sex chromosomes; NA, not available. Data for all species except Csas were summarized by Wan et al. (2019).
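The concatenation step behind the phylogenetic inference in Section 2.8 can be illustrated with a short Python sketch. It joins per-gene alignments of single-copy genes into one sequence per species; the function name and toy alignments are hypothetical, and this is not the code used inside the OrthoFinder pipeline.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments (dicts of species -> aligned sequence) end to end."""
    # Keep only species present in every alignment (all species, for single-copy genes)
    species = sorted(set.intersection(*(set(aln) for aln in gene_alignments)))
    supermatrix = {sp: "" for sp in species}
    for aln in gene_alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        for sp in species:
            supermatrix[sp] += aln[sp]
    return supermatrix


# Hypothetical toy alignments for two single-copy genes
toy = [
    {"Csas": "MK-L", "Bmor": "MKAL"},
    {"Csas": "GT", "Bmor": "G-"},
]
matrix = concatenate_alignments(toy)
```

Each species ends up with one row of the concatenated matrix, so every column remains a homologous site across species.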

2.9 Manual annotation of circadian genes

We further manually annotated well-studied circadian genes: period (PER), timeless (TIM), Clock (CLK), cycle (CYC) and cryptochrome (CRY), using blast version 2.2.31 (Altschul et al., 1990). Reference protein sequences of insect circadian genes were obtained from the Uniprot database. Conserved domains within proteins were annotated against the conserved domain database (Lu et al., 2020). Circadian genes of the other 15 insect species were annotated in the same way. For a common domain of three genes (CLK, PER and CYC), a neighbour-joining tree was constructed using mega7 (Kumar et al., 2016) with 500 bootstrap replicates.

2.10 Manual annotation of detoxification and chemosensory gene families

We manually annotated five detoxification gene families and four chemosensory gene families: cytochrome P450 monooxygenases (P450s), glutathione S-transferases (GSTs), carboxyl/cholinesterases (CCEs), uridine diphosphate (UDP)-glycosyltransferases (UGTs) and ATP-binding cassette (ABC) transporters, and gustatory receptors (GRs), ionotropic receptors (IRs), odorant-binding proteins (OBPs) and olfactory receptors (ORs). We used the bioinformatic pipeline bitacora (Vizueta et al., 2019) to conduct hmmer version 3.3 (Finn et al., 2011) and blast version 2.2.31 (Altschul et al., 1990) analyses in full mode. Hits were filtered with the default E-value cutoff of 1e-5. The HMMs of P450 were downloaded from Pfam version 32.0 (El-Gebali et al., 2019), while the other HMMs of detoxification gene families were created with hmmer version 3.3 (Finn et al., 2011). Orthologues from Bombyx mori and D. melanogaster were used as evidence. The annotated genes were further filtered manually based on gene length and the presence of conserved domains. For alternatively spliced genes, only the longest transcript was kept. Genes shorter than 80 amino acids were removed. Orthologues were aligned with the G-INS-I algorithm implemented in mafft version 7.450 (Katoh & Standley, 2013). A neighbour-joining tree was constructed for each gene family using mega7 (Kumar et al., 2016) with 500 bootstrap replicates.
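The two length-based rules above (keep the longest transcript per gene, discard genes shorter than 80 amino acids) can be sketched as follows. This is a minimal illustration with hypothetical identifiers and toy sequences; the filtering on conserved domains was manual and is not represented.

```python
def filter_candidates(proteins, min_length=80):
    """proteins: iterable of (gene_id, transcript_id, amino_acid_sequence)."""
    longest = {}
    for gene_id, transcript_id, seq in proteins:
        best = longest.get(gene_id)
        # Keep only the longest transcript of each gene
        if best is None or len(seq) > len(best[1]):
            longest[gene_id] = (transcript_id, seq)
    # Drop genes whose longest protein is shorter than min_length residues
    return {g: t for g, t in longest.items() if len(t[1]) >= min_length}


# Hypothetical toy records
toy = [
    ("g1", "g1.t1", "M" * 120),
    ("g1", "g1.t2", "M" * 95),
    ("g2", "g2.t1", "M" * 60),
    ("g3", "g3.t1", "M" * 80),
]
kept = filter_candidates(toy)
```

Here g1 is represented by its 120-residue transcript, g2 is discarded as too short, and g3 is retained because it is exactly 80 residues long.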

3 RESULTS AND DISCUSSION

3.1 Features of the assembled genome

The genome size of PFM was estimated to be 338.52–352.59 Mb through k-mer analysis, depending on the k-mer size used (k = 17, 21, 25, 35). The k-mer distributions showed double peaks, indicating that this genome has a high rate of duplication and heterozygosity. The estimated heterozygosity ranged from 1.06% to 1.15% and the rate of duplication from 1.95% to 2.06% (Figure 2a). The high heterozygosity of the PFM genome might be caused by the pooling of samples for short-read sequencing. The heterozygosity of PFM is similar to that of Thrips palmi (1.01%–1.32%; Guo et al., 2020) and the caddisfly Stenopsyche tienmushanensis (1.05%–1.10%; Luo et al., 2018), which were also sequenced from pooled DNA of multiple individuals. It is higher than that of the beet armyworm Spodoptera exigua (0.59%; Zhang et al., 2019), the invading fall webworm (0.75% and 0.83%; Wu et al., 2018) and Cydia pomonella (0.6%; Wan et al., 2019), for which heterozygosity was estimated from one individual or from multiple individuals of sibling-mating lines.

FIGURE 2. Genome features of Carposina sasakii. (a) GenomeScope analysis of genome size, heterozygosity and duplication rate using the k-mer (k = 17) count histogram, indicating a genome size of 338.52 Mb, a heterozygosity of 1.06%, and a duplication rate of 2.06%; (b) Genome-wide all-by-all Hi-C interaction identified 31 linkage groups; (c) Synteny between the Carposina sasakii (Csas), Cydia pomonella (Cpom) and Spodoptera litura (Slit) genomes reveals highly conserved gene order and chromosomal fusion or split events in the three moths [Colour figure can be viewed at wileyonlinelibrary.com]

At the contig level, we assembled the PFM genome into 404.83 Mb of sequence, comprising 275 contigs with a contig N50 length of 2.62 Mb. Based on the contig interaction frequency calculated from the read pairs aligned to the contigs, the 275 contigs were clustered into 31 linkage groups (Figure 2b). The longest linkage group was 19.1 Mb and the shortest was 6.30 Mb, with an N50 of 14.39 Mb. BUSCO analysis showed that 98.3% (single-copy genes: 97.4%, duplicated genes: 0.9%) of the 1,658 genes were identified as complete, 0.4% were fragmented and 1.3% were missing in the assembled genome. The genome comprised 36.88% GC base pairs.
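For readers less familiar with the N50 statistic reported above, the following minimal Python sketch computes it from a list of sequence lengths: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The toy lengths are hypothetical and unrelated to the PFM assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths supplied")


# Hypothetical toy contig lengths (arbitrary units)
toy_lengths = [8, 5, 4, 2, 1]
toy_n50 = n50(toy_lengths)
```

With these toy values the total is 20, and the two longest sequences (8 + 5 = 13) are the first to cover half of it, so the N50 is 5.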

Synteny analysis showed that the PFM, S. litura and C. pomonella genomes have a highly conserved gene order (Figure 2c). Chromosome 01 of PFM was identified as the Z chromosome based on its high synteny to the Z chromosome of S. litura (Figure 2c). Chromosome 04 of PFM shows high synteny to chromosome 02 of S. litura, which is fused with the Z chromosome in C. pomonella, forming the neo-Z chromosome (chromosome 01; Wan et al., 2019). In conclusion, PFM has a chromosome complement similar to that of S. litura, comprising 30 autosomes and a Z chromosome. The female-specific W chromosome of PFM was not determined in this genome, since male pupae were used for sequencing. Identification of the W chromosome is challenging in Lepidoptera because it is highly degenerate, gene-poor and repeat-rich (Bergero & Charlesworth, 2009; Wan et al., 2019). The chromosome-level assembly of the PFM genome provides resources for understanding chromosome evolution in the Lepidoptera (Ahola et al., 2014).

3.2 Genome annotation

We identified 27,598 protein-coding genes in the first round of maker annotation. busco analysis revealed that 91.5% of the evaluated single-copy genes were identified as complete. After three rounds of maker annotation, the number of genes increased to 50,676, while the proportion of complete single-copy genes rose to 95.0%. After filtering based on gene expression analysis, functional domains and AED values, 21,695 genes remained. busco analysis showed that 94.8% (single-copy genes: 90.5%, duplicated genes: 4.3%) of the evaluated single-copy genes were identified as complete, 3.6% were fragmented and 1.6% were missing in the annotated gene set. In total, 18,397 genes (84.80%) were functionally annotated, of which 5,303 (24.44%) and 3,829 (17.65%) genes were annotated to GO terms and KEGG KOs, respectively. We predicted 53 rRNAs, 11,076 tRNAs, 20 small nuclear RNAs and 48 microRNAs in the PFM genome based on the Rfam database.

In total, 35.70 Mb (8.95%) of the genome was identified as repetitive DNA. Overall, 179,142 transposable elements (TEs, 35.14 Mb) were identified, including 139,294 retroelements (16,818 short interspersed nuclear elements (SINEs), 107,497 long interspersed nuclear elements (LINEs) and 14,979 long terminal repeats (LTRs)) and 39,848 DNA transposons (Table 1, Table S1). The PFM genome has the lowest repeat content among the seven chromosome-level genome assemblies from the Lepidoptera (Table 1).

3.3 Orthology and phylogenetic relationships of lepidopterans

OrthoFinder assigned 320,821 genes (93.41% of total) to 15,076 orthogroups for the 16 species compared. Fifty percent of the assigned genes were in orthogroups with 28 or more genes (G50 was 28) and were contained in the largest 3,174 orthogroups (O50 was 3,174).

There were 947 single-copy genes with 364,262 reliable sites retained for phylogenetic inference. The topology is congruent with previously inferred phylogenetic relationships of Lepidoptera, in which no representative of the Copromorphoidea was included (Wan et al., 2019). Current molecular phylogenetic studies have not resolved the phylogenetic relationship between Copromorphoidea and Papilionoidea (Mitter et al., 2017). Our result supports a sister-group relationship between PFM (Copromorphoidea) and the butterfly Danaus plexippus (Papilionoidea), rather than a sister-group relationship between Copromorphoidea/Papilionoidea and Pyraloidea + (Noctuoidea + Bombycoidea) (Figure 3).

FIGURE 3. Phylogenetic tree of PFM with 15 insect genomes including 11 other Lepidoptera. The phylogeny was inferred from 947 single-copy genes with 364,262 reliable sites by an approximately-maximum-likelihood method. All nodes received bootstrap support of 100

We investigated orthogroups shared by PFM and four species of Lepidoptera representing different clades of the phylogenetic tree of Lepidoptera (Figure S1). There were 7,827 orthogroups (60.5% of 12,938 orthogroups) shared by all five lepidopteran species and 1,549 orthogroups shared by the four species other than C. pomonella. We identified 357 orthogroups specific to PFM, fewer than in B. mori (406) but more than in the other three lepidopteran species (Figure S1).

3.4 Evolution of circadian genes

Five core genes of the circadian clock were annotated in the PFM genome and in the other reference insect species. The PER gene was not found in the currently assembled genomes of Cydia pomonella and Anoplophora glabripennis (Figure 4). Two types of CRY gene were annotated in the 16 species: mammalian-type cryptochrome (CRY-m) and Drosophila-type cryptochrome (CRY-d). CRY-d is a UV- and blue-light photoreceptor and is not considered a core component of the circadian clock in D. melanogaster (Goto, 2013). In the circadian clock model established from Danaus plexippus (L.), CRY-m functions as a transcriptional repressor instead of as a photoreceptor (Yuan et al., 2007; Zhu et al., 2008), and it was incorporated into the core component of the circadian clock as a negative regulator, like PER and TIM. For most of the 16 insects, both types of CRY gene were found, while only CRY-m was found in the two Coleoptera species and only CRY-d was found in D. melanogaster. In the wasp Nasonia vitripennis, only CRY-m was found and TIM does not exist in the genome (Schurko et al., 2010). In the honey bee Apis mellifera, TIM was lost (Rubin et al., 2006). In the PFM, FPKM values of CRY-m were higher than those of CRY-d in each stage, indicating that CRY-m may be a major element in the circadian clock of PFM. These considerable differences in the gene composition of the circadian clock indicate diverse circadian clock models among insects.

Schematic arrangement of the domains of five circadian genes, period (PER), timeless (TIM), Clock (CLK), cycle (CYC) and cryptochrome (CRY-m, CRY-d), in Carposina sasakii and 15 other insects. Boxes in different colours show different domains, while numbers under the boxes show the positions of domains on the protein sequences. Species and their taxonomic status are shown on the left: Tcas, Tribolium castaneum; Agla, Anoplophora glabripennis; Agam, Anopheles gambiae; Dmel, Drosophila melanogaster; Pxy, Plutella xylostella; Cpom, Cydia pomonella; Csas, Carposina sasakii; Dple, Danaus plexippus; Tni, Trichoplusia ni; Slit, Spodoptera litura; Harm, Helicoverpa armigera; Bmor, Bombyx mori; Bman, Bombyx mandarina; Msex, Manduca sexta; Gmel, Galleria mellonella; Ofur, Ostrinia furnacalis [Colour figure can be viewed at wileyonlinelibrary.com]

Domains of circadian genes were conserved among the 16 species (Figure 4). PAS domains are widely present in photoreceptors and circadian proteins of many eukaryotes (Taylor & Zhulin, 1999). For the insects investigated in this study, the CLK, CYC and PER genes each have two PAS domains. In previous studies, the two PAS domains were defined as PAS-A and PAS-B. In this study, PAS-A is equivalent to PAS and PAS-fold (Figures 4 and 5). Besides the PAS domains, the CLK and CYC genes share a bHLH domain located before the two PAS domains (Figure 4). Both the CRY-d and CRY-m genes have a PhrB domain, and the TIM gene has a TIMELESS domain (Figure 4). We constructed phylogenetic relationships among the PAS domains for the 16 species. The phylogenetic tree of PAS domains revealed six clades, corresponding to the two PAS domains of three genes (CLK, CYC and PER) (Figure 5).

Phylogenetic relationships of the two PAS domains in three circadian genes: period (PER), Clock (CLK) and cycle (CYC). Each tip is labelled with the name of the domain, gene and species. Abbreviations of species are the same as in Figure 4. Six clades shaded in different colours correspond to the two domains of the three genes, while one domain of the CLK gene has two different types among species. Tips in red show the position of Carposina sasakii [Colour figure can be viewed at wileyonlinelibrary.com]

3.5 Gene expression in diapause and nondiapause PFM

Compared with larvae that developed under a normal day photoperiod, 11 genes were significantly upregulated and nine genes were downregulated in larvae that developed under long-day or short-day photoperiods (Table S2, Figure S2). Genes highly expressed in prediapause larvae (SD and LD photoperiods) included a gene encoding CUSOD2 (CS_07203), an enzyme that destroys radicals and plays an important role in diapause and cold tolerance of insects (Bi et al., 2014; He et al., 2013; Isobe et al., 2006; Kim et al., 2010; Sim & Denlinger, 2011; Zhao & Shi, 2009). We identified a cytochrome P450 gene (CS_20496) showing weak expression in prediapause larvae and high expression in the other stages, as was also found in diapausing larvae of the wild silk moth, Antheraea yamamai (Yang et al., 2008).

We identified 44 genes specifically upregulated under a long-day photoperiod, and 14 genes specifically upregulated under a short-day photoperiod (Table S2, Figure S2). Four genes (CS_04235, CS_05017, CS_15183, CS_01854) related to the digestion of proteins were upregulated in larvae developing under a long-day photoperiod. This is congruent with previous reports suggesting that photoperiod has significant effects on digestive enzyme activity (Espinosa-Chaurand et al., 2017; Ramzanzadeh et al., 2016; Shan et al., 2008; Subala & Shivakumar, 2017). The functional link of many of these genes to diapause remains unclear. The circadian genes, which are important in diapause in another moth (Kozak et al., 2019), did not show significant expression changes in larvae under different photoperiods.

3.6 Evolution of detoxification and chemoreceptor genes

We manually identified 95 P450s, 76 GSTs, 63 CCEs, 27 UGTs, and 93 ABCs in the PFM genome (Table 2; Figure S3). PFM had the lowest number of UGT genes and CCE genes, along with two other moths located at basal lineages of the Lepidoptera. The number of P450 genes in PFM is the second lowest, only slightly higher than in Bombyx mandarina. The number of ABC genes in PFM is at an intermediate level. We found that the PFM had the highest number of GST genes. These results suggest that PFM may have a unique way of detoxification, with reduced importance of UGT and P450 when compared to the other moths. This may have implications for pesticide responses in PFM, given that these detoxification genes can respond in different ways to various pesticides in moths (Hu et al., 2019).

TABLE 2. Number of genes in five detoxification and four chemosensory gene families across species of Lepidoptera
Species P450 GST CCE ABC UGT GR IR OBP OR Reference
Pxyl 163 55 85 219 38 68 61 91 225 You et al. (2013)
Csas 95 76 63 93 27 41 49 56 176 This study
Cpom 136 30 73 47 30 65 39 50 85 Wan et al. (2019)
Dple 107 35 73 76 47 74 62 68 253 Zhan et al. (2011)
Tni 143 51 122 71 68 85 74 91 274 Chen et al. (2019)
Slit 182 66 153 223 64 107 62 87 261 Cheng et al. (2017)
Harm 122 57 105 76 54 120 71 79 257 Song et al. (2015)
Bman 94 37 94 64 48 69 56 74 218 Xiang et al. (2018)
Bmor 156 51 149 108 50 85 57 93 243 Xia et al. (2004)
Msex 164 66 137 103 51 89 76 79 261 Kanost et al. (2016)
Gmel 137 44 75 72 58 95 81 64 295 Lange et al. (2018)
Ofur 126 48 115 112 46 93 67 75 270 Ma et al. (2020)
a Data from Wan et al. (2019); the other data were manually identified in our study. Pxyl, Plutella xylostella; Csas, Carposina sasakii; Cpom, Cydia pomonella; Dple, Danaus plexippus; Tni, Trichoplusia ni; Slit, Spodoptera litura; Harm, Helicoverpa armigera; Bman, Bombyx mandarina; Bmor, Bombyx mori; Msex, Manduca sexta; Gmel, Galleria mellonella; Ofur, Ostrinia furnacalis. P450, cytochrome P450 monooxygenase; GST, glutathione S-transferase; CCE, carboxyl/cholinesterase; UGT, uridine diphosphate-glycosyltransferase; ABC, ATP-binding cassette transporter; GR, gustatory receptor; IR, ionotropic receptor; OBP, odorant-binding protein; OR, olfactory receptor.
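The position of PFM within each detoxification family can be checked directly from the counts in Table 2. The sketch below is offered as a convenience for readers and is not part of the original analysis. It transcribes the five detoxification columns and ranks Csas against the other 11 species. The helper name rank_from_lowest is introduced here for illustration only.

```python
# Counts transcribed from Table 2 (detoxification families only).
DETOX_FAMILIES = ("P450", "GST", "CCE", "ABC", "UGT")
DETOX_COUNTS = {
    "Pxyl": (163, 55, 85, 219, 38),
    "Csas": (95, 76, 63, 93, 27),
    "Cpom": (136, 30, 73, 47, 30),
    "Dple": (107, 35, 73, 76, 47),
    "Tni": (143, 51, 122, 71, 68),
    "Slit": (182, 66, 153, 223, 64),
    "Harm": (122, 57, 105, 76, 54),
    "Bman": (94, 37, 94, 64, 48),
    "Bmor": (156, 51, 149, 108, 50),
    "Msex": (164, 66, 137, 103, 51),
    "Gmel": (137, 44, 75, 72, 58),
    "Ofur": (126, 48, 115, 112, 46),
}

def rank_from_lowest(family, species="Csas"):
    """1-based rank of `species` for `family`, counting up from the lowest count."""
    col = DETOX_FAMILIES.index(family)
    target = DETOX_COUNTS[species][col]
    return 1 + sum(1 for counts in DETOX_COUNTS.values() if counts[col] < target)

for family in DETOX_FAMILIES:
    print(family, rank_from_lowest(family), "of", len(DETOX_COUNTS))
# P450 2 of 12
# GST 12 of 12
# CCE 1 of 12
# ABC 7 of 12
# UGT 1 of 12
```

A rank of 1 means the lowest count among the 12 species and a rank of 12 the highest, so the output is consistent with PFM having the fewest CCE and UGT genes, the second-fewest P450 genes, an intermediate number of ABC genes, and the most GST genes.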

The chemosensory system plays an important role in locating food, shelter, mates, and oviposition sites (Wan et al., 2019). We manually annotated several well-studied gene families of the chemosensory system and identified 41 GRs, 49 IRs, 56 OBPs, and 176 ORs in the PFM genome (Table 2; Figure S4). Compared with the other lepidopterans, PFM had the lowest number of GR genes and the second-lowest number of IR, OBP, and OR genes after C. pomonella. Both PFM and C. pomonella therefore have fewer chemoreceptor genes than the other species. The PFM and C. pomonella mainly feed on apple and pear, with a narrow host range. Females of these two moths lay eggs on the surface of their host fruits, and the hatched larvae bore into the fruit directly without prolonged searching behavior (van der Geest & Evenhuis, 1991). The narrow host range and the feeding behavior of the larvae may explain the lower number of chemoreceptor genes in their genomes.
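The statement that PFM and C. pomonella carry the smallest chemosensory repertoires can be illustrated by summing the four chemosensory columns of Table 2 for each species. This is a minimal sketch using the published counts. The summed total is a simple descriptive index introduced here, not a statistic reported in the study, and the helper name two_lowest is likewise illustrative.

```python
# Counts transcribed from Table 2 (chemosensory families only).
CHEMO_FAMILIES = ("GR", "IR", "OBP", "OR")
CHEMO_COUNTS = {
    "Pxyl": (68, 61, 91, 225),
    "Csas": (41, 49, 56, 176),
    "Cpom": (65, 39, 50, 85),
    "Dple": (74, 62, 68, 253),
    "Tni": (85, 74, 91, 274),
    "Slit": (107, 62, 87, 261),
    "Harm": (120, 71, 79, 257),
    "Bman": (69, 56, 74, 218),
    "Bmor": (85, 57, 93, 243),
    "Msex": (89, 76, 79, 261),
    "Gmel": (95, 81, 64, 295),
    "Ofur": (93, 67, 75, 270),
}

def two_lowest(family):
    """The two species with the fewest genes in `family`, lowest first."""
    col = CHEMO_FAMILIES.index(family)
    return sorted(CHEMO_COUNTS, key=lambda sp: CHEMO_COUNTS[sp][col])[:2]

# Summed repertoire per species (illustrative index only).
totals = {sp: sum(counts) for sp, counts in CHEMO_COUNTS.items()}
lowest_totals = sorted(totals, key=totals.get)[:2]

for family in CHEMO_FAMILIES:
    print(family, two_lowest(family))
# GR ['Csas', 'Cpom']
# IR ['Cpom', 'Csas']
# OBP ['Cpom', 'Csas']
# OR ['Cpom', 'Csas']
print(lowest_totals, totals["Cpom"], totals["Csas"])  # ['Cpom', 'Csas'] 239 322
```

In every family the two smallest counts belong to these two fruit-boring species, and their summed totals (239 and 322) are well below the next-lowest total of 417 in B. mandarina.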

In conclusion, we assembled the chromosome-level genome for the PFM using PacBio long reads and Hi-C technology. This is the first assembled genome for the superfamily Copromorphoidea. This novel genomic resource allowed us to explore possible genes in PFM associated with adaptation to environmental factors. We identified five core genes relating to the circadian rhythm in PFM and annotated models for each gene. Using the genome as a reference, we investigated DEGs related to the diapause of PFM, which may point to candidate genes. Given the expression of long-day and short-day diapause by PFM, this moth species will be a useful model to further investigate adaptive shifts involving diapause, particularly by combining genomic information with intraspecific comparisons across geographic gradients (Ragland et al., 2019). The assembled genome provides a resource for further comparative studies of moths and butterflies, particularly with respect to life cycle evolution and parallel evolution in detoxification and chemosensory functions.

ACKNOWLEDGEMENTS

We thank Qiang Gao for his help with the assembly of Hi-C data and Qiang Gong for assistance with insect rearing. This research was supported by the National Natural Science Foundation of China (31901884, 32070464), the National Key Research and Development Program of China (2019YFD1002102), the Natural Science Foundation of Beijing Municipality (6184037), the Joint Laboratory of Pest Control Research Between China and Australia (Z201100008320013), and the Beijing Key Laboratory of Environmentally Friendly Pest Management on Northern Fruits (BZ0432).

AUTHOR CONTRIBUTIONS

S.-J. Wei conceived and designed the study; J.-C. Chen and Y.-J. Gong conducted the collection and rearing of the insect; L.-J. Cao conducted the molecular work; L.-J. Cao, W. Song, L. Yue, S.-K. Guo, and S.-J. Wei analysed the data; L.-J. Cao, S.-J. Wei, and A. Hoffmann discussed the results and wrote the manuscript.

DATA AVAILABILITY STATEMENT

The whole genome assembly has been deposited in the Genome repository of NCBI (accession numbers: CP053148-CP053178) under BioProject PRJNA627116. Raw reads obtained for genome assembly were deposited in the Sequence Read Archive (SRA) repository (accession numbers: SRR12328811 and SRR12336732). Scripts used for assembling and annotating the genome and sequences of manually annotated gene families were deposited in the Dryad repository (https://doi.org/10.5061/dryad.m0cfxpp1j).
