Volume 21, Issue 1 pp. 327-339
RESOURCE ARTICLE

The genome sequence of Samia ricini, a new model species of lepidopteran insect

Jung Lee (Corresponding Author)

Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan

Department of Life Science, Faculty of Science/Graduate School of Science, Gakushuin University, Tokyo, Japan

Correspondence

Jung Lee, Department of Life Science, Faculty of Science/Graduate School of Science, Gakushuin University, Mejiro 1-5-1, Toshima-ku, Tokyo 171-8588, Japan.

Email: [email protected]

Tomoaki Nishiyama

Advanced Science Research Center, Kanazawa University, Kanazawa, Japan

Shuji Shigenobu

Functional Genomics Facility, National Institute for Basic Biology, Okazaki, Japan

Katsushi Yamaguchi

Functional Genomics Facility, National Institute for Basic Biology, Okazaki, Japan

Yutaka Suzuki

Laboratory of Systems Genomics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan

Toru Shimada

Department of Life Science, Faculty of Science/Graduate School of Science, Gakushuin University, Tokyo, Japan

Susumu Katsuma

Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan

Takashi Kiuchi

Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan

First published: 28 September 2020

Abstract

Samia ricini, a gigantic saturniid moth, has the potential to be a novel lepidopteran model species. Samia ricini is far more resistant to diseases than the current model species Bombyx mori, and therefore can be more easily reared. In addition, genetic resources available for S. ricini rival those for B. mori: at least 26 ecoraces of S. ricini are reported and S. ricini can hybridize with wild Samia species, which are distributed throughout Asian countries, and produce fertile progeny. Physiological traits such as food preference, integument colour and larval spot pattern differ among S. ricini strains and wild Samia species so that those traits can be targeted in forward genetic analyses. To facilitate genetic research in S. ricini, we determined its whole genome sequence. The assembled genome of S. ricini was 450 Mb with 155 scaffolds, and the scaffold N50 length of the assembly was ~21 Mb. In total, 16,702 protein coding genes were predicted. Although the S. ricini genome was mostly collinear with that of B. mori, with some rearrangements, and few S. ricini-specific genes were discovered, chorion genes and fibroin genes seemed to have expanded in the S. ricini lineage. As the first step of genetic analyses, causal genes for “Blue,” “Yellow,” “Spot,” and “Red cocoon” phenotypes were mapped to chromosomes.

1 INTRODUCTION

Bombyx mori has long been the predominant model organism in Lepidoptera and has allowed researchers to make remarkable discoveries. For example, Toyama (1906) confirmed Mendel's laws of heredity to be valid for B. mori and this was the first case showing the validity of Mendel's laws for an animal species. When Beadle and Tatum (1941) proposed the “one gene–one enzyme hypothesis”, Kikkawa (1941) almost simultaneously reached a similar concept by using egg colour mutants of B. mori. There is no doubt that the availability of hundreds of mutant strains contributed to those discoveries. Since the whole genome sequence of B. mori was determined (International Silkworm Genome Consortium, 2008), the tractability of B. mori as a model species has increased significantly. Whole genome sequences of numerous other lepidopteran species, including Papilio polytes, Danaus plexippus and Lymantria dispar, are also now available and these genome sequences have enabled important studies in these species (e.g., Gu et al., 2019; Nishikawa et al., 2015; Zhang et al., 2019). However, it is typically the case that genome-sequenced lepidopteran species have few mutant strains, and, to date, the feasibility of forward genetic analyses in these species remains to be established.

Bombyx mori and its wild relative Bombyx mandarina have some unique characteristics that distinguish them from many lepidopteran species. For example, B. mori and B. mandarina diapause at the embryonic stage while most lepidopteran species diapause at the larval or pupal stages. Their host plant range is also unusual. B. mori and B. mandarina are oligophagous: only a small subset of Morus spp., including M. alba, M. bombycis and M. latifolia, act as their food plants. On the other hand, as represented by agricultural pests, such as Plutella xylostella, Spodoptera litura and Helicoverpa armigera, many lepidopteran species are polyphagous. For a better understanding of biological characteristics and the evolutionary trajectories of lepidopteran species, it is necessary to investigate additional model species.

One alternative model candidate in genetic research is Samia ricini (Figure 1), also known as “Eri silkmoth,” which is the only Saturniid species that is fully domesticated (Peigler & Naumann, 2003). Previous reports have suggested that S. ricini was derived from Samia canningi, a wild Samia species (Peigler & Naumann, 2003; Singh et al., 2017). Samia ricini originated in Assam, India, but it has been artificially transferred to many Asian countries as well as to other regions (Peigler & Naumann, 2003). Although this species has been fully domesticated for the purpose of silk production, it still retains some traits that have been lost in B. mori, such as foraging ability and adult flight ability. The sex determination system also differs between B. mori and S. ricini (Traut et al., 2007); while W chromosome-derived PIWI-interacting RNA determines femaleness of B. mori (Kiuchi et al., 2014), S. ricini has a ZZ/Z0 sex determination system and lacks the W chromosome (Yoshido et al., 2011).

Figure 1. Graphical view of Samia ricini. (a) Fifth-instar larva. (b) Adult male moth. Scale bar: 10 mm

Key advantages of utilizing S. ricini for genetic research lie in its intraspecies genetic diversity and ability for interspecific hybridization: S. ricini reportedly consists of at least 26 morphologically different ecoraces (Singh et al., 2017). In addition, S. ricini is able to produce fertile hybrids with wild Samia species (Brahma et al., 2015; Peigler & Naumann, 2003), such as S. canningi or S. cynthia pryeri. Populations of S. canningi and S. c. pryeri are distributed throughout South and East Asian countries, and natural variation in traits such as larval integument colour, larval marking patterns, cocoon colour and host plant preference can be observed among populations (Brahma et al., 2015; Peigler & Naumann, 2003).

Another advantage of S. ricini is that it is relatively easy to rear. Samia ricini is a multivoltine species whereas most saturniid species are univoltine or bivoltine (Brahma et al., 2015; Sternburg & Waldbauer, 1984), which means that research into S. ricini is free from seasonal limitations (Singh et al., 2017). In addition, S. ricini grows uniformly and can be reared synchronously at large scale, resulting in efficient egg production. We have already succeeded in establishing a genome-editing system in this species using transcription activator-like effector nucleases (TALENs) and have successfully obtained several gene knockout lines (Lee et al., 2018), meaning that functional analysis of genes of interest is now achievable.

To facilitate genetic research of S. ricini, we have determined its whole genome sequence. We used both long-read and short-read sequencers, namely the PacBio Sequel system and Illumina HiSeq1500, to construct a high-quality genome assembly. After the assembly was completed, we attempted to identify the chromosomes responsible for multiple larval phenotypes in S. ricini and S. c. pryeri as an initial feasibility test for forward genetic research in this system.

2 MATERIALS AND METHODS

2.1 Insects

The UT strain of Samia ricini and Nagano strain of S. c. pryeri larvae were provided by the National BioResource Project (NBRP; http://shigen.nig.ac.jp/wildmoth/). Samia ricini larvae were reared on Ricinus communis leaves under long-day conditions (16-hr light/8-hr dark) at 25°C. S. c. pryeri larvae were reared on Ailanthus altissima with the same photoperiod and temperature conditions. F1 interspecific hybrids were obtained by crossing S. c. pryeri females and S. ricini males. F1 individuals and BC1 individuals were reared under the same conditions as S. ricini larvae.

2.2 DNA sample preparation for whole genome sequencing

Posterior silk glands were sampled from fifth-instar larvae. Genomic DNA was prepared using Genomic-tip 100/G (Qiagen) according to the manufacturer's protocol.

2.3 Library preparation and genome sequencing

For whole genome sequencing, the PacBio Sequel System (Pacific Biosciences) and Illumina HiSeq (Illumina) were employed. For PacBio, a 20-kb library was prepared and four SMRT cells were used for sequencing; 26.8 Gb in 3,267,255 subreads were obtained (Table S1). Illumina paired-end and mate-pair libraries were prepared using the Illumina PCR-Free library prep kit, Nextera Mate Pair library prep kit and Kapa Hyper Prep kit. Paired-end libraries were constructed from DNA fragmented with Covaris S2 and separated with an agarose gel at 200–250 bp (male ZZ) and with Sage-ELF at 310–530 bp (female Z0). Mate-pair libraries were separated with CHEF-electrophoresis after tagmentation, and DNA recovered from gel slices 3 to ~40 kb was used for the subsequent process. All the libraries had different indexes and were combined for sequencing in the same lane. Paired-end sequencing was performed for 126 bp from both ends on an Illumina HiSeq1500 in high-output mode with version 4 chemistry at the National Institute for Basic Biology (NIBB). In total, 401,799,912 read pairs were obtained, with further details of sequencing libraries and output summarized in Table S2.

2.4 RNA sequencing

Embryo-derived libraries for RNA sequencing (RNA-seq) were prepared using a TruSeq RNA Library Prep Kit (Illumina) and were sequenced using the Illumina HiSeq 2500 platform with 100- and 101-bp paired-end reads. The library for midgut-derived RNA samples was prepared using a TruSeq RNA Library Prep Kit (Illumina) and sequenced using the Genome Analyser IIx System with 76-bp paired-end reads. The libraries for anterior silk gland- and middle silk gland-derived RNA samples were prepared using a SureSelect Strand Specific RNA Library Prep Kit (Agilent) and were sequenced using the Illumina HiSeq 2500 platform with 100-bp paired-end reads. Table S3 summarizes information on the results of RNA-seq.

2.5 Quality check and trimming

The quality of Illumina short reads was examined using fastqc version 0.11.3. Based on the quality check results, reads were trimmed using trimmomatic version 0.36 (Bolger et al., 2014).

2.6 Heterozygosity assessment

Heterozygosity in one of the sequenced individuals was estimated using jellyfish version 2.2.3 (Marçais & Kingsford, 2011) and the web-enabled version of genomescope (http://qb.cshl.edu/genomescope/; Vurture et al., 2017). For comparison, heterozygosity of Antheraea yamamai (Kim et al., 2018) was also estimated. The k-mer value for jellyfish was set to k = 31. Short read data used for heterozygosity assessment are available under accession nos. DRR213145 (S. ricini) and SRR5641445 (A. yamamai).
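As a rough illustration of the k-mer counting step that feeds a k-mer profile into genomescope, the following Python sketch builds a k-mer frequency histogram from a handful of reads. It is a toy stand-in and not the jellyfish implementation; the function names are hypothetical, and the usage example uses a very small k for readability (the study used k = 31).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)

def kmer_histogram(reads, k=31):
    """Count canonical k-mers, then tally how many distinct k-mers occur at each depth."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())  # depth -> number of distinct k-mers

# Toy usage with two overlapping reads and k = 3 (illustrative input only)
histogram = kmer_histogram(["ACGTA", "TACGT"], k=3)
```

In a real profile of this kind, heterozygous sites tend to produce a secondary peak at about half the depth of the main homozygous peak, which is the signal that model-fitting tools use to estimate heterozygosity.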

2.7 Genome assembly and completeness assessment

Long reads derived from the Sequel System were assembled using the HGAP4 pipeline bundled in smrtlink version 5.0.1. To construct consensus sequences from draft contigs from HGAP4 (Chin et al., 2016), racon version 1.2.0 (Vaser et al., 2016) with minimap version 0.2 (Li, 2016) was employed. racon treatment was repeated until the output fasta file showed no difference from that of the previous run. In this case, four repeats were sufficient for convergence. Then, to polish the assembly, pilon version 1.21 (Walker et al., 2014) was utilized with Illumina short reads. This final assembly was deposited at DDBJ (with accession nos. BLXV01000001–BLXV01000155). The completeness of the final assembly was assessed using busco version 3.0.2 (Waterhouse et al., 2018). For comparison, the latest genome assemblies of four lepidopteran species, including Bombyx mori (BHWX01000001–BHWX01000696), Papilio xuthus (GCA_000836235.1), Danaus plexippus (GCA_000235995.2) and Plutella xylostella (http://download.lepbase.org/v4/sequence/Plutella_xylostella_pacbiov1_-_scaffolds.fa.gz) were also submitted to busco (Table S4).

2.8 Linkage analysis of scaffolds

To clarify the linkages between scaffolds, we adopted a classical genetic approach. First, we obtained backcross generation 1 (BC1) individuals between S. ricini and S. c. pryeri, a closely related species to S. ricini. The crossing scheme was (S. c. pryeri × S. ricini) × S. ricini. Because meiotic recombination does not occur in ovaries of lepidopteran species (Marec, 1996; Yoshido et al., 2011), chromosomes in BC1 individuals should be S. ricini/S. c. pryeri heterozygotes or S. ricini/S. ricini homozygotes.

We designed 35 PCR-based genetic markers which can specifically detect 35 scaffolds longer than 1 Mb and can molecularly distinguish S. ricini and S. c. pryeri (Figure S1a) and then performed genomic PCR. Genomic DNA was extracted from the legs of eight BC1 larvae using the DNeasy Blood and Tissue Kit (Qiagen). The genomic PCR programme was as follows: 40 cycles of 10 s at 98°C, 5 s at 60°C and 5 s at 68°C. KOD One PCR Master Mix (TOYOBO) was used to perform genomic PCR. Then, the allele combinations of scaffolds in eight BC1 individuals were examined. According to the result, we designated the identified linkage groups according to Yoshido et al. (2011). Electrophoresis was conducted using a 2.0% agarose gel or MultiNA microchip electrophoresis system (Shimadzu). Table S5 lists all primers used for linkage analysis.
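The grouping logic behind this analysis can be sketched in a few lines of Python. Because maternal chromosomes are transmitted without recombination, markers on scaffolds from the same chromosome should show identical genotype patterns across all BC1 individuals. The function name, the 'H'/'R' encoding and the genotype calls below are hypothetical illustrations, not data or code from the study.

```python
from collections import defaultdict

def group_scaffolds(genotypes):
    """Group scaffolds whose marker genotypes co-segregate across all BC1 individuals.

    genotypes maps a scaffold name to a string with one character per BC1
    individual: 'H' (S. ricini/S. c. pryeri heterozygote) or 'R'
    (S. ricini/S. ricini homozygote).
    """
    groups = defaultdict(list)
    for scaffold, pattern in genotypes.items():
        groups[pattern].append(scaffold)
    # Sort members and groups so the output is reproducible
    return sorted(sorted(members) for members in groups.values())

# Hypothetical genotype calls for eight BC1 individuals (not data from the study)
example = {
    "scaf_A": "HHRRHRHR",
    "scaf_B": "HHRRHRHR",
    "scaf_C": "RHRHHHRR",
    "scaf_D": "RHRHHHRR",
    "scaf_E": "HRRRHHRH",
}
linkage_groups = group_scaffolds(example)
```

Identical patterns are necessary for linkage under these assumptions, but two unlinked chromosomes can share a pattern by chance when few individuals are scored, so genotyping more BC1 individuals makes the grouping more reliable.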

2.9 Repeat identification and comparative analysis

To identify the repeat elements of the S. ricini genome, a custom repeat library was constructed using repeatmodeler version 1.0.11 (http://www.repeatmasker.org/RepeatModeler/) with recon version 1.0.8 (Bao & Eddy, 2002), repeatscout version 1.0.5 (Price et al., 2005) and trf version 4.0.4 (Benson, 1999). To mask and annotate repetitive sequences in S. ricini, the constructed custom repeat library was utilized by repeatmasker version 4.0.7 (http://www.repeatmasker.org/RMDownload.html; Tarailo-Graovac & Chen, 2009) with repbase version 20170127 (Jurka et al., 2005) and the rmblast engine (http://www.repeatmasker.org/RMBlast.html).

2.10 Gene prediction

The braker2 pipeline (Camacho et al., 2009; Hoff et al., 2016, 2019; Lomsadze et al., 2014; Stanke et al., 2006, 2008) was employed for gene prediction. First, repetitive sequences in the genome identified by repeatmasker were soft-masked. To generate extrinsic evidence for gene prediction, 11 sets of RNA-sequencing (RNA-seq) reads (Table S3) were mapped to the genome sequence using hisat2 (Kim et al., 2015). The resultant BAM files generated by hisat2 were submitted to braker2 by using “--bam” and “--softmasking” options. In parallel, we assembled the RNA-seq reads using the trinity assembler (Haas et al., 2013). Then, the tr2aacds.pl program bundled in the EvidentialGene suite (http://arthropods.eugenes.org/EvidentialGene/evigene/) was used to merge the assemblies from multiple transcriptome data sets. The merged transcriptome assemblies were aligned to the genome sequence using pasa (Haas et al., 2008, 2013) for identifying the exon regions. In addition to tr2aacds.pl program, stringtie (Pertea et al., 2015) was also used to merge multiple transcriptome data for exon prediction. In addition, amino acid sequences of manually annotated sequences of S. ricini deposited in the Universal Protein Resource database (UniProt, http://www.uniprot.org) (Bateman, 2019) were aligned to the genome sequence using exonerate version 2.2.0 (Slater & Birney, 2005) to obtain protein spliced alignment information. Finally, the multiple predictions generated by braker2, pasa, stringtie and exonerate were integrated using evidencemodeler (Haas et al., 2008). To assess the completeness of gene prediction, predicted gene sets were also submitted to busco version 3.0.2 (Waterhouse et al., 2018).

2.11 Functional annotation

Amino acid sequences of the predicted genes were aligned to the UniProt database with the blastp program (Camacho et al., 2009). Protein classification and domain searches were achieved via the interproscan program (Finn et al., 2017) with the Pfam database (El-Gebali et al., 2019). These analyses were done in OmicsBox software through trial mode (Conesa & Götz, 2008).

2.12 Comparative genome analysis

To identify orthologue groups among multiple species, including B. mori, D. plexippus, Papilio xuthus and Plutella xylostella, orthofinder (Emms & Kelly, 2015) was used. Each gene set corresponded to the genome assembly, which was used for busco analysis (Table S4). With regard to D. plexippus, Papilio xuthus and Plutella xylostella, the proteome data were obtained from Lepbase version 4 (http://lepbase.org) (Challis et al., 2016). The proteome data of B. mori were obtained from SilkBase (http://silkbase.ab.a.u-tokyo.ac.jp/cgi-bin/download.cgi).

2.13 Drawing a circular ideogram for Bombyx mori and Samia ricini genomes

To assess the similarity of B. mori and S. ricini genomes, a circular ideogram was drawn using clico (Cheong et al., 2015) with the circos program (Krzywinski et al., 2009). Single-copy orthologues, identified by orthofinder in each genome, were connected. To simplify the ideogram, short scaffolds in the B. mori genome assembly which were not assigned to 28 chromosomes were filtered out.

2.14 Identifying the chorion gene cluster and phylogenetic analysis of chorion genes

orthofinder found that chorion genes were tandemly arrayed on chromosome 1 (Chr. 1). For more detailed information, we performed blastp searches against the NCBI nonredundant protein database, as well as the UniProt database (Bateman, 2019) with an e-value cut-off of less than 1e-5. The predicted gene models within and around the chorion gene region were used as query sequences. As a result, 80 chorion genes were found in a cluster on Chr. 1. Within this cluster, five nonchorion gene models (evm.model.Sr_HGAP_JL_scaf_2.1123, 1128, 1135, 1136 and 1137) were also identified (Table S6). Phylogenetic analysis of chorion genes was conducted with 80 S. ricini chorion genes, 121 B. mori chorion genes, 21 Plutella xylostella chorion genes, 29 Papilio xuthus chorion genes, 24 D. plexippus chorion genes registered at the UniProt and NCBI databases and one nonchorion gene (evm.model.Sr_HGAP_JL_scaf_2.1135) as an outgroup. muscle was used to generate alignments of protein sequences (Edgar, 2004). Aligned sequences were subjected to phylogenetic analysis by maximum likelihood and ultrafast bootstrap methods (Minh et al., 2013) with 1,000 replicates using iq-tree version 1.5.5 (Nguyen et al., 2015). The phylogenetic tree was constructed based on the PMB+F+R5 model.

To check whether S. ricini has high-cysteine chorion genes, amino acid sequences of 38 high-cysteine chorion proteins of B. mori were aligned to deduced amino acid sequences of 80 S. ricini chorion genes via the blastp program.

2.15 Identifying fibroin and sericin genes in the Samia ricini genome

The Fib-H (BAQ55621.1) and p25 (LC001863.1, LC001864.1 and LC001865.1) genes of S. ricini are already registered in GenBank and were used as queries in blastp searches against 16,702 gene models of S. ricini using an e-value cut-off of less than 1e-5 and “-seg no” option. Where no blastp hits were reported, tblastn searches against nucleotide sequences of the S. ricini genome were conducted with the same filtering parameters. To investigate whether the homologue of Fib-L is present or not in the S. ricini genome, B. mori Fib-L (NP_001037488.1) was utilized as a query for blastp and tblastn searches. In addition, we performed tblastn searches against the A. yamamai genome using B. mori Fib-L sequence as query.

Tsubota et al. (2015) and Dong et al. (2015) reported that five and four sericin genes are expressed in anterior silk gland and middle silk gland, respectively (Table S7). The deduced amino acid sequences of putative sericin transcripts were searched against the gene model set of S. ricini using blastp. For LC001867 and LC001870, because the corresponding gene models were not found, tblastn was conducted to confirm whether both transcripts were present in the genome.

To determine the repertoire of silk protein-encoding genes in D. plexippus and Plutella xylostella, tblastn searches against the genome assemblies were conducted with B. mori Fib-H (NP_001106733.1), Fib-L, p25 (NP_001139413.1) and sericin-1, 2, 3 (AB112019.1, NP_001166287.1, NP_001108116.1) sequences as queries because no transcripts or amino acid sequences were previously reported as Fib-H, Fib-L, p25 and sericin in Plutella xylostella and D. plexippus. Genome assemblies which were used for tblastn searches were those used in busco analysis (Table S4). As the transcripts of Fib-H, Fib-L and p25 of Papilio xuthus were already registered (see Table 3), those sequences were mapped to the Papilio xuthus genome sequence to confirm their presence. Regarding sericin genes in Papilio xuthus, no sequences were previously registered in GenBank, and thus the same procedure as in Plutella xylostella and D. plexippus was followed. Phylogenetic analysis of sericin was conducted with seven S. ricini putative sericin genes, three B. mori sericin genes and five A. yamamai sericin genes (LC08587, LC08588, LC08589, LC08590 and LC08591; Zurovec et al., 2016). muscle was used to generate alignments of protein sequences (Edgar, 2004). Aligned sequences were subjected to phylogenetic analysis by maximum likelihood and bootstrap methods with 1,000 replicates using megax (Kumar et al., 2018). The maximum likelihood tree under the Whelan And Goldman + Freq. model (Whelan & Goldman, 2001) was inferred. Nearest-Neighbor-Interchange (NNI) was used for heuristic tree searching. All sites including those containing gaps were used for the analysis.

2.16 Identifying responsible chromosomes of “Blue,” “Yellow,” “Spot” and “Red cocoon” phenotypes in BC1 individuals

BC1 individuals were phenotyped for one of four morphological traits: “Blue,” “Yellow,” “Spot” and “Red cocoon.” The genetic markers designed for scaffold linkage analysis were utilized in segregation analysis (see Table S5). A DNeasy Blood and Tissue kit (Qiagen) and MightyAmp DNA Polymerase Version 3 (TaKaRa) were used for DNA extraction and genomic PCR of BC1 individuals, respectively. The genomic PCR programme was as follows: 2 min at 98°C and 40 cycles of 10 s at 98°C, 15 s at 60°C and 1 min at 68°C.
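The segregation analysis can be pictured as a search for markers whose genotype perfectly predicts the phenotype among BC1 individuals. The Python sketch below is a minimal illustration that assumes a fully penetrant trait scored in two classes; the function name and the 'H'/'R' and '1'/'0' encodings are hypothetical and not taken from the study.

```python
def cosegregating_markers(marker_genotypes, phenotypes):
    """Return markers whose genotype perfectly predicts the phenotype in BC1.

    marker_genotypes: marker name -> string of 'H'/'R' calls, one per individual.
    phenotypes: string of '1' (trait present) / '0' (trait absent), same order.
    A marker qualifies if every 'H' individual shares one phenotype and every
    'R' individual shares the other.
    """
    hits = []
    for marker, calls in marker_genotypes.items():
        if len(calls) != len(phenotypes):
            raise ValueError(f"{marker}: genotype and phenotype counts differ")
        pairs = set(zip(calls, phenotypes))
        genotypes_seen = {g for g, _ in pairs}
        phenotypes_seen = {p for _, p in pairs}
        # Perfect co-segregation leaves exactly two (genotype, phenotype) pairs,
        # with both genotypes and both phenotypes represented.
        if len(pairs) == 2 and len(genotypes_seen) == 2 and len(phenotypes_seen) == 2:
            hits.append(marker)
    return sorted(hits)

# Hypothetical calls for four BC1 individuals (not data from the study)
markers = {"marker_1": "HHRR", "marker_2": "HRHR"}
linked = cosegregating_markers(markers, "1100")
```

Because the check accepts either direction of association, it does not presuppose whether the trait allele comes from S. ricini or S. c. pryeri.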

3 RESULTS AND DISCUSSION

3.1 Overview of Samia ricini genome assembly

The final assembly of the Samia ricini genome was 450,479,495 bp long with 155 scaffolds. Because k-mer analysis (k = 31) estimated that the genome size ranges from 439,526,288 to 439,568,542 bp, we concluded that the assembled genome size of 450 Mb was reasonable. The N50 length of the assembly was ~21 Mb (Table 1). GC content was 34.3%. The longest scaffold length was ~34 Mb. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis using busco version 3.0.2 with insecta odb9, including 1,658 BUSCOs from 42 species, revealed that 97.9% of BUSCOs were completely detected in the assembled genome (1,615, complete and single-copy; eight, complete and duplicated) among 1,658 tested BUSCOs (see Table S4). To the best of our knowledge, these statistics are the best among currently available lepidopteran genome assemblies (Challis et al., 2016; Kim et al., 2018; Triant et al., 2018). Low heterozygosity in the S. ricini strain used for this project might be the key to the successful assembly: k-mer distribution analysis (k = 31) estimated that heterozygosity in one male individual of S. ricini was 0.0469 ± 0.0003% (Table 1, see Figure S2), considerably lower than the estimated heterozygosity (0.807%) of a male individual of Ay-7, an inbred line in Antheraea yamamai, which belongs to the same family (Figure S2; Kim et al., 2018). We postulated that the difference in heterozygosity can be partly explained by the degree of difficulty in inbreeding: under laboratory conditions, because multivoltine S. ricini can generate at least six generations per year, inbred crossing can be performed six times a year, whereas crossing of univoltine A. yamamai can be performed once a year. Thus, A. yamamai may have experienced considerably fewer generations in laboratory conditions and still retain higher heterozygosity, although we cannot exclude the presence of genetic loads that prevent reproduction of homozygous progeny for some loci in A. yamamai.

Table 1. Features of the Samia ricini genome
Total bases 450,479,495
No. of chromosomes 14 (male)/13 (female)
No. of scaffolds 155
Longest scaffold (bp) 33,970,159
Average scaffold length (bp) 2,906,319
Scaffold N50 (bp) 21,366,385
Heterozygosity (%) 0.0466–0.0472
Protein-coding genes 16,702
G+C content (%) 34.3
Repetitive elements (% (bp)) 43.5 (196,045,652 bp)
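
For readers unfamiliar with the scaffold N50 statistic in Table 1, the following Python sketch shows one common way to compute it from a list of scaffold lengths. The toy lengths are hypothetical; only the total assembly size and scaffold count in the last lines come from Table 1.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Hypothetical scaffold lengths (bp), not the S. ricini assembly
toy_n50 = n50([50, 40, 30, 20, 10])

# Mean scaffold length from the Table 1 totals (total bases / no. of scaffolds)
mean_scaffold_length = 450_479_495 / 155
```

The mean computed in the last line rounds to the average scaffold length listed in Table 1.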

Linkage analysis of 35 scaffolds (>1 Mb) revealed that the scaffolds are grouped into 14 linkage groups (Table 2; Figure S1b), which is consistent with a previous report (Yoshido et al., 2011) in which BAC-FISH (bacterial artificial chromosome-fluorescence in situ hybridization) showed that S. ricini has 13 autosomes and one Z chromosome (male: 2n = 28, female: 2n = 27). These 35 scaffolds totalled 443,618,927 bp, meaning that ~98.5% of the genome was assigned to the chromosomes (Table 2). Because the orientations of the scaffolds were not experimentally determined and sizes of gaps between scaffolds are unknown, chromosome-scale scaffolding techniques such as Hi-C sequencing (Burton et al., 2013; Dudchenko et al., 2017) or optical mapping (Jiao et al., 2017; Ouzhuluobu et al., 2020) would be desirable.

Table 2. The result of linkage analysis
Chromosome Scaffold
Z 13, 30
1 2, 26
2 10, 17, 21, 24
3 8, 14, 19
4 4, 25
5 1, 23, 27
6 6, 35
7 11, 15
8 3
9 7
10 9, 18
11 22, 28, 31, 32, 33, 34
12 12, 16, 30
13 5, 29

3.2 Repetitive sequences found in the assembled genome

The repeatmasker program (Tarailo-Graovac & Chen, 2009) estimated that repeat elements occupy 43.5% (196,045,652 bp) of the assembled genome (Table 1). Except for “unclassified” repeats, long interspersed nuclear element (LINE) is the largest superfamily of repetitive sequences in S. ricini (Figure 2). Interestingly, although the total length of LINE and its proportion of all repetitive sequences in the genome were similar between S. ricini and B. mori (Figure 2), the components of LINE families were different. Table S8 shows the copy number of each LINE family in S. ricini and B. mori genomes. For example, while the CR1-Zenon family was the largest LINE family in S. ricini, the largest family in B. mori was Jockey. Given these results, although both S. ricini and B. mori have larger amounts of repetitive sequences in the genome than other lepidopteran species do (Figure 2a), the expansion of repetitive sequences seems to have occurred in parallel and independently on their own phylogenetic branches.

Figure 2. Amount and proportion of repeat sequences in the Samia ricini genome. Amount (Mb; a) and proportion (%; b) of repetitive sequences in five lepidopteran species, including S. ricini, Bombyx mori, Danaus plexippus, Papilio xuthus and Plutella xylostella

Another noteworthy feature was that the S. ricini genome contains a remarkably small amount of short interspersed nuclear elements (SINEs) (Figure 2a). While SINEs made up a large proportion of repetitive sequences in the B. mori genome (19.4% of all repetitive sequences), SINEs in the S. ricini genome accounted for only 0.0588%. This finding also supports the hypothesis of parallel and independent expansion of repetitive sequences.

3.3 Gene prediction and comparative genome analysis

To maximize the utility of the S. ricini genomic resource for genetic research, we opted to perform gene prediction on a soft-masked genome in which annotated repetitive sequences are not converted to N’s. While hard-masking repetitive elements can prevent gene prediction within repetitive regions, there is an accompanying risk of missing genes or predicting truncated genes, especially in genomes which are abundant in repetitive sequences. evidencemodeler predicted 16,702 protein-coding genes in the soft-masked genome of S. ricini (Table 1), integrating the output of braker2, pasa, stringtie and exonerate (see “Materials and Methods” section). busco analysis revealed that 91.9% of BUSCOs were completely detected in the predicted genes (1,513, complete and single-copy; 10, complete and duplicated) among 1,658 tested BUSCOs (Table S9). The estimated completeness of the annotation is slightly lower than that of the genome (Table S4) and implies some limitation with gene prediction pipelines. This relationship between annotation and assembly completeness is not uncommon and in this study was also the case for B. mori (Table S4). Additional transcriptome data on different developmental stages might improve gene prediction where the deficiency arises from transcripts that are poorly represented in the original annotation data set. Isoform sequencing (Iso-seq) using a long-read sequencer will also contribute to more precise gene models (Sharon et al., 2013).
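The completeness percentages for the assembly and for the predicted gene set follow directly from the BUSCO counts reported in the text. The short Python sketch below reproduces that arithmetic; the function name is a hypothetical helper, and the counts are those given in sections 3.1 and 3.3.

```python
def busco_complete_percent(single_copy, duplicated, total):
    """Percentage of BUSCOs found complete (single-copy plus duplicated)."""
    return 100 * (single_copy + duplicated) / total

genome = busco_complete_percent(1615, 8, 1658)       # assembly (section 3.1)
annotation = busco_complete_percent(1513, 10, 1658)  # predicted genes (section 3.3)
print(f"genome: {genome:.1f}%, annotation: {annotation:.1f}%")
# prints "genome: 97.9%, annotation: 91.9%"
```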

interproscan (Finn et al., 2017) analysis shows that the Reverse transcriptase domain (IPR000477) and Integrase catalytic core domain (IPR001584) are the two most well-represented domains in S. ricini genes (Figure S3). As gene prediction on the soft-masked genome did not prohibit but only penalized prediction within the repetitive regions, this result may reflect the large amount of retrotransposable elements (SINE, LINE and long terminal repeat (LTR) in Figure 2) in the S. ricini genome.

The circos plot, which links single-copy orthologues between B. mori and S. ricini, shows that large-scale rearrangements of chromosomes, such as translocation and chromosome fusion, occurred in the ancestor of S. ricini (Figure 3a; Cheong et al., 2015; Krzywinski et al., 2009). However, despite frequent chromosomal rearrangements, genomic regions with sparse or no links between the two genomes are both infrequent and small, suggesting that there are few “species-specific” regions and that most of the genomic content of the two species is reciprocally corresponding.

Figure 3. Comparison between the Samia ricini and Bombyx mori genomes. (a) Left side of the ideogram represents chromosomes of B. mori and right side represents scaffolds of S. ricini. “bm_1” to “bm_28” correspond to the chromosomes of B. mori. For S. ricini, 35 scaffolds, “Sr_HGAP_JL_scaf_1 (sr_1)” to “Sr_HGAP_JL_scaf_35 (sr_35)”, are shown. Outer ring (black) indicates putative chromosomes of S. ricini. The chromosome numbers of S. ricini are given according to Yoshido et al. (2011). Note that scaffold ordering within linkage groups was not experimentally determined. (b) Venn diagram of protein orthogroups in five lepidopteran species. Number in each section indicates the number of orthogroups. (c) Phylogenetic tree of S. ricini chorion proteins (SrCho), B. mori chorion proteins (BmCho), Plutella xylostella chorion proteins (PxyCho), Papilio xuthus chorion proteins (PxuCho) and Danaus plexippus chorion proteins (DpCho). Branch colours are: red, BmCho; blue, SrCho; purple, PxyCho; green, PxuCho; orange, DpCho

The overlap of orthogroups (OGs) identified using orthofinder (Emms & Kelly, 2015) among five lepidopteran species is shown in Figure 3b. Note that the five species were annotated independently with different methods, so some genes may have been missed in a particular genome. Nonetheless, the orthogroups overlap well among the species, and 205 S. ricini-specific OGs, comprising 1,586 genes, were identified (Figure 3b; Table S10). Of the 1,586 S. ricini-specific genes, 873 were not given any GO term annotation (Table S10). Of the 205 S. ricini-specific OGs, 46 are related to retrotransposable elements (Figure 2; Table S10), leaving 159 S. ricini-specific OGs that are not retrotransposon-related. Of these OGs, two (OG0000113 and OG0000131) consist of 33 and 30 chorion protein genes, respectively. These S. ricini-specific chorion genes are located in close proximity on Chr. 1 as a gene cluster, which may explain the high apparent duplication rate through tandem duplication or gene conversion. In addition to the above-mentioned 63 S. ricini-specific chorion genes, 17 chorion genes were found in this cluster. Table S6 summarizes all 80 chorion genes present in the S. ricini genome. Phylogenetic analysis of these genes along with chorion genes of B. mori, Plutella xylostella, Papilio xuthus and Danaus plexippus suggests that gene duplication could have resulted in diversification of chorion proteins, because chorion genes from OG0000113 and OG0000131 fell into distinct clades (Figure 3c).
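As an illustration of how species-specific OGs can be extracted, the following minimal sketch assumes a tab-separated gene-count table with an `Orthogroup` column, one column per species and a `Total` column (a layout resembling the gene-count table written by recent orthofinder releases). The species column names and the first and last toy rows are hypothetical; only the gene counts of OG0000113 and OG0000131 come from the text.

```python
import csv
import io


def species_specific_orthogroups(gene_count_tsv, species):
    """Map each orthogroup found only in `species` to its gene count."""
    reader = csv.DictReader(io.StringIO(gene_count_tsv), delimiter="\t")
    others = [col for col in reader.fieldnames
              if col not in ("Orthogroup", "Total", species)]
    specific = {}
    for row in reader:
        n_genes = int(row[species])
        # Specific means: present in the focal species, absent everywhere else.
        if n_genes > 0 and all(int(row[col]) == 0 for col in others):
            specific[row["Orthogroup"]] = n_genes
    return specific


# Toy table; column names and the rows OG0000001/OG0000200 are made up.
TOY_COUNTS = "\n".join([
    "Orthogroup\tS_ricini\tB_mori\tP_xylostella\tP_xuthus\tD_plexippus\tTotal",
    "OG0000001\t2\t1\t1\t1\t1\t6",
    "OG0000113\t33\t0\t0\t0\t0\t33",
    "OG0000131\t30\t0\t0\t0\t0\t30",
    "OG0000200\t0\t3\t0\t0\t0\t3",
])

sric_specific = species_specific_orthogroups(TOY_COUNTS, "S_ricini")
n_specific_genes = sum(sric_specific.values())
```

With the toy table, the two chorion orthogroups are returned and their gene counts sum to the 63 S. ricini-specific chorion genes mentioned above.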

Chorion proteins are major components of the eggshell and protect embryos from the environment, suggesting that chorion proteins are likely to evolve to reflect adaptations to the environment (Lecanidou et al., 1986; Papantonis et al., 2015; Rodakis & Kafatos, 1982). Based on sequence homology, chorion proteins can be categorized into two groups (α and β), each of which includes three subfamilies (Lecanidou et al., 1986; Papantonis et al., 2015). Among the three subfamilies, the high-cysteine (Hc) chorion is considered to play an important role in embryonic diapause, because Hc chorion proteins increase the hardness of eggshells, allowing embryos to survive diapause in the winter (Rodakis & Kafatos, 1982). Interestingly, according to the blast search and phylogenetic analysis, Hc chorion protein genes seem to be absent from the S. ricini genome (Figure 3c; Tables S6 and S11). Moreover, the average cysteine content of the 80 chorion proteins of S. ricini was ~6.00%, whereas that of the Hc class chorion proteins of B. mori was ~27.5% (Table S12). Given that S. ricini is a nondiapause species, it is plausible that S. ricini lacks Hc chorion protein genes and that this is a functionally relevant difference between diapause and nondiapause species.
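The cysteine content comparison can be sketched as follows. This is a minimal illustration that assumes cysteine content is the percentage of residues that are cysteine and that the average is taken over per-protein percentages; the exact procedure used for Table S12 is not stated here, and the sequences below are made-up toy strings rather than real chorion proteins.

```python
def cysteine_percent(protein_seq):
    """Percentage of residues in a protein sequence that are cysteine (C)."""
    seq = protein_seq.strip().upper().rstrip("*")  # drop a trailing stop symbol
    if not seq:
        raise ValueError("empty protein sequence")
    return 100.0 * seq.count("C") / len(seq)


def mean_cysteine_percent(protein_seqs):
    """Unweighted mean of the per-protein cysteine percentages."""
    values = [cysteine_percent(seq) for seq in protein_seqs]
    if not values:
        raise ValueError("no sequences given")
    return sum(values) / len(values)


# Toy sequences (hypothetical): one cysteine-rich, one cysteine-free.
toy_mean = mean_cysteine_percent(["ACCA", "AAAA"])
```

An unweighted mean gives every protein the same weight regardless of length; pooling all residues before dividing would be an equally defensible alternative and can give a different value.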

3.4 Fibroin and sericin

Fibroin is the major component of silk protein. Although fibroin of B. mori consists of three polypeptides, namely heavy-chain (Fib-H), light-chain (Fib-L) and fibrohexamerin (p25) (Inoue et al., 2000), it has been biochemically confirmed that fibroin of S. ricini lacks Fib-L and p25 and consists of a Fib-H/Fib-H homodimer (Tamura & Kubota, 1988). The complete amino acid sequence of Fib-H (SrFib-H) was determined by Sezutsu and Yukuhiro (2014), but our gene prediction was unable to properly construct the gene model for SrFib-H, mainly because of its repetitive sequences. However, a tblastn search using SrFib-H as the query detected the near-complete coding sequence of SrFib-H (Figure S4a), supporting the accuracy of the assembly. The genome information also revealed that the S. ricini genome has three copies of p25, in addition to Fib-H, but lacks Fib-L (Table 3). In addition, we confirmed through tblastn searches that Fib-L is absent from the genome of A. yamamai (Kim et al., 2018), another saturniid moth (Figure S4b). Because other lepidopteran species, including B. mori, Plutella xylostella, Papilio xuthus and Corcyra cephalonica (Chaitanya & Dutta-Gupta, 2010), possess the Fib-L gene, the absence of Fib-L in saturniid moths can be ascribed to the loss of Fib-L in the common ancestor of Saturniinae.

Table 3. The presence of Fib-H, Fib-L, fibrohexamerin (p25) and sericin genes in five lepidopteran genomes
Samia ricini Bombyx mori Plutella xylostella Papilio xuthus Danaus plexippus
Fib-H 1 1 NP_001299362.1
Fib-L 0 1 NP_001299492.1,BAB39503.1 ?
fibrohexamerin (p25) 3 8 NP_001299201.1,BAB39504.1
Ser1/Ser3 6 2 ? ?
Ser2 1 1 ? ?

Note

  • Numbers for S. ricini and B. mori are the copy numbers of each gene. The accession numbers for Papilio xuthus were derived from the transcripts of the corresponding genes registered at GenBank. Circles for Plutella xylostella and D. plexippus indicate that the genome assembly of each species has at least one genomic region showing high similarity to the B. mori silk proteins with an e-value less than 1e-5. Question marks indicate that the blast search failed to identify any region with high similarity.
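The presence/absence rule in the table note can be sketched as a filter on tabular blast results. The sketch assumes the standard 12-column tabular layout of BLAST+ (query identifier in column 1, e-value in column 11); the function name, query identifiers and hit lines are hypothetical toy values, not results from this study.

```python
def has_significant_hit(blast_tab_text, query_id, max_evalue=1e-5):
    """True if `query_id` has at least one hit with e-value below the cutoff."""
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete 12-column record
        if fields[0] == query_id and float(fields[10]) < max_evalue:
            return True
    return False


# Toy hits (made up): a strong hit for one query, a weak hit for another.
TOY_HITS = "\n".join([
    "BmFibH\tscaf_1\t45.0\t300\t150\t5\t1\t300\t1000\t1900\t2e-30\t120",
    "BmFibL\tscaf_7\t25.0\t40\t30\t0\t10\t50\t500\t620\t0.5\t20.1",
])

fibh_found = has_significant_hit(TOY_HITS, "BmFibH")
fibl_found = has_significant_hit(TOY_HITS, "BmFibL")
```

In this toy case the first query would be marked as having a similar genomic region and the second would receive a question mark, because its only hit has an e-value above 1e-5.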

As described above, silk fibroin of B. mori consists of H-chain, L-chain and p25. The three fibroin polypeptides assemble in a 6:6:1 molecular ratio, which is considered to be indispensable for proper secretion of fibroin: mutations in Fib-H or Fib-L cause fibroin secretion deficiency (Inoue et al., 2000; Ma et al., 2014). Bombyx mori strains with deletions in Fib-H or Fib-L cannot properly secrete fibroin protein into the lumen of the silk glands, and their cocoons are mainly composed of sericin. Therefore, it has been hypothesized that B. mori has a mechanism which recognizes the three-dimensional structure of fibroin assembled from the three polypeptides in a 6:6:1 molecular ratio and selectively transports the fibroin polypeptide complex into the lumen of the silk glands (Inoue et al., 2000). As saturniid species lack the Fib-L gene, the fibroin transportation and secretion system in saturniid species must be different from that in B. mori.

Thus far, the biological function of p25 remains unclear, and whether knockout of p25 affects the secretion of fibroin remains to be determined. Because p25 protein is undetectable in S. ricini silk, p25 may have a function other than being part of the fibroin complex. The presence of multiple copies of p25 in the S. ricini genome raises the possibility of functional differentiation among the paralogous p25 genes (Table 3).

Sericin is the second most abundant silk protein after fibroin. Unlike fibroin, sericin is water-soluble and makes up the outermost layer of silk. Bombyx mori has three sericin genes, Ser1, Ser2 and Ser3 (Tsubota et al., 2015). While Ser1 and Ser3 are components of cocoon protein, Ser2 is not present in the cocoon (Takasu et al., 2010). Two proteins derived from alternative splicing of Ser2 can be found in larval silk produced during the growth stages (Takasu et al., 2010). To date, nine transcripts are registered in NCBI GenBank as Sericin-encoding or Sericin-like genes in S. ricini (Table S7; Dong et al., 2015; Tsubota et al., 2015). blast analysis confirmed that all of them are present in the S. ricini genome and are transcribed from seven loci, meaning that S. ricini has seven putative Sericin genes. Phylogenetic analysis showed that four of the seven genes are categorized within the Ser1/3 class and the other three genes are included in Ser2 (Figure S5). Although A. yamamai and S. ricini belong to the same family (Saturniidae), their sericin gene repertoires are quite different: Ser1/3 class genes seem to have expanded in A. yamamai. Phylogenetic analysis revealed that all sericin genes in A. yamamai belong to the Ser1/3 class and that Ser2 class genes were not identified, while S. ricini possesses three Ser2 class genes (Figure S5). The diversity of sericin genes among these saturniids may reflect differences in their indigenous environments. However, whether the proteins encoded by the seven putative Ser genes in S. ricini are present in cocoons remains to be elucidated. Proteomic analysis of S. ricini cocoons should be carried out to reveal their protein composition.

3.5 Identification of the chromosomes responsible for larval and cocoon phenotypes in Samia ricini

To examine the feasibility of identifying trait-related genes in S. ricini, we initiated forward genetic analysis using S. ricini and S. c. pryeri. Some morphological traits are different between S. ricini and S. c. pryeri (Figures 1 and 4a), and thus such traits may be good targets for forward genetic analysis.

Figure 4. Graphical view of Samia c. pryeri and hybrid progenies. (a) Fifth-instar larva of S. c. pryeri. (b) Fifth-instar larvae of Samia ricini and three forms of BC1 obtained by the crossing (S. ricini × S. c. pryeri) × S. ricini. (c) Cocoons of S. ricini, S. c. pryeri, F1 individuals and “Red cocoon” individuals in BC1. Scale bars: 10 mm

As shown in Figure 4b,c, the phenotypes originally derived from S. c. pryeri were isolated in backcross generation 1 (BC1) individuals, which were obtained by crossing (S. ricini × S. c. pryeri) × S. ricini. Here, we tried to identify the chromosomes responsible for four phenotypes, namely “Blue,” “Yellow,” “Spot” and “Red cocoon.” “Blue” and “Yellow” refer to a blue and a yellow larval integument, respectively. “Spot” refers to black spots on the larval integument. “Red cocoon” refers to the red colour of the cocoons that some BC1 individuals produce.

Because meiotic recombination does not occur in lepidopteran females, each chromosome of the BC1 individuals should be either an S. ricini/S. c. pryeri heterozygote or an S. ricini/S. ricini homozygote, and not chimeric. Given that the above-mentioned four phenotypes derived from S. c. pryeri are dominant, the responsible chromosome should be heterozygous in all BC1 individuals that show the corresponding phenotype.

Genomic PCR with chromosome-specific markers, which can molecularly distinguish S. ricini and S. c. pryeri, revealed that Chr. 8, 13, 3 and 12 were uniformly heterozygous in all examined “Blue,” “Yellow,” “Spot” and “Red cocoon” individuals, respectively, linking the causal loci for these traits to those chromosomes (Figure S6). This is the first report to demonstrate that forward genetic analysis is achievable in S. ricini (and S. c. pryeri). Although the genes responsible for the four phenotypes have not yet been identified, this is the first step toward that goal.
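The linkage logic used here can be written as a minimal sketch: a chromosome remains a candidate for a dominant S. c. pryeri-derived trait only if every BC1 individual showing the trait is heterozygous for that chromosome. The data structure, function name and genotype calls below are hypothetical toy values, not the marker data of Figure S6.

```python
def candidate_chromosomes(genotypes, carriers):
    """Chromosomes that are heterozygous in every trait-carrying individual.

    genotypes: {individual: {chromosome: "het" or "hom"}}
    carriers:  individuals that show the dominant phenotype
    """
    carriers = list(carriers)
    if not carriers:
        return []
    chromosomes = sorted(genotypes[carriers[0]])
    return [chrom for chrom in chromosomes
            if all(genotypes[ind][chrom] == "het" for ind in carriers)]


# Toy genotype calls (made up) for three chromosomes in four BC1 individuals.
TOY_GENOTYPES = {
    "bc1_a": {3: "het", 8: "het", 12: "hom"},
    "bc1_b": {3: "hom", 8: "het", 12: "het"},
    "bc1_c": {3: "het", 8: "het", 12: "het"},
    "bc1_d": {3: "het", 8: "hom", 12: "hom"},  # does not show the trait
}

blue_candidates = candidate_chromosomes(TOY_GENOTYPES, ["bc1_a", "bc1_b", "bc1_c"])
```

Because each chromosome is expected to be heterozygous in about half of the BC1 individuals, an unlinked chromosome is uniformly heterozygous in n carriers by chance with probability of roughly (1/2)^n, so examining more individuals narrows the candidate list.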

4 CONCLUSION

Here we have reported a high-quality genome sequence of Samia ricini, which shows reciprocal correspondence at the chromosome scale to the B. mori genome. We have also shown that forward genetic analysis of specific traits is feasible in S. ricini by identifying the chromosomes responsible for four larval and cocoon traits. We anticipate that this report will pave the way for “forward genetics of wild silkmoth.”

ACKNOWLEDGEMENTS

We thank Dr Masahiro Kasahara for valuable advice on the assembly strategy, and Dr Ken Sahara for helpful discussions. This study was supported by MEXT KAKENHI grants 22128001, 22128004, 22128008 and 17H06431; JSPS KAKENHI grants 15H02482, 17H05047 and 18H03949; and Science and Technology Research Promotion Program for Agriculture, Forestry, Fisheries and Food Industry grant no. 26034A.

AUTHOR CONTRIBUTIONS

J.L. designed the experiments, analysed the data and wrote the manuscript. T.N. prepared Illumina libraries for genome sequencing. K.Y. and S.S. performed the sequence runs. Y.S. prepared the library for RNA-seq and performed the sequence runs. T.S., T.K., S.K. and J.L. discussed the results. S.K., T.K., T.S. and especially T.N. commented on and revised the manuscript.

DATA AVAILABILITY STATEMENT

The primary genome sequence data sets obtained in this study are available under accession nos. DRR213145–DRR213155 (Illumina short-read), and DRR213156–DRR213159 (PacBio long-read). RNA-seq data are available under accession nos. DRR213133–DRR213143. Genome assembly data have been deposited in DDBJ under accession nos. BLXV01000001–BLXV01000155.
