Volume 48, Issue 1 pp. 60-72
Research Paper

RNA Sequencing, De novo assembly, functional annotation and SSR analysis of the endangered diving beetle Cybister chinensis (= Cybister japonicus) using the Illumina platform

Hee-Ju Hwang
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea
These authors contributed equally to this work.

Bharat Bhusan Patnaik
Trident School of Biotech Sciences, Trident Academy of Creative Technology (TACT), Chandaka Industrial Estate, Bhubaneswar, Odisha, India
These authors contributed equally to this work.

Se Won Kang
Biological Resource Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Jeongeup-Si, Jeollabuk-do, South Korea

So Young Park
Nakdonggang National Institute of Biological Resources, Biodiversity Conservation and Change Research Division, Sangju-si, Gyeongsangbuk-do, South Korea

Jong Min Chung
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Min Kyu Sang
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Jie Eun Park
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Hye Rin Min
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Jiyeon Seong
Genomic Informatics center, Hankyong National University, Anseong-si, Kyonggi-do, South Korea

Yong Hun Jo
Division of Plant Biotechnology, Institute of Environmentally-Friendly Agriculture (IEFA), College of Agriculture and Life Sciences, Chonnam National University, Gwangju, Republic of Korea

Mi Young Noh
Division of Plant Biotechnology, Institute of Environmentally-Friendly Agriculture (IEFA), College of Agriculture and Life Sciences, Chonnam National University, Gwangju, Republic of Korea

Jong Dae Lee
Department of Environmental Health Science, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Ki Yoon Jung
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Hong Seog Park
Research Institute, GnC BIO Co., LTD, Daejeon, South Korea

Heon Cheon Jeong
Hampyeong county Insect Institute, Hampyeong County Agricultural Technology Centerm 90, Jeollanam-do, South Korea

Yong Seok Lee (Corresponding Author)
Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, Asan, Chungcheongnam-do, South Korea

Correspondence
Yong Seok Lee, Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungcheongnam-do 31538, Korea.
Email: [email protected]

First published: 27 January 2018

Abstract

Cybister chinensis Motschulsky, 1854 (synonym Cybister japonicus Sharp, 1873) is a beetle found in ponds and irrigation canals near rice fields, where it regulates the aquatic faunal community through predation. However, the beetle is threatened by loss of natural habitats, use of pesticides, and invasion of alien species. Conservation studies are hindered to a large extent by the lack of understanding of its trophic ecology and genomics. In the present study, the Illumina HiSeq 4000 platform was used to unravel the whole-larval transcriptome of the beetle. A total of 20,129 non-redundant unigenes were assembled from 67,260,666 clean reads. A total of 18,743 unigenes had a homologous match in at least one of the PANM, UniGene, Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), KEGG, and InterProScan databases. Zinc finger domains topped the unigene hits, and about 660 enzymes (2695 sequences) participating in metabolism, environmental information processing, genetic information processing, and organismal system pathways were recorded. Furthermore, the HSP70 class, Toll-like receptor 4, insulin-receptor substrate, and AMP-activated protein kinase were conspicuously present in the larval transcriptome. Out of a total of 12,491 unigene sequences examined, 1968 SSRs were detected. The majority were dinucleotide repeats with six iterations, followed by trinucleotide and tetranucleotide repeats with five and four iterations, respectively. This is the first report of cDNA resources from C. japonicus to date. The data would be crucial for assessing the beetle in the wild and for building an inventory for utilisation in future genomics and ecological studies.

Introduction

Cybister japonicus Sharp (Coleoptera: Dytiscidae), commonly known as the diving beetle, is native to Southern Asia. The species is found in paddy fields, spending its larval phase in water and migrating to other habitats as an adult. It is predatory, feeding on aquatic insects, except for the 3rd instar larvae, which feed on vertebrates such as tadpoles (Ohba and Inatani 2012). This is important for understanding the trophic ecology of the species under the insect conservation program (Ohba 2009a). The predatory ability of the beetle against the Japanese encephalitis vector, Culex tritaeniorhynchus, and against larval populations of Anopheles sinensis has been reported (Ree 2005; Ohba and Takagi 2010). Further, morphological and ultrastructural studies of the antennae and labial palps of the beetle have generated information on the chemoreceptors that are critical for detecting food (Song et al. 2017) and for chemical communication between the sexes (Song et al. 2016).

The population of C. japonicus, once abundant in rice fields and ponds, has declined dramatically on the mainland. The 2010 Tokyo Red List released by the Tokyo metropolitan government categorized C. japonicus as an extinct species. The species is threatened by loss of natural habitats, invasion of alien species, use of pesticides in paddy fields, and limited food resources (Ohba and Inatani 2012). Considering the economic value of the species as a biological control agent and its endangered status, it has been placed on the protected list. Some recent efforts have provided insights into the feeding preferences of the species as one of the strategies towards insect conservation plans (Ohba 2009a, 2009b). However, the genetic background of the species is still unexplored, making it difficult to extract phenotypic cues and devise survival strategies for the insect in its natural habitat. The National Centre for Biotechnology Information (NCBI) Taxonomy browser entry for C. japonicus contains details of odorant-binding proteins 1 and 2 (Song et al. 2016), cytochrome oxidase subunits I and II, histone III, and wingless proteins (https://www.ncbi.nlm.nih.gov/protein/?term=txid398594[Organism:noexp]) (Miller et al. 2007).

Next-generation sequencing (NGS) platforms have been increasingly used to map genetic regulatory circuits and provide insights into the survival and adaptation strategies of commercial and threatened species of insects (Morozova and Marra 2008; Wheat 2010; Patnaik et al. 2016). Moreover, NGS platforms have been useful for generating transcriptome data and analysing the complexity of the transcriptome in non-model insect species (Oppenheim et al. 2015; Patnaik et al. 2015; Patnaik et al. 2016), revealing candidate genes involved in stress responses, chemosensory processes, metabolism, and immune processes of insects, including beetles (Vongsangnak et al. 2016; Duan et al. 2017; Wei et al. 2017). Pyrosequencing on the Roche 454 FLX Titanium platform enabled large-scale gene discovery in the coleopteran pest, Eucryptorrhynchus chinensis (Liu and Wen 2016), the ground beetles, Carabus iwawakianus and Carabus uenoi (Fujimaki et al. 2014), and the banana weevil, Cosmopolites sordidus (Valencia et al. 2016). Illumina second-generation sequencing has taken the lead in de novo transcriptome analysis of the coleopteran insect, Dastarcus helophoroides (Zhang et al. 2014), four species of luminescent beetles (Wang et al. 2017), and many other species used in integrated pest management schemes. In this study, we used Illumina HiSeq 4000 sequencing and de novo assembly to analyse the transcriptome of the endangered diving beetle, C. japonicus. Further, the TransDecoder program was used to shortlist putative transcripts with open reading frames (ORFs), which were then clustered into non-redundant unigenes. Functional annotation of the unigenes was conducted using the COG, GO, InterPro, and KEGG databases. We also identified simple sequence repeats (SSRs) that, once validated, could be used to understand species variability and polymorphism in populations. The genetic resource cataloguing of C. japonicus would assist in understanding the genera diversity and may be one of the tools required for insect conservation programs.

Materials and Methods

Sample collection, processing, and RNA extraction

C. japonicus is a protected species under the Endangered Wildlife law. Presently, it is categorized as a species of Least Concern (LC) in the Red List. Hence, no permit was required for the collection of the species. For RNA extraction, C. japonicus adults were collected from Deogcheon-ri, Gujwa-eup, Jeju-si, Jeju-do, Korea. The whole bodies of the adults were ground to a fine powder in liquid nitrogen using a mortar and pestle. Total RNA was isolated using Trizol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's recommendations and stored at −80°C until further use. The total RNA was treated with RNase-free DNase I (Qiagen, Hilden, Germany) as described in the manufacturer's protocol. The integrity of the DNase-treated RNA was evaluated using the NanoDrop 2000 spectrophotometer (NanoDrop, Wilmington, DE, USA) and gel electrophoresis. Before Illumina sequencing, the RNA samples were confirmed to have an RNA integrity number (RIN) > 7 using the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).

cDNA library construction and Illumina Sequencing

The cDNA library construction and Illumina HiSeq 4000 sequencing were conducted at GnC Bio-Company Limited, Daejeon, Korea. Briefly, mRNA was purified from the total RNA using oligo (dT) magnetic beads. Using a fragmentation buffer, the mRNA was sheared into shorter fragments at 94°C for 5 min. First-strand cDNA was then synthesized from the mRNA fragments using N6 random primers and reverse transcriptase. The second-strand cDNA was synthesized using buffer, dNTPs, DNA polymerase I, and RNase H. The synthesized double-stranded cDNA was end-repaired using an End Repair (ERP) mix. Further, a single ‘A’ overhang was added to the 3′-end of the end-repaired fragments, which prevents the fragments from ligating to one another during adapter ligation. Sequencing adapters were ligated to the ends of the cDNA, and the products were analysed by agarose gel electrophoresis. The cDNA libraries were enriched by PCR amplification, and paired-end transcriptome sequencing was conducted on the Illumina HiSeq 4000 platform, generating reads of 2 × 100 base pairs (bp). The raw data obtained from Illumina sequencing have been submitted to the Sequence Read Archive (SRA) at the National Centre for Biotechnology Information (NCBI) under accession number SRR5182815. The C. japonicus transcriptome is held under BioProject PRJNA362249 and BioSample SAMN06236838. The datasets and the assembled contig information are available for download from http://bioinfo.sch.ac.kr/submission/.

Assembly and annotation

After the generation of raw reads, their quality was assessed using FastQC (Version 0.11.3) (www.bioinformatics.babraham.ac.uk/projects/fastqc/). The command-line tool Cutadapt (http://code.google.com/p/cutadapt/) was run with default parameters (for paired-end reads: -a ADAPT1 -A ADAPT2 -o out1.fastq -p out2.fastq in1.fastq in2.fastq) to filter adapter-only sequences (number of nucleotides of recognized adapter ≤13, and number of nucleotides excluding the adapter ≤35). Next, low-quality reads were trimmed using the base-calling program Phred (quality scores ≥20) (Ewing & Green 1998). Finally, possible GC-content bias was removed to obtain quality reads for an accurate de novo assembly. Trinity (release v2.0.4; http://github.com/trinityrnaseq/trinityrnaseq/) was used for the de novo assembly of the quality reads, with a k-mer size of 25 and a minimum allowed length of 25 nucleotides. Trinity combines clean reads into longer contigs (overlapping sequences without gaps). The TransDecoder program (https://transdecoder.github.io/) was used to extract only the protein-coding sequences from the assembled contigs. The default parameters of a minimum length of 100 amino acids and a log-likelihood score of 0 were used for the identification. The TIGR Gene Indices Clustering Tools (TGICL) program (Pertea et al. 2003) was used for clustering and defining unigenes (sequences without Ns that could not be extended at either end).
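
The N50 length, which this study reports for reads, contigs, TransDecoder sequences, and unigenes, can be illustrated with a minimal Python sketch. It is not taken from the authors' pipeline: the function name `n50` and the toy lengths are hypothetical, and the definition assumed here is the common one (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: sort lengths from longest to shortest and return the
    length at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, not data from the study.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because N50 weights sequences by their length, it is less affected by many short fragments than the mean, which is one reason both statistics are usually reported together.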

The assembled unigenes were annotated using blastx (BLAST, the basic local alignment search tool) with an E-value cutoff of <1e-5 against several protein databases, namely Protostome DB (PANM-DB), Swiss-Prot, COG, GO, InterPro, and KEGG (Kanehisa et al. 2012; Mitchell et al. 2015). The NCBI nucleotide database UniGene DB was also used for annotation of the assembled unigenes (blastn; E-value <1e-5). PANM-DB was further used for homology mapping of the assembled unigenes with reference to E-value, identity and similarity distributions, and the hit to non-hit ratio, and to determine the top-hit species distribution for the C. japonicus unigenes. PANM-DB (http://malacol.or.kr/blast/PANM.html) is a resource developed to annotate assembled sequences from molluscs, arthropods, and nematodes more efficiently than the NCBI non-redundant database.
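
The E-value cutoff described above can be sketched as a small filter over BLAST results. This is an illustration only, under stated assumptions: it assumes the standard 12-column tabular BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), which the paper does not say was used, and the function name `best_hits` and the example rows are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Keep, for each query, the highest-scoring hit with E-value below cutoff.

    Assumes 12 tab-separated columns with the E-value in column 11 and the
    bit score in column 12 (an assumption about the output format).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue  # hit is not significant at the chosen cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical rows, not results from the study.
rows = [
    ["unigene_1", "protA", "62.5", "200", "75", "0", "1", "600", "1", "200", "1e-80", "250"],
    ["unigene_1", "protB", "40.0", "150", "90", "0", "1", "450", "1", "150", "1e-20", "90"],
    ["unigene_2", "protC", "30.0", "50", "35", "0", "1", "150", "1", "50", "0.5", "25"],
]
example = "\n".join("\t".join(r) for r in rows)
hits = best_hits(example)
print(hits)
```

In this toy input, `unigene_2` would count as a non-hit because its only alignment has an E-value above the cutoff.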

Functional analytics of unigenes using COG, Blast2Go, KEGG, and InterPro domains

The unigenes were compared with the protein sequences in the Clusters of Orthologous Groups (COG) library using blastx and then mapped to the COG classification (Tatusov et al. 2003). Further, the blastx results were imported into the Blast2Go pipeline (Conesa et al. 2005) for protein domain analysis, GO term assignment, and KEGG annotation. For COG classification, the unigenes were distributed among 25 classes. The unigenes were also annotated to GO terms under the biological process, cellular component, and molecular function categories (Ashburner et al. 2000). The Blast2Go pipeline was also used to predict conserved domains in the unigenes using the InterProScan function. To analyse unigene-relevant biochemical pathways, KEGG Orthology (KO) classification was used. The annotated unigenes under the KO classification were distributed among the ‘Environmental Information Processing’, ‘Genetic Information Processing’, ‘Metabolism’, and ‘Organismal Systems’ categories.

Identification of SSRs

For the identification of microsatellites (SSRs) in the functional unigenes, the MISA (MIcroSAtellite identification tool) program v1.0 (http://pgrc.ipk-gatersleben.de/misa/; accessed September 2016) was used (Thiel et al. 2003). SSRs with repeat units of 2 (dinucleotides) to 6 (hexanucleotides) bases were analysed by repeat motif type. Mononucleotide repeats (repeat unit of 1 base) were not considered because of the possibility of obtaining homopolymer sequences on the Illumina platform.
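
The kind of search described above can be sketched with regular expressions. This is a simplified stand-in, not MISA itself: the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for illustration (the text does not state the thresholds used), the names `find_ssrs` and `is_primitive` are hypothetical, and the scan is non-overlapping per motif length, so it may miss repeats that start inside a skipped homopolymer run.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative only).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. not 'ATAT')."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for di- to hexanucleotide SSRs."""
    sequence = sequence.upper()
    found = []
    for unit_len, minimum in sorted(min_repeats.items()):
        # A unit of unit_len bases followed by at least (minimum - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, minimum - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):  # drops homopolymers and shorter-unit repeats
                repeats = len(match.group(0)) // unit_len
                found.append((match.start(), motif, repeats))
    return sorted(found)


# Hypothetical sequence, not a unigene from the study.
toy = "GG" + "AT" * 7 + "CC" + "CAG" * 5 + "TT"
print(find_ssrs(toy))
```

The primitive-motif check is what keeps mononucleotide runs out of the results, mirroring the decision in the text to exclude them.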

Results and Discussion

De novo assembly of C. japonicus transcriptome

Illumina HiSeq 4000 paired-end sequencing was performed on a cDNA library constructed from C. japonicus adults. A total of 67.55 million raw reads (33,772,889 read pairs; 10,199,412,478 bp) were processed. Adapter trimming and the removal of contaminating and low-quality sequences yielded 9,500,253,195 bp of filtered reads (Table S1). This corresponds to 6.9% of the raw bases being discarded and an average trimmed read length of 140.6 bp. After quality control, 99.58% of the reads (67.26 million) were retained as high-quality reads, with average and N50 lengths of 139.4 bp and 151 bp, respectively. The high-quality reads were assembled into 174,853 contigs, the largest of which was 26,683 bp. About 38.59% of contigs were ≥500 bp. The average length, N50 length, and GC% of contigs were 773.6 bp, 1337 bp, and 36.34%, respectively. Of the total contigs obtained, 82,133 sequences were predicted as protein-coding using the TransDecoder program. The mean length and N50 length improved to 1454.7 bp and 2520 bp, respectively, while the GC% was 38.35%. After this analysis, 65.88% of sequences were ≥500 bp. Clustering of the putative protein-coding sequences resulted in 20,129 sequences (37,631,641 bases), called unigenes, with average and N50 lengths of 1869.5 bp and 2738 bp, respectively. The unigenes ranged in length from 140 bp to 26,683 bp. A summary of the de novo assembly, TransDecoder analysis, and clustering is given in Table 1.

Table 1. Overall statistical analysis of Cybister japonicus transcriptome obtained after Illumina sequencing, de novo analysis and TransDecoder-based redundancy reduction of unigenes
Sequencing
Raw reads
- Number of sequences 67,545,778
- Number of bases 10,199,412,478
Clean reads
- Number of sequences 67,260,666
- Number of bases 9,374,578,390
- Average length of read (bp) 139.4
- N50 length of read (bp) 151
- GC % of read 41.91
High-quality reads (%) 99.58 (sequences), 91.91 (bases)
Contig information
- Total number of contig 174,853
- Number of bases 135,271,568
- Mean length of contig (bp) 773.6
- N50 length of contig (bp) 1,337
- GC % of contig 36.34
- Largest contig (bp) 26,683
- No. of large contigs (≥500 bp) 67,473
After TransDecoder analysis
- Total number of sequence 82,133
- Number of bases 119,475,586
- Mean length of sequence (bp) 1,454.7
- N50 length of sequence (bp) 2,520
- GC % of sequence 38.35
- Largest sequence (bp) 26,683
- No. of large sequences (≥500 bp) 54,112
Unigene information
- Total number of unigenes 20,129
- Number of bases 37,631,641
- Mean length of unigene (bp) 1,869.5
- N50 length of unigene (bp) 2,738
- GC % of unigene 37.98
- Length ranges (bp) 140–26,683
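
The derived figures in Table 1 (mean lengths and percentages) follow from the counts of sequences and bases, and can be recomputed as a quick consistency check. The sketch below only uses numbers copied from Table 1; the variable and function names are hypothetical.

```python
# Counts copied from Table 1.
raw_reads, raw_bases = 67_545_778, 10_199_412_478
clean_reads, clean_bases = 67_260_666, 9_374_578_390
contigs, contig_bases, large_contigs = 174_853, 135_271_568, 67_473
td_seqs, td_bases, large_td = 82_133, 119_475_586, 54_112
unigenes, unigene_bases = 20_129, 37_631_641


def pct(part, whole):
    """Percentage of part in whole."""
    return 100 * part / whole


def mean_length(bases, sequences):
    """Mean sequence length in bp."""
    return bases / sequences


summary = {
    "high-quality reads (% sequences)": pct(clean_reads, raw_reads),
    "high-quality reads (% bases)": pct(clean_bases, raw_bases),
    "mean clean read length (bp)": mean_length(clean_bases, clean_reads),
    "mean contig length (bp)": mean_length(contig_bases, contigs),
    "contigs >=500 bp (%)": pct(large_contigs, contigs),
    "mean TransDecoder length (bp)": mean_length(td_bases, td_seqs),
    "TransDecoder >=500 bp (%)": pct(large_td, td_seqs),
    "mean unigene length (bp)": mean_length(unigene_bases, unigenes),
}

for label, value in summary.items():
    print(f"{label}: {value:.2f}")
```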

Further, we analysed the size distribution of the assembled contigs, TransDecoder sequences, and unigenes (Fig. 1). Most contigs were ≤500 bp (107,521 sequences out of 174,853). The number of contigs decreased gradually with increasing contig length up to 2000 bp. Only 14,370 contigs (~8.22% of sequences) were >2001 bp (Fig. 1A). Only 34.17% of sequences ≤500 bp were found to be putatively protein-coding. A total of 19,634 sequences (23.58% of the total protein-coding sequences) were >2001 bp (Fig. 1B). Further, 6544 unigenes (32.51%) out of a total of 20,129 sequences were >2001 bp. Only 15.85% of unigenes were ≤500 bp (Fig. 1C).

Figure 1. Size distribution of transcriptome assembly sequences. (A) Contigs; (B) TransDecoder-corrected sequences; (C) non-redundant unigenes. The size distribution of contigs showed that the majority of contigs were <500 bp. A significant reduction in the number of sequences <500 bp was noticed after applying TransDecoder. The majority of non-redundant unigenes were >2001 bp.

Sequence annotation of C. japonicus unigenes

For annotation, the unigenes were queried against public protein and nucleotide databases using blastx (BLAST, the basic local alignment search tool) at an E-value <1e-5. The locally curated database PANM-DB version 2.0, October 2016 release (http://malacol.or.kr/blast/PANM.html), was preferred over the NCBI nr database. With PANM-DB, transcriptome processing across the phyla Arthropoda, Nematoda, and Mollusca can be conducted with greater speed and accuracy than with the NCBI nr database (Kang et al. 2016b). PANM-DB annotated 18,656 of the 18,743 annotated transcripts, the highest annotation efficiency among all the databases used. Of the sequences annotated in PANM-DB, 65.8% were ≥1000 bp. In total, 15,414 (76.58%), 15,662 (77.81%), 12,317 (61.19%), and 11,151 (55.4%) unigenes were annotated to the Swiss-Prot, COG, GO, and InterProScan databases, respectively (Table 2). A total of 1133 unigenes were annotated to the KEGG database, suggesting the association of these sequences with functional pathways. Further, 38.66% of transcripts were annotated to the UniGene database of nucleotide sequences. To summarize, annotation hits were found for 18,743 of the 20,129 C. japonicus unigenes. Of these 18,743 transcripts, 65.59%, 31.05%, and 3.36% were ≥1000 bp, 300–1000 bp, and ≤300 bp, respectively.

Table 2. Annotation of Cybister japonicus unigenes against public protein and nucleotide databases. The annotation has been classified based on the size distribution of unigenes
Databases All transcripts ≤300 bp 300–1000 bp ≥1000 bp
PANM-DB 18,656 621 5,760 12,275
UniGene 7,782 160 1,362 6,260
Swiss-Prot 15,414 371 3,950 11,093
COG 15,662 376 4,092 11,194
GO 12,317 277 2,914 9,126
KEGG 1,133 9 158 966
InterProScan 11,151 30 1,757 9,364
ALL 18,743 629 5,820 12,294
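
The rows of Table 2 can be checked for internal consistency, and the annotation rates quoted in the text can be recomputed from them. The sketch below uses only the counts in Table 2 and the total of 20,129 unigenes; the names `TABLE_2`, `check_rows`, and `annotation_rate` are hypothetical.

```python
TOTAL_UNIGENES = 20_129

# (all transcripts, <=300 bp, 300-1000 bp, >=1000 bp), copied from Table 2.
TABLE_2 = {
    "PANM-DB": (18_656, 621, 5_760, 12_275),
    "UniGene": (7_782, 160, 1_362, 6_260),
    "Swiss-Prot": (15_414, 371, 3_950, 11_093),
    "COG": (15_662, 376, 4_092, 11_194),
    "GO": (12_317, 277, 2_914, 9_126),
    "KEGG": (1_133, 9, 158, 966),
    "InterProScan": (11_151, 30, 1_757, 9_364),
    "ALL": (18_743, 629, 5_820, 12_294),
}


def check_rows(table):
    """Return the names of rows whose length bins do not sum to the 'All' column."""
    return [name for name, (total, *bins) in table.items() if sum(bins) != total]


def annotation_rate(table, n_unigenes=TOTAL_UNIGENES):
    """Percentage of all unigenes annotated by each database."""
    return {name: 100 * row[0] / n_unigenes for name, row in table.items()}


print("Inconsistent rows:", check_rows(TABLE_2))
for name, rate in annotation_rate(TABLE_2).items():
    print(f"{name}: {rate:.2f}%")
```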

A Venn diagram of the shared and unique unigenes of C. japonicus according to the PANM, Swiss-Prot, UniGene, and COG databases is shown in Figure 2. The maximum number of uniquely annotated transcripts (2394 sequences) was observed with PANM-DB, compared with 35, 9, and 6 sequences for the UniGene, Swiss-Prot, and COG databases, respectively. A total of 7454 transcripts were annotated in all four databases. Further, the three protein databases shared 7489 unigenes. The annotation results confirm the utility of PANM-DB as a potent resource for annotating specific taxa such as the arthropod C. japonicus. The database has been used in previous studies to characterize de novo assembled unigenes of molluscs and arthropods alike (Park et al. 2016; Seong et al. 2016; Kang et al. 2016a). While all the earlier reports used version 1.0 of the database (Kang et al. 2015), the present study used the latest version, released with updated sequences (Kang et al. 2016b). In fact, the latest release contains nearly twice as many sequences (7,571,246) as the previous version (4,051,323 sequences).

Figure 2. Venn diagram of shared and unique unigenes in C. japonicus. The non-redundant unigenes were taken as query sequences and annotated against public databases such as Protostome-DB (PANM-DB), Swiss-Prot, UniGene, and COG databases.

Homology matching of C. japonicus unigenes

The E-value, identity, and similarity distributions, together with the numbers of unigene hits and non-hits, were examined for C. japonicus unigenes matched to homologous protein sequences in PANM-DB (Fig. 3). The largest proportion of unigenes showed an E-value of 1E-50 – 1E-5 (32%), followed by an E-value of 0 (29%), 1E-100 – 1E-50 (20%), and 1E-150 – 1E-100 (13%) (Fig. 3A). In the identity distribution analysis, most of the unigenes showed an identity of 40–80% with homologous sequences in PANM-DB. About 39% and 31% of unigenes showed an identity of 40–60% and 60–80%, respectively (Fig. 3B). About 46% of unigenes were 60–80% similar to proteins in the database (Fig. 3C). The number of unigene non-hits decreased considerably with increasing unigene length, suggesting that longer unigenes are more likely to contain conserved domains (Fig. 3D). Further, 27% of the PANM-DB annotated unigenes showed their best match (blast results) to the coleopteran beetle, Tribolium castaneum (Fig. 4). Another 14% of unigenes matched the burying beetle, Nicrophorus vespilloides. Among the best-matched beetles, the T. castaneum genome sequence is available, with the assembly encoding 16,500 genes (Tribolium genome consortium, 2007). Further, expressed sequence tags (ESTs) and cDNA transcriptomes of the beetle have been reported and have improved the understanding of its genome (Park et al. 2008; Morris et al. 2009; Altincicek et al. 2013). Similarly, the transcriptome of N. vespilloides has been sequenced using the Illumina platform, identifying genes involved in antimicrobial immunity (Palmer et al. 2016).

Figure 3. Homology matches of C. japonicus unigenes in specific defined features when annotated against the PANM database. (A) E-value Distribution; (B) Identity Distribution; (C) Similarity Distribution; (D) Number of unigene hits vs. non-hits.
Figure 4. Pie model depicting Top-hit species distribution. The majority of unigene hits was with Tribolium castaneum sequences in the PANM database.

Functional annotations of C. japonicus unigenes

We annotated the C. japonicus unigenes based on the COG classification, which assigns unigenes to 25 functional categories. For the C. japonicus unigenes, the top categories included general function prediction (21% of unigenes), signal transduction mechanisms (8.7%), function unknown (7.7%), and post-translational modifications, biosynthesis, transport and catabolism (5.3%) (Fig. 5). About 19.9% of C. japonicus unigenes were annotated under the multi-category. The least represented COG categories were cell motility (0.2%), nuclear structure (0.3%), co-enzyme transport and metabolism (0.5%), cell wall/membrane/envelope biogenesis (0.7%), and defense mechanisms (0.7%). GO functional predictions for the 12,317 GO-annotated unigenes showed a maximum of 4247 sequences classified only under the ‘molecular function’ category, followed by 1025 under the ‘cellular component’ and 412 under the ‘biological process’ category (Fig. 6). The three-way Venn diagram also shows 3374 unigenes shared between the ‘biological process’ and ‘molecular function’ categories. A total of 2239 unigenes were shared among all three GO functional categories. Only 531 and 489 unigenes were shared between ‘biological process’ and ‘cellular component’, and between ‘cellular component’ and ‘molecular function’, respectively. About 31.44% of C. japonicus unigenes were annotated to a single GO term, closely followed by 30.65% annotated to two GO terms. Under the ‘biological process’ category, the unigenes were predominantly classified to metabolic process (3545 unigenes), cellular process (3449 unigenes), and single-organism process (2099 unigenes) (Fig. 7A), while under the ‘cellular component’ category most unigenes were assigned to the cell (1882 unigenes), cell part (1878 unigenes), and membrane (1607 unigenes) sub-categories (Fig. 7B). The predominant sub-categories under the ‘molecular function’ category were binding (6151 unigenes) and catalytic activity (3437 unigenes) (Fig. 7C).
Classification of unigenes to GO term categories is only suggestive of predicted function and is not a direct representation of actual function. GO term annotations are associated with evidence codes (EC), and most of these (over 95%) are computationally derived, such as ‘inferred from electronic annotation (IEA)’, ‘inferred from sequence or structural similarity’, and ‘inferred from reviewed computational analysis (RCA)’ (Rhee et al. 2008). The EC distribution of C. japonicus unigenes also suggests that over 98% of sequence annotations to GO terms were inferred from electronic annotation (data not shown). The EC describes the type of support that links the unigenes to the GO ontologies. Hence, ECs such as ‘inferred from direct assay (IDA)’ and ‘inferred from genetic interaction’ provide stronger evidence for the gene products annotated to the GO molecular function, cellular component, and biological process ontologies than IEA does (Hill et al. 2008).
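
The GO Venn counts quoted above can be checked against the total of 12,317 GO-annotated unigenes. The sketch below assumes that each reported number is an exclusive region of the three-way Venn diagram (for example, 3374 unigenes in ‘biological process’ and ‘molecular function’ but not ‘cellular component’); this reading is an assumption, although it is the one under which the regions add up to the reported total. The names `REGIONS` and `category_total` are hypothetical.

```python
BP, CC, MF = "biological process", "cellular component", "molecular function"

# Exclusive Venn regions, as read from the text (assumption: no double counting).
REGIONS = {
    frozenset([MF]): 4247,
    frozenset([CC]): 1025,
    frozenset([BP]): 412,
    frozenset([BP, MF]): 3374,
    frozenset([BP, CC]): 531,
    frozenset([CC, MF]): 489,
    frozenset([BP, CC, MF]): 2239,
}


def category_total(regions, category):
    """Number of unigenes carrying at least one term from the given category."""
    return sum(count for members, count in regions.items() if category in members)


go_annotated = sum(REGIONS.values())
print("GO-annotated unigenes:", go_annotated)
for category in (BP, CC, MF):
    print(category, category_total(REGIONS, category))
```

Under this reading, the per-category totals are derived values rather than figures reported in the paper, so they should be treated as a cross-check only.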

Figure 5. COG function classification of the C. japonicus unigenes. The percentage of unigenes is plotted against the COG functional classes. In total, 15,662 unigenes were classified into 25 different functional classes.
Figure 6. Gene Ontology (GO) distribution for the C. japonicus unigenes. The number of unigenes is plotted against the number of GO terms. The unigenes were annotated in three main categories: Biological process, Cellular component, and Molecular function.
Figure 7. Gene Ontology classified annotation of the C. japonicus unigenes. The unigene sequences classified under (A) Biological Process, (B) Cellular Component, and (C) Molecular Function are shown.

To identify the active biological pathways in C. japonicus, we mapped the unigene sequences to the reference canonical pathways in the KEGG database (Table S2). A total of 2695 sequences were assigned to 115 KEGG pathways belonging to the metabolism, organismal systems, environmental information processing, and genetic information processing categories (Table S3). Almost 90% of sequences were classified to metabolism pathways, most predominantly nucleotide metabolism (731 sequences), metabolism of cofactors and vitamins (437 sequences), and carbohydrate metabolism (294 sequences). Purine metabolism was the most populated pathway within nucleotide metabolism, and thiamine metabolism within metabolism of cofactors and vitamins. Under the secondary metabolites category, the metabolism of terpenoids and polyketides contained fewer sequences. These pathways may be required to establish ecological dependence, interactions, and evolutionary relationships (Pankewitz and Hilker 2008). Two sequences were classified to the ubiquinone and other terpenoid-quinone biosynthesis pathway, consistent with the detection of methylhydroquinone, toluquinone, and 2,3-dimethylquinone in the larval transcriptome of the carabid beetle, Chlaenius cordicollis (Holliday et al. 2015). In the organismal systems category, sequences were classified to immune system pathways (194 sequences, 98 of which were classified to the T-cell receptor signalling pathway). The 2695 sequences annotated to the KEGG Orthology categories include 660 sequences with Enzyme Commission (EC) numbers, of which 143 belonged to carbohydrate metabolism. Further, InterPro domain analysis also provided insight into the putative functions of the unigenes (Table 3). The most conspicuously represented domains within the sequences were the zinc-finger C2H2 type, protein kinase type, and leucine-rich repeat type domains.
These domains are ubiquitous in regulatory proteins with metabolic or immune functions. Zinc fingers are small, repeated protein motifs that assist in protein–protein contacts (Gamsjaeger et al. 2007). C2H2-type zinc fingers are the most common DNA-binding motifs found in eukaryotic transcription factors and can also bind RNA and protein targets (Brayer and Segal 2008). Such motifs have been identified in the global transcriptomes of many arthropods, including the pine-tip moth, Rhyacionia leptotubula (Zhu et al. 2013), the cotton boll weevil, Anthonomus grandis (Firmino et al. 2013), and the Colorado potato beetle, Leptinotarsa decemlineata (Kumar et al. 2014). Protein kinase type domains are characteristic of kinases that mediate many immune and metabolic signalling pathways in the intracellular milieu. The protein kinase domain, along with the zinc finger C2H2-type and RNA recognition motif domains, has been noted in the salivary gland transcriptome of the potato leafhopper, Empoasca fabae (DeLay et al. 2012). The leucine-rich repeat region is prominent in pattern-recognition receptors, such as the Toll receptors, where it serves as a molecular signature for the recognition of microbes (Akira et al. 2006). Overall, the C. japonicus transcripts belonged to putative proteins with interaction motifs that participate in immune and metabolic signalling pathways.

Table 3. InterProScan domain analysis of Cybister japonicus unigenes. The top 20 domains are listed with their InterPro ID and number of unigene hits.
Domain ID  Description                                          Unigenes
IPR027417  P-loop containing nucleoside triphosphate hydrolase  566
IPR007087  Zinc finger, C2H2                                    421
IPR015880  Zinc finger, C2H2-like                               390
IPR013087  Zinc finger C2H2-type/integrase DNA-binding domain   362
IPR011009  Protein kinase-like domain                           339
IPR000719  Protein kinase domain                                284
IPR016024  Armadillo-type fold                                  269
IPR011993  PH domain-like                                       251
IPR013083  Zinc finger, RING/FYVE/PHD-type                      251
IPR015943  WD40/YVTN repeat-like-containing domain              236
IPR012677  Nucleotide-binding alpha-beta plait domain           225
IPR032675  Leucine-rich repeat domain, L domain-like            217
IPR017986  WD40-repeat-containing domain                        216
IPR017441  Protein kinase, ATP binding site                     209
IPR011989  Armadillo-like helical                               208
IPR000504  RNA recognition motif domain                         200
IPR001680  WD40 repeat                                          186
IPR008271  Serine/threonine-protein kinase, active site         185
IPR016040  NAD(P)-binding domain                                167
IPR001611  Leucine-rich repeat                                  159

Identification of SSRs

The Illumina 4000-based transcriptome data provided an excellent resource for the identification of SSR markers in the C. japonicus transcripts. SSR markers in cDNA sequences have been used for gene polymorphism and population genetic studies. Because these markers are transferable across species and can be obtained faster than with conventional approaches (including the hybrid capture method, loci selection from available genetic information, and loci transferable from closely related species), they are a potent resource for molecular ecologists and conservation biologists (Karaiskou et al. 2008; Uliano-Silva et al. 2014). Of the 20,129 unigene sequences for C. japonicus, 12,491 were analysed for SSR identification. We identified 1968 SSRs in 1349 of these sequences; the SSRs ranged from dinucleotide to hexanucleotide motifs (repeat units of 2 to 6 nucleotides). A total of 343 sequences contained more than one SSR. As a precaution against misrepresentation, we excluded mononucleotide repeats, which may have arisen from homopolymer errors on the Illumina platform. Dinucleotide repeats were the most abundant, followed by tri- and tetranucleotide repeats. All information on the SSRs screened from the C. japonicus unigenes is provided in Table S4. Using BatchPrimer3 (You et al. 2008), we designed primer pairs flanking the SSR motifs under the default parameters: primer length of 18–23 nucleotides, PCR product size of 100–300 bases, Tm of 50–70°C, and primer GC content of 30–70%.
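The screening step described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names are hypothetical, only perfect repeats are detected, and the minimum iteration thresholds are an assumption inferred from the smallest iteration classes reported for each motif length (six for di-, five for tri-, and four for tetra- to hexanucleotide repeats).

```python
import re

# ASSUMPTION: minimum iterations per motif length. The text does not state the
# thresholds; these match the smallest iteration classes it reports
# (six for di-, five for tri-, four for tetra- to hexanucleotide repeats).
MIN_ITERATIONS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_periodic(motif):
    """True if the motif is itself built from a shorter repeated unit.

    Covers homopolymers (AA, TTT) and cases such as ATAT, which would
    otherwise be counted a second time as a tetranucleotide repeat.
    """
    return motif in (motif + motif)[1:-1]


def find_ssrs(sequence, min_iterations=MIN_ITERATIONS):
    """Return (start, motif, iterations) for perfect di- to hexanucleotide SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_iter in sorted(min_iterations.items()):
        # One unit of unit_len bases followed by at least min_iter - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_iter - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_periodic(motif):
                continue  # mononucleotide runs and redundant motifs are excluded
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return hits


# Hypothetical toy sequence: (AT)6, (AAT)5 and a 12-base poly-A run.
toy = "GGC" + "AT" * 6 + "CCG" + "AAT" * 5 + "G" + "A" * 12
for start, motif, iterations in find_ssrs(toy):
    print(start, motif, iterations)
```

On the toy sequence, the poly-A run is skipped, mirroring the exclusion of mononucleotide repeats, while the dinucleotide and trinucleotide tracts are reported with their start position and iteration count.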

Further, as shown in Figure 8A, dinucleotide repeats were most frequent at six iterations (419 repeats), followed by seven and eight iterations (214 and 148 repeats, respectively). Trinucleotide repeats were most frequent at five iterations, whereas tetra-, penta-, and hexanucleotide repeats were most frequent at four iterations. Among the dinucleotide repeat motifs, AT/AT (574 repeats) was the most predominant type, followed by AC/GT (413 repeats). Among the trinucleotide repeat motifs, AAT/ATT was the most predominant, with 216 repeats (Fig. 8B).
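Motif labels such as AC/GT and AAT/ATT are read here as pairing a motif with its reverse complement, so that the same repeat seen on either strand or in a different register falls into one class. The sketch below shows one way such a tally could be made; the grouping rule is an assumption, and the function names and example motifs are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def smallest_rotation(motif):
    """Alphabetically smallest cyclic rotation, e.g. TA -> AT, TTA -> ATT."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_class(motif):
    """Label a motif together with its reverse complement, e.g. TG -> 'AC/GT'."""
    motif = motif.upper()
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    pair = sorted([smallest_rotation(motif), smallest_rotation(reverse_complement)])
    return "/".join(pair)


# Hypothetical motifs, as they might be read off individual SSR hits.
observed = ["AT", "TA", "AC", "TG", "GT", "AAT", "TTA", "TAA"]
counts = Counter(motif_class(m) for m in observed)
for label, n in counts.most_common():
    print(label, n)
```

Taking the smallest rotation before pairing means that TA and AT, or TTA and ATT, are not counted as separate motif types.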

Microsatellite detection in the unigenes of C. japonicus. (A) Number of repeats by number of iterations. Dinucleotide repeats with six iterations and trinucleotide repeats with five iterations were predominant. (B) The repeat motif types identified in the unigenes.

Conclusions

This is the first exhaustive survey of transcriptomic resources for the threatened beetle C. japonicus, which was once used in insect conservation plans. We used the Illumina 4000 sequencing platform to generate the transcriptome reads, applied de novo assembly and the TransDecoder program to identify putative protein-coding genes, and annotated these against public databases for functional classification and the identification of adaptation-related genes. The transcripts were assigned functional categories, and an important group of transcripts underlying adaptation phenotypes in the species was identified. We have also screened SSR markers from the unigenes that should be useful for assessing species diversity.

Acknowledgments

This work was supported by the grants entitled “The Genetic and Genomic evaluation of Indigenous Biological Resources” funded by the National Institute of Biological Resources (NIBR201503202) and “Analysis of genetic characteristics of endangered species” funded by the National Research Foundation (NRF-2017R1D1A3B06034971), and by the Soonchunhyang University Research Fund.
