In-silico identification of putatively functional intergenic small open reading frames in the cucumber genome and their predicted response to biotic and abiotic stresses
Abstract
The availability of high-throughput sequencing technologies has increased our understanding of different genomes. However, the genomes of all living organisms still have many unidentified coding sequences. Many small open reading frames (sORFs) are missed because of the length threshold used in most gene identification tools; this holds in genic regions and, more importantly and surprisingly, in intergenic regions. Scanning the intergenic regions of the cucumber genome revealed 420 723 sORFs. We excluded 3850 sORFs with similarity to annotated cucumber proteins. To assess the functionality of the remaining 416 873 sORFs, we calculated their codon adaptation index (CAI). We found 398 937 novel sORFs (nsORFs) with CAI ≥ 0.7, which were used for downstream analysis. Searching against the Rfam database revealed 109 nsORFs similar to multiple RNA families. Using SignalP-5.0 and NLSdb, we identified 11 592 signal peptides. Five predicted proteins interacting with Meloidogyne incognita and powdery mildew proteins were selected using published transcriptome data of host-pathogen interactions. Gene ontology enrichment was used to interpret the function of those proteins, illustrating that nsORF expression could contribute to the cucumber's response to biotic and abiotic stresses. This research highlights the importance of previously overlooked nsORFs in the cucumber genome and provides novel insights into their potential functions.
1 INTRODUCTION
Before the genomics era, estimating the number of genes per genome was uncertain, depending on the complexity of the genome under study. The human genome, for example, was expected to contain over 100 000 genes (Schuler et al., 1996). Surprisingly, after genome sequencing and annotation, this figure dropped to approximately a quarter of its original estimate (Stein, 2004). Therefore, the question that arouses the most interest in higher species is how an organism with a relatively small number of genes can perform a wide range of functions. Some of the answers to this question were the epistatic interactions, alternative splicing machinery, non-protein coding genes (Taft et al., 2007), and, most importantly, the dozens of coding genes yet to be annotated (Cheng et al., 2017).
Most software and tools used for gene annotation apply a length threshold below which coding sequences are ignored. This approach produces many false negatives among small open reading frames (sORFs) shorter than 300 nucleotides. However, several sORFs of 100–300 nucleotides, encoding 30–100 amino acids (a.a.), are functional in many taxa, such as S. cerevisiae (Kastenmayer et al., 2006) and D. melanogaster (Magny et al., 2013). Numerous coding sORFs have already been identified in the intergenic regions of various species, including humans (Jain et al., 2023; Slavoff et al., 2013), Drosophila (Ladoukakis et al., 2011), C. elegans (Casimiro-Soriguer et al., 2020), rice (Stolc et al., 2005), and maize and Arabidopsis (Wang et al., 2020). Therefore, investigating intergenic regions should add valuable knowledge about how many functions are controlled.
In Arabidopsis, 7159 sORFs probable coding sequences were discovered (Hanada et al., 2007), with 2996 coding sORFs presumably expressed in at least one experimental condition. The same authors discovered 3241 coding sORFs in the Arabidopsis genome that show transcription or purifying selection indications, indicating that they likely represent new genes. The pls mutant of the POLARIS gene, which encodes a predicted polypeptide of 36 a.a, in Arabidopsis exhibited a short root phenotype and impaired leaf vascularisation (Casson et al., 2002). C-terminally encoded peptide (CEP), a 15-amino acid posttranslational peptide discovered in Arabidopsis, is a long-distance root-to-shoot signalling molecule in N-starvation conditions and is essential for lateral root development and nodulation (Aggarwal et al., 2020; Taleski et al., 2018). A transcriptome study of the rice response to iron excess or deficiency revealed three and 90 upregulated sORF in roots and 1076 and 50 upregulated sORF in shoots, respectively (Bashir et al., 2014).
sORFs can be categorised into five groups: (1) intergenic sORFs (found between genes), (2) sORFs found in the upstream 5′ untranslated region (UTR) of an mRNA (uORFs), (3) sORFs found in long noncoding RNAs (lncORFs), (4) short coding sequences (short CDS) and (5) short isoforms resulting from alternative splicing of other coding genes (Ong et al., 2022). Recently, an additional type of sORF has been identified in the primary hairpin structure of microRNAs (pri-miRNAs) in several plant species and fewer animal species (Erokhina et al., 2023). The small peptides encoded by pri-miRNAs are called miPEPs (Dong et al., 2023).
Gene expression and gene prediction are significantly influenced by codon content (Powell and Moriyama, 1997; Salamov and Solovyev, 2000). Although synonymous codons encode for the same amino acid, they are used at different frequencies in most genes and organisms, a phenomenon called codon usage bias (CUB). Increased CUB in a genomic region indicates that it is being selected for characteristics that influence its expression levels, including translation, transcription, mRNA stability, and co-translational protein folding. Numerous, easy, and computationally efficient CUB indices reflect the expression pattern by analysing the sequenced genomes of different organisms. Such indices provide information on factors that cannot be measured experimentally (Bahiri-Elitzur and Tuller, 2021). Of those indices, relative synonymous codon usage (RSCU) determines whether a specific codon is used more frequently than anticipated (Khandia et al., 2019). Another index, the codon adaptation index (CAI), is based on codon frequency in a reference set of genes (Sharp and Li, 1987) and significantly correlates with protein abundance and mRNA levels across the genome (Zhou et al., 2016).
The primary determinant of a protein's function is its subcellular location. In silico localisation can be achieved by sequence homology to verified and localised proteins (Adelfio et al., 2013). Two more widely used techniques look for GO features related to specific localisations (Mei, 2012) or for motifs recognised by the receptors of the protein transport machinery (Lin et al., 2011). Recent in silico tools include SignalP-5.0 (Almagro Armenteros et al., 2019), nuclear localisation signals (NLSs) (Nair, 2003), and MULocDeep (Jiang et al., 2021, 2023).
Synthesised proteins are translocated to the endoplasmic reticulum and its derivatives, such as the Golgi complex, vacuoles and storage protein bodies, or to the nucleus, mitochondria and plastids (Rozov and Deineko, 2022). Signal peptides (SPs) that facilitate localisation through the first route differ in structure but generally consist of 20–30 aa with three domains: an N-terminal positively charged domain of 1–7 aa, a hydrophobic central domain of 7–15 aa, and a C-terminal polar domain of 3–7 aa (von Heijne, 1990). SPs responsible for protein translocation to the nucleus, mitochondria or chloroplast can be identified using SignalP-5.0. Two pathways mediate protein translocation across the plasma membrane and endoplasmic reticulum: the general secretory pathway (Sec) and the twin-arginine translocation pathway (Tat). During protein translocation, SPs are removed by signal peptidase I (SPaseI) (Voulhoux, 2001).
Proteins do not function alone in cells; they interact with other proteins to perform the complex biological processes inside the cell. Protein-protein interactions (PPIs) take place either within the same species (intra-species) or between host and pathogen (inter-species), the latter known as host-pathogen interactions (HPIs). Experimental proteomics alone cannot fully resolve such interactions; therefore, multiple bioinformatics algorithms have been used to decipher such mechanisms (Kaundal et al., 2022; Loaiza and Kaundal, 2021; Loaiza et al., 2020). HPIs are drawing more attention as they can point to therapeutics for multiple plant diseases (Han et al., 2023).
The current study aimed to improve the annotation of the cucumber genome by finding all the expressed sORFs in the intergenic regions and trying to interpret their functions. Out of 420 723 intergenic sORFs, we excluded 3850 with sequences similar to annotated cucumber proteins. The remaining 416 873 novel intergenic sORFs (nsORFs) were further investigated for their possible transcription and translation.
2 MATERIALS AND METHODS
2.1 In silico identification of sORFs in intergenic regions
Our analysis is mainly based on the 9930v3 reference genome. The third version of the genome assembly of the Chinese long inbred line 9930 (Li et al., 2019) was obtained from NCBI. We extracted the intergenic regions from the cucumber genome (seven nuclear, one chloroplast, and three mitochondrial chromosomes) using a customised script and BEDTools (Quinlan and Hall, 2010). We then used EMBOSS (Rice et al., 2000) to identify sORFs in the intergenic regions using 30 and 100 amino acids as minimum and maximum length thresholds, respectively (Hanada et al., 2007). For comparison, we used the Gy14v2.1 reference genome.
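The length-bounded ORF scan described above can be illustrated with a minimal sketch. It is not the EMBOSS implementation used in the study: it assumes an ORF runs from an ATG to the first in-frame stop codon, and the function names (`find_sorfs`, `reverse_complement`) are hypothetical. Only the 30 and 100 amino acid bounds come from the text.

```python
# Minimal sORF scan on both strands (illustrative; not EMBOSS getorf).
# Assumption: an ORF starts at ATG and ends at the first in-frame stop codon.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, min_aa=30, max_aa=100):
    """Return (strand, start, end, orf) tuples.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (for "-", positions on the reverse-complemented sequence).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    n_aa = (i - start) // 3        # residues, stop excluded
                    if min_aa <= n_aa <= max_aa:
                        hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ATG
    return hits
```

Applied to each extracted intergenic sequence, such a scan yields candidates on both strands, which is why forward and reverse counts are reported separately.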
2.2 Identification of novel sORFs (nsORFs)
The newly discovered intergenic sORFs were filtered to identify nsORFs that could be translated into novel proteins not curated in the database, that is, those with no sequence similarity to any recognised cucumber protein. In the filtration step, the nucleotide sequences of the intergenic sORFs were used as queries against the cucumber protein database acquired from UniProtKB (https://www.uniprot.org/proteomes) (Coudert et al., 2022) using a command-line blastx search with an e-value threshold of 1e-20.
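As a sketch of this filtration step, the snippet below separates sORF IDs into those with and without a protein hit at the stated e-value threshold. It assumes the blastx results were written in the default 12-column tabular format (`-outfmt 6`), where the first column is the query ID and the eleventh is the e-value; the output format actually used in the study is not stated, and the function names are hypothetical.

```python
def queries_with_hits(tabular_lines, max_evalue=1e-20):
    """Collect query IDs having at least one hit with e-value <= max_evalue.

    Assumes BLAST tabular output with the default 12 columns
    (column 1 = query ID, column 11 = e-value).
    """
    matched = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            matched.add(fields[0])
    return matched


def split_novel(all_ids, tabular_lines, max_evalue=1e-20):
    """Return (novel_ids, similar_ids) for the given sORF IDs."""
    matched = queries_with_hits(tabular_lines, max_evalue)
    novel = [q for q in all_ids if q not in matched]
    return novel, sorted(matched)
```

sORFs in the first list would be carried forward as nsORFs; those in the second correspond to the excluded set.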
2.3 Potential expression of nsORFs
As this study aims to identify potentially expressed intergenic sORFs, several steps of downstream analysis were performed on the nsORFs as follows.
2.3.1 Codon usage analysis
We used a set of cucumber genes coding for ribosomal proteins (RPs) as a highly expressed gene reference set; these genes' IDs are provided in the additional file (1). To ensure the expression of nsORFs, we used nsORFs with CAI ≥ 0.7 for the downstream steps.
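The CAI calculation can be sketched as follows, after Sharp and Li (1987): each codon receives a relative adaptiveness weight equal to its frequency in the reference set divided by the frequency of the most used synonymous codon, and the CAI of a sequence is the geometric mean of the weights of its codons. This is a minimal illustration rather than the EMBOSS routine used in the study; the pseudocount of 0.5 for unseen codons and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon relative to the top synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s)
                     if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, syn in families.items():
        if aa == "*" or len(syn) == 1:      # skip stops, Met and Trp
            continue
        top = max(counts[c] for c in syn)
        if top == 0:                        # amino acid absent from reference
            continue
        for c in syn:
            # Assumption: pseudocount of 0.5 for codons unseen in the reference.
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the scorable codons in seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

With weights derived from the ribosomal protein genes, nsORFs scoring at or above 0.7 would pass the filter described above.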
2.3.2 The similarity of nsORF sequences to noncoding and structural RNAs
Using the Infernal-1.1.2 programme and EBI cmscan search (Nawrocki and Eddy, 2013), the nsORFs with CAI ≥ 0.7 were queried against the Rfam open-source database (Kalvari et al., 2018), which contains information about noncoding and structural RNAs to identify any similarity between the nsORFs and the RNAs in the database.
2.3.3 Identifying signal peptides
The presence of SPs in any nsORFs is strong evidence of their expression. SignalP-5.0 (Almagro Armenteros et al., 2019) was used to predict the existence of SPs in amino acid sequences of the nsORFs (CAI ≥ 0.7) and their cleavage site. Nuclear localisation signals (NLS) and nuclear export signals (NES) were predicted using the NLSdb (https://rostlab.org/services/nlsdb/) (Nair, 2003).
2.3.4 Mining published cucumber transcriptome data
All RNAseq datasets generated from mRNAs represent all expressed fragments in the genome, whether from annotated genes or not. In a typical RNAseq analysis pipeline, only reads mapped to annotated genes are retained after the mapping step, while reads mapped elsewhere, for example to intergenic regions, are neglected. As our study aims to identify sORFs in those intergenic regions and prove their expression, we scanned published RNAseq data from earlier studies to find whether any of our nsORFs with CAI ≥ 0.7 are represented by sequencing reads. RNAseq data from four experiments studying the cucumber response to biotic (powdery mildew and nematodes) and abiotic (cold and salt) stresses were downloaded from the SRA database (Table 1). Single-end raw reads from the cold stress study and paired-end raw reads from the remaining three studies were filtered to obtain clean reads for further analysis. The quality of the reads was checked using FastQC (Andrews, 2010), and then Trimmomatic (Bolger et al., 2014) was used to eliminate low-quality reads and adaptor sequences. Finally, we used Bowtie2 (Langmead et al., 2009) to align the reads to the nsORFs, which were indexed as the reference.
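Once reads have been aligned to the nsORF reference, deciding which nsORFs are represented amounts to counting mapped reads per reference sequence. The sketch below does this for SAM text, using the standard SAM fields (column 2 is the bitwise FLAG, column 3 the reference name, and bit 0x4 marks an unmapped read). The threshold of one read (`min_reads=1`) is an assumption, as the study does not state a cut-off, and the function names are hypothetical.

```python
from collections import Counter


def mapped_read_counts(sam_lines):
    """Count mapped reads per reference sequence from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:                # not a complete alignment record
            continue
        flag, ref = int(fields[1]), fields[2]
        if flag & 0x4 or ref == "*":        # read is unmapped
            continue
        counts[ref] += 1
    return counts


def expressed_ids(sam_lines, min_reads=1):
    """IDs of reference sequences supported by at least min_reads reads.

    Assumption: min_reads=1 is illustrative, not the study's criterion.
    """
    counts = mapped_read_counts(sam_lines)
    return {ref for ref, n in counts.items() if n >= min_reads}
```

For large alignment files the lines can be streamed from an open file handle, so the whole file never needs to be held in memory.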
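Study | Cucurbit genome atlas IDs | Library Layout | Treatments | SRA IDs | Reference |
---|---|---|---|---|---|
P. mildew | PRJNA321023 | Paired-end | 0 dpi | SRR3488595 | Xu et al. (2017) |
P. mildew | PRJNA321023 | Paired-end | 2 dpi | SRR3488598 | Xu et al. (2017) |
Nematode | PRJNA419665 | Paired-end | 0 dpi | SRR6324165 | Wang et al. (2018) |
Nematode | PRJNA419665 | Paired-end | 1 dpi | SRR6324160 | Wang et al. (2018) |
Nematode | PRJNA419665 | Paired-end | 2 dpi | SRR6324158 | Wang et al. (2018) |
Nematode | PRJNA419665 | Paired-end | 3 dpi | SRR6324176 | Wang et al. (2018) |
Cold stress | PRJNA438923 | Single-end | 0 h | SRR6854681 | Nanda et al. (2023) |
Cold stress | PRJNA438923 | Single-end | 2 h | SRR6854687 | Nanda et al. (2023) |
Cold stress | PRJNA438923 | Single-end | 6 h | SRR6854693 | Nanda et al. (2023) |
Cold stress | PRJNA438923 | Single-end | 12 h | SRR6854699 | Nanda et al. (2023) |
Salt stress | PRJNA437579 | Paired-end | 0 dpi | SRR6821841 | Huang et al. (2019) |
Salt stress | PRJNA437579 | Paired-end | 1 dpi | SRR6821835 | Huang et al. (2019) |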
- Note: Cucurbit genome atlas IDs for each project and SRA accession numbers for each treatment are listed along with the references of each study.
2.4 Function prediction of putative proteins
In an attempt to de novo annotate the putative proteins predicted from the translation of nsORFs represented in the RNAseq data of the biotic stresses, we analysed the sequences of those proteins following two different routes: (1) studying the HPI by searching for any PPI between predicted cucumber proteins and nematode or powdery mildew proteins, and (2) predicting protein function using gene ontology.
2.4.1 Predicting host-pathogen protein-protein interactions (PredHPI)
For HPIs, the PredHPI tool (Loaiza and Kaundal, 2021), specifically the Interolog module, was used to analyse the PPIs between the two pathogens and the putative cucumber proteins translated from the previously identified sORFs expressed in the transcriptome data of the infected treatments. Protein sequences of the two pathogens, Meloidogyne incognita and Podosphaera fusca, were downloaded from the UniProtKB database (https://www.uniprot.org/proteomes/) on the 5th of November 2023. We used the unreviewed (TrEMBL) UniProtKB entries, which comprise 43 718 proteins for M. incognita and 38 proteins for P. fusca.
To filter the 43 718 nematode proteins for further HPI analysis, we used the ApoplastP tool (Sperschneider et al., 2018) to check which nematode proteins are apoplastic. On the other hand, as the number of P. fusca proteins on the UniProtKB database is very low (38 proteins), we used all of them for the HPI analysis. ApoplastP is a machine-learning tool that predicts the localisation of pathogen effectors and plant proteins to the plant apoplast.
2.4.2 Gene ontology (GO terms): FFPred3
The feature-based function prediction for all GO domains (FFPred 3) (Cozzetto et al., 2016), a machine learning approach, was used for protein function prediction. FFPred facilitates the assignment of GO classes (cellular components) to the queried putative proteins using SVM classifiers. To predict the function of the putative proteins expressed from sORFs showing HPIs and obtained from the last step, they were submitted to FFPred3 to assign their GO terms.
2.4.3 Cellular localisation of the predicted cucumber proteins
To localise the five cucumber proteins showing PPIs with the nematodes and powdery mildew proteins, we used MULocDeep, a web service for protein localisation prediction at the subcellular level, which is based on a species-specific model with improved performance compared to other tools (Jiang et al., 2021, 2023).
3 RESULTS
3.1 In silico identification of sORFs in intergenic regions
The total size of the intergenic regions is ~116 Mbp, accounting for 51.60% of the entire chromosomes. Using EMBOSS, we identified 420 723 intergenic sORFs (Additional file 2) with a total length of 25 243 380 bp, representing 21.58% and 11.13% of the intergenic regions and cucumber chromosome sizes, respectively. Of the 420 723 identified sORFs, 413 537 were placed on the seven nuclear chromosomes, of which 206 729 and 206 808 were on the forward and reverse strands, respectively. The base compositions were A = 32.40%, C = 16.49%, G = 16.94%, T = 34.17% and GC = 33.42%. The cucumber mitochondrial genome consists of three chromosomes, in whose intergenic regions we identified 7039 sORFs; 3508 and 3531 sORFs were on the forward and reverse strands, respectively. The intergenic sequences of the chloroplast chromosome contained 147 sORFs, of which 75 and 72 were on the forward and reverse strands, respectively (Table 2). In contrast, the Gy14v2.1 reference genome was assembled into seven nuclear chromosomes, chromosome zero and 758 scaffolds, without annotated mitochondrial or chloroplast chromosomes. Therefore, we consider the Gy14v2.1 reference genome insufficiently annotated and unsuitable for our study, and only 9930v3.0 was used for further analysis.
Genome | Chr. Accession numbers | Forward strand | Reverse strand | Total no. |
---|---|---|---|---|
Nuclear | Chr. 1 (NC_026655.2) | 32876 | 33290 | 66166 |
Nuclear | Chr. 2 (NC_026656.2) | 23791 | 23617 | 47408 |
Nuclear | Chr. 3 (NC_026657.2) | 38082 | 38213 | 76295 |
Nuclear | Chr. 4 (NC_026658.2) | 27764 | 27571 | 55335 |
Nuclear | Chr. 5 (NC_026659.2) | 33232 | 33148 | 66380 |
Nuclear | Chr. 6 (NC_026660.2) | 28999 | 28580 | 57579 |
Nuclear | Chr. 7 (NC_026661.2) | 21985 | 22389 | 44374 |
Nuclear | Total | 206729 | 206808 | 413537 |
Mitochondrion | Chr. 1 (NC_016005.1) | 3222 | 3258 | 6480 |
Mitochondrion | Chr. 2 (NC_016004.1) | 186 | 177 | 363 |
Mitochondrion | Chr. 3 (NC_016006.1) | 100 | 96 | 196 |
Mitochondrion | Total | 3508 | 3531 | 7039 |
Chloroplast | Pltd (NC_007144.1) | 75 | 72 | 147 |
Total | | 210312 | 210411 | 420723 |
- Note: Chr. refers to chromosomes, and Pltd refers to the plastids' genome.
3.2 Identification of novel sORFs (nsORFs)
The blastx search of the identified intergenic sORFs against the UniProtKB cucumber protein database revealed 416 873 nsORFs with no sequence similarity to any annotated cucumber protein (Additional file 3). In contrast, 3850 sORFs showed sequence similarity to annotated cucumber proteins and were not used in our analysis (Figure 1). Additional file (4) contains the fasta sequences of the 3850 sORFs, and Additional file (5) lists their IDs together with the corresponding cucumber proteins having similar sequences.

3.3 Potential expression of nsORFs
3.3.1 Codon usage analysis
Data analysis with the EMBOSS package revealed 8920 nsORFs with CAI ≥ 0.9, 239 177 nsORFs with CAI ≥ 0.8 and 398 937 nsORFs with CAI ≥ 0.7 (Additional file 6). The relative synonymous codon usage (RSCU) values for the reference gene set and the identified nsORFs with CAI ≥ 0.7 are shown in Additional file 7. We found 13 and 19 codons with RSCU values ≥ 1 in the ribosomal protein (RP) genes and the nsORFs, respectively. Nine of these codons are shared between the two datasets.
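For reference, the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally, so a value of 1 or more marks a codon used at least as often as expected. The sketch below computes RSCU for a set of sequences and extracts the codons at or above a threshold. It is an illustration only: excluding the single-codon amino acids (Met and Trp, whose RSCU is always 1) is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons ordered over the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def rscu(seqs):
    """RSCU per codon: count * (number of synonyms) / (family total)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - len(s) % 3, 3):
            codon = s[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    values = {}
    for aa, syn in families.items():
        # Assumption: stops and single-codon amino acids are left out.
        if aa == "*" or len(syn) == 1:
            continue
        total = sum(counts[c] for c in syn)
        if total == 0:
            continue
        for c in syn:
            values[c] = counts[c] * len(syn) / total
    return values


def preferred_codons(values, threshold=1.0):
    """Codons whose RSCU is at or above the threshold."""
    return {c for c, v in values.items() if v >= threshold}
```

Computing `preferred_codons` once for the reference genes and once for the nsORFs, then intersecting the two sets, gives the codons shared between the datasets.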
3.3.2 nsORFs sequence similarity with noncoding and structural RNAs
Searching for sequence similarities between the nsORFs with CAI ≥ 0.7 and RNA families using the Rfam database, Infernal, and EBI cmscan search revealed 109 nsORFs having sequence similarities with different RNA families (Table 3). Fifty nsORFs showed sequence similarity with the large subunit ribosomal RNA (LSU_rRNA_eukarya), while twenty-two nsORFs showed similarity with the small subunit rRNA (SSU_rRNA_eukarya). Seventeen nsORFs were similar to different microRNAs. Eight nsORFs were similar to the group I and II introns, which are self-splicing ribozymes. Five nsORFs showed similarities to tRNA. Three and two nsORFs were similar to LSU_rRNA_bacteria and SSU_rRNA_bacteria, respectively. One nsORF was similar to the iron stress-repressed RNA (IsrR), and another showed similarity to the plant signal recognition particle RNA (Plant_SRP) (Additional file 8).
RNA family ID | Rfam accession | No. of nsORFs |
---|---|---|
Plant_SRP | RF01855 | 1 |
IsrR | RF01419 | 1 |
tRNA | RF00005 | 5 |
MIR169_2 | RF00645 | 1 |
mir-172 | RF00452 | 3 |
mir-395 | RF00451 | 1 |
MIR164 | RF00647 | 2 |
mir-166 | RF00075 | 4 |
MIR159 | RF00638 | 1 |
MIR828 | RF01026 | 1 |
MIR171_1 | RF00643 | 1 |
mir-160 | RF00247 | 3 |
Intron_gpI | RF00028 | 1 |
Intron_gpII | RF00029 | 7 |
LSU_rRNA_eukarya | RF02543 | 50 |
LSU_rRNA_bacteria | RF02541 | 3 |
SSU_rRNA_eukarya | RF01960 | 22 |
SSU_rRNA_bacteria | RF00177 | 2 |
- Note: Plant_SRP refers to plant signal recognition particle. IsrR is iron stress-repressed RNA. tRNA and MIR refer to transfer RNA and microRNAs, respectively. Intron_gpI and Intron_gpII mean intron groups 1 and 2, respectively. LSU_rRNA and SSU_rRNA refer to large and small subunit rRNA, respectively.
3.3.3 Identifying signal peptides
SignalP-5.0 revealed 8377 nsORFs with signal peptides embedded in their sequences (Additional file 9). NLSdb revealed 3256 signals in the amino acid sequences of the identified nsORFs; of those signals, 137 are nuclear export signals (NES), while 3119 are nuclear localisation signals (NLS) (Additional file 10). Comparing the results obtained from SignalP-5.0 and NLSdb revealed 41 overlapping sORFs with identified SPs (Additional file 11).
3.3.4 Mining published cucumber transcriptome data
Exploring the RNAseq data from cucumber interactions with powdery mildew, nematodes, cold, and salt stresses indicated the expression of 19 264, 10 828, 20 941 and 7732 nsORFs, respectively. Comparing the 58 765 expressed sORFs in the four studies revealed 592 and 272 nsORFs commonly expressed under biotic and abiotic stresses, respectively. In addition, 159 nsORFs were commonly expressed in all stresses, of which 71 and 88 were on the forward and reverse strands, respectively (Table 4).
Stress | Study | Treatment | No. of expressed sORFs per treatment | No. of unique sORFs | No. of common sORFs | Common sORFs between biotic (or between abiotic) stresses | Common sORFs between all stresses |
---|---|---|---|---|---|---|---|
Biotic | Powdery mildew | 0 dpi | 9362 | | | 592 | 159 |
Biotic | Powdery mildew | 2 dpi | 9902 | 9902 | | | |
Biotic | Nematode | 0 dpi | 1922 | | | | |
Biotic | Nematode | 1 dpi | 2798 | 6182 | 815 | | |
Biotic | Nematode | 2 dpi | 2876 | | | | |
Biotic | Nematode | 3 dpi | 3232 | | | | |
Abiotic | Cold stress | 0 h | 4710 | | | 272 | |
Abiotic | Cold stress | 2 h | 4961 | 10650 | 1802 | | |
Abiotic | Cold stress | 6 h | 4849 | | | | |
Abiotic | Cold stress | 12 h | 6421 | | | | |
Abiotic | Salt stress | 0 dpi | 1242 | | | | |
Abiotic | Salt stress | 1 dpi | 6490 | 6490 | | | |
- Note: The number of unique nsORFs represents those expressed at least once in each treatment. In the biotic stresses, dpi refers to days postinfection.
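The comparisons summarised in Table 4 are set operations on the IDs of the expressed nsORFs. The sketch below shows one way to derive them, assuming each treatment and study is represented as a Python set of nsORF IDs. Reading "unique" as the union across treatments and "common" as the intersection is an interpretation, and the function names are hypothetical. The sketch returns plain intersections; whether the published biotic and abiotic counts exclude the nsORFs shared by all four stresses is not stated.

```python
def unique_and_common(treatment_sets):
    """Union and intersection of the nsORF IDs across one study's treatments."""
    sets = [set(s) for s in treatment_sets]
    if not sets:
        return set(), set()
    return set.union(*sets), set.intersection(*sets)


def stress_overlaps(powdery_mildew, nematode, cold, salt):
    """nsORFs shared within biotic, within abiotic, and across all stresses."""
    biotic = set(powdery_mildew) & set(nematode)
    abiotic = set(cold) & set(salt)
    return {"biotic": biotic, "abiotic": abiotic, "all": biotic & abiotic}
```

Because the inputs are sets, an nsORF detected in several treatments of the same study is counted once.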
3.4 Function prediction of putative proteins
3.4.1 Predicting host-pathogen interactions (HPIs) using PredHPI
Two sets of proteins, 43 718 and 38, were retrieved from UniProt, representing Meloidogyne incognita (nematode) and powdery mildew, respectively. Looking for apoplastic proteins directly interacting with cucumber revealed 6476 nematode proteins, which were further analysed with PredHPI against the predicted proteins translated from the nematode-responsive sORFs. In the case of mildew, all 38 proteins were used for PPI analysis against the predicted proteins translated from the mildew-responsive sORFs.
The Interolog module implemented in the PredHPI tool identified two putative cucumber proteins translated from nsORFs, hereafter called A and B, interacting with three nematode proteins. Proteins A and B interact with one and two nematode proteins, respectively. In addition, we found HPIs between three putative cucumber proteins translated from nsORFs, hereafter called B, C, and D, and different isoforms of the powdery mildew cytochrome b protein. Additional file 12 includes pathogenic protein IDs, names, and enriched GO terms from UniProtKB.
3.4.2 Gene ontology (GO terms) using FFPred3
The four identified cucumber proteins were further analysed to predict their molecular function (MF), biological process (BP) and cellular components (CC). The FFPred analysis yielded 78 and 92 enriched GO terms for the two nematode-interacting proteins A and B (Additional files 13 & 14) and 61, 62 and 92 enriched GO terms for the three mildew-interacting proteins B, C and D (Additional files 15 & 16). The GO terms with the highest scores for the four proteins in each of the three GO terms classes are listed in Table 5.
sORF-P ID | GO class | GO term | GO ID | Score |
---|---|---|---|---|
M. incognita responsive proteins | | | | |
NC_026656.2_38_12847896: 12847750(−) (A) | MF | Structural constituent of ribosome | GO:0003735 | 0.999 |
NC_026656.2_38_12847896: 12847750(−) (A) | BP | Cellular macromolecule biosynthetic process | GO:0034645 | 0.973 |
NC_026656.2_38_12847896: 12847750(−) (A) | CC | Membrane | GO:0016020 | 0.937 |
NC_026659.2_39_13483406: 13483188(−) (B) | MF | Ribonucleoside binding | GO:0032549 | 0.969 |
NC_026659.2_39_13483406: 13483188(−) (B) | BP | Protein activation cascade | GO:0072376 | 0.991 |
NC_026659.2_39_13483406: 13483188(−) (B) | CC | Intermediate filament | GO:0005882 | 0.980 |
Powdery mildew-responsive proteins | | | | |
NC_026659.2_50_13503053: 13502802(−) (C) | MF | Transporter activity | GO:0005215 | 0.802 |
NC_026659.2_50_13503053: 13502802(−) (C) | BP | Regulation of metabolic process | GO:0019222 | 0.941 |
NC_026659.2_50_13503053: 13502802(−) (C) | CC | Integral component of membrane | GO:0016021 | 1.000 |
NC_026659.2_50_13503053: 13502802(−) (C) | CC | Intrinsic component of membrane | GO:0031224 | 1.000 |
NC_026656.2_26_21441295: 21441188(−) (D) | MF | Cytokine activity | GO:0005125 | 0.957 |
NC_026656.2_26_21441295: 21441188(−) (D) | BP | Regulation of metabolic process | GO:0019222 | 0.896 |
NC_026656.2_26_21441295: 21441188(−) (D) | CC | Intrinsic component of membrane | GO:0031224 | 0.999 |
NC_026659.2_39_13483406: 13483188(−) (B) | MF | Ribonucleoside binding | GO:0032549 | 0.969 |
NC_026659.2_39_13483406: 13483188(−) (B) | BP | Protein activation cascade | GO:0072376 | 0.991 |
NC_026659.2_39_13483406: 13483188(−) (B) | CC | Intermediate filament | GO:0005882 | 0.980 |
- Note: The cucumber proteins are the translated amino acid sequences of the nsORFs. The IDs used for cucumber proteins are the same IDs of the corresponding nsORFs from which they are translated. The IDs represent the chromosome ID, followed by the order of the nsORF on that chromosome and the nsORF coordinate. The negative sign refers to the negative strand on which sORF is located. Proteins were called A, B, C, and D to ease recalling them.
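The ID convention described in the note can be unpacked programmatically. The sketch below parses an ID of the form shown in Table 5 into chromosome accession, order on the chromosome, coordinates and strand. The regular expression is inferred from the four IDs in the table; the form of a forward-strand ID, assumed here to end in `(+)`, does not appear in the text, and `parse_nsorf_id` is a hypothetical name.

```python
import re

# Pattern inferred from IDs such as "NC_026656.2_38_12847896: 12847750(−)".
# Both the ASCII hyphen and the Unicode minus sign (U+2212) are accepted.
ID_PATTERN = re.compile(
    r"^(?P<chrom>.+)_(?P<order>\d+)_(?P<start>\d+):\s*(?P<end>\d+)"
    r"\((?P<strand>[+\-\u2212])\)$"
)


def parse_nsorf_id(nsorf_id):
    """Split an nsORF ID into its components.

    Returns a dict with the chromosome accession, the order of the nsORF
    on that chromosome, the two coordinates as written, the strand, and
    the span in nucleotides (both end positions included).
    """
    match = ID_PATTERN.match(nsorf_id.strip())
    if match is None:
        raise ValueError(f"unrecognised nsORF ID: {nsorf_id!r}")
    start = int(match.group("start"))
    end = int(match.group("end"))
    return {
        "chrom": match.group("chrom"),
        "order": int(match.group("order")),
        "start": start,
        "end": end,
        "strand": "+" if match.group("strand") == "+" else "-",
        "length_nt": abs(start - end) + 1,
    }
```

For protein A, the two coordinates span 147 nucleotides, a multiple of three, which is consistent with a complete reading frame within the 30–100 codon window.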
3.4.3 Cellular localisation of the predicted cucumber proteins
Using the MULocDeep web service, the A, B, C and D proteins were predicted to be mitochondrial, cytoplasmic, endoplasmic and secreted proteins, respectively (Table 6).
sORF-P ID | Predicted subcellular localisations using MULocDeep |
---|---|
M. incognita responsive proteins | |
NC_026656.2_38_12847896: 12847750(−) (A) | Mitochondrion |
NC_026659.2_39_13483406: 13483188(−) (B) | Cytoplasm |
Powdery mildew-responsive proteins | |
NC_026659.2_50_13503053: 13502802(−) (C) | Endoplasmic |
NC_026656.2_26_21441295: 21441188(−) (D) | Secreted |
NC_026659.2_39_13483406: 13483188(−) (B) | Cytoplasm |
4 DISCUSSION
It is commonly known that reannotating available genomes provides substantially more information about the organism under investigation. For example, the reannotation of the Arabidopsis genome resulted in identifying novel protein-coding genes, transcribed areas, short RNAs, noncoding RNAs, and transcribed intergenic regions (Cheng et al., 2017). Therefore, the current study was meant to reannotate the Cucumis sativus genome for better understanding and finding more potential genes.
The cucumber reference genome 9930v3.0 analysed here was assembled into seven scaffolds, representing the seven nuclear chromosomes, ~211 Mbp (Li et al., 2019), one chloroplast and three mitochondrial chromosomes. Our analysis revealed the total size of the intergenic regions to be approximately 116 Mbp, which were further explored for possible functional coding sORF.
The alignment step is one of the most challenging steps in any traditional RNAseq analysis pipeline, where the sequenced reads are aligned to a reference genome to help identify which reads are expressed and at which level (Deshpande et al., 2023). Various alignment approaches can be followed, such as mapping the reads against the annotated transcripts in curated databases such as RefSeq (Pruitt et al., 2006); however, this approach is challenged by incomplete transcriptome data for several organisms. To overcome this limitation, the de novo assembly of reads without mapping to a reference genome was proposed; however, this approach is costly in computational resources and time (Grabherr et al., 2011). A recently proposed approach instead targets novel transcripts, that is, reads that fail to map to any annotated transcript but do map to unannotated regions of the reference genome, such as the intergenic regions (Deshpande et al., 2023). Novel transcripts (reads) were divided into gene-associated and independent transcription units (TUs). Gene-associated TUs are continuous transcription events of already annotated genes and can be upstream of the gene, downstream of the gene or a linker between genes. Independent TUs are unrelated to annotated genes and are therefore considered novel genes producing noncoding RNAs or functional proteins (Agostini et al., 2021).
Here, we followed a systematic approach to investigate the novel reads in the intergenic regions of the cucumber genome to identify nsORFs with high expression potential. To ensure that our analysis focuses primarily on independent TUs, we used the blast search to exclude those sORFs similar to the annotated cucumber proteins, as they could be pseudogenes or conserved protein domains. In addition, finding multiple nsORFs with CAI ≥ 0.7 is strong evidence of their expression and functionality. Finally, the RSCU analysis showed that 19 codons are biased and more frequently used in the nsORFs.
Under the control conditions of the four published studies used in our analysis (powdery mildew, nematode, cold and salt stresses), 9362, 1922, 4710 and 1242 nsORFs were expressed, respectively. An additional 592 nsORFs were common under mildew and nematode infection, making them candidates for coding proteins that respond to biotic stresses. Similarly, 272 nsORFs were commonly expressed under cold and salt stresses, suggesting their candidacy in the cucumber response to abiotic stresses. Another 159 nsORFs were common between biotic and abiotic stresses and may represent general stress-responsive genes, such as transcription factors. Transcription factors represent 10% of plant genes (Gonzalez, 2016), and plants can use a single transcription factor to master the expression of several proteins in response to stresses (Sukumari Nath et al., 2019). A larger number of overlapping sORFs was expected under control conditions; however, differences in the experimental conditions can partially explain the observed discrepancies. For example, the investigated plant materials differed: leaves of the PM-resistant segment substitution cucumber line SSL508-28 infected by powdery mildew (Xu et al., 2017) and root tips of the resistant line IL10–1 infected with the nematode (Wang et al., 2018). Additionally, the two experiments were run at different temperatures: 25℃ day and 20℃ night in the powdery mildew experiment and 28℃ in the nematode experiment.
We scrutinised each library preparation condition of the four RNAseq studies to gain more information about the nature and functionality of the enriched nsORFs. In the powdery mildew study, the authors used the Illumina TruSeq™ RNA sample preparation kit (Illumina), which is known as a non-strand-specific kit that depends on the polyA nature of the mRNAs. No similar conclusions could be drawn from the other three studies as no clear information about the library preparation steps was included.
Searching the Rfam database revealed that around 70% of the nsORFs with Rfam hits have sequence similarity to the large and small ribosomal ribonucleic acid (rRNA) subunits, indicating their possible vital role in the cell. The rRNA itself is not translated into proteins; however, it represents 80% of the total RNA in the cell and about 60% of the ribosome (Alberts, 2004). The remaining 30% could be noncoding functional RNAs.
Additional nsORF sequences were similar to microRNAs. MicroRNAs, lncRNAs, primary miRNAs and circRNAs were thought to lack coding potential; however, several studies (Mat-Sharani and Firdaus-Raih, 2019; Matsumoto et al., 2017; Wu et al., 2022) reported their translation. For example, Pri-miR171d is a primary miRNA in grapevines with three sORFs coding for the small peptides vvi-miPEP171d1, miPEP165a and miPEP171b that together play a vital role in enhancing adventitious root formation (Chen et al., 2020).
Expressed small peptides are expected to carry signals that direct them to different cellular compartments, where they fulfil their functions in maturation, development or stress tolerance. To test this hypothesis, SignalP-5.0 was used, and 8532 signal peptide sequences were detected in the nsORFs. Few of the identified nsORFs carry a general signal peptide, either an NLS or a mitochondrial signal peptide, consistent with earlier studies showing that microproteins can be secreted and localised without signal peptides. For example, the mitochondrial sORF encoding a small peptide of 16 amino acids, called mitochondrial open reading frame of the 12S rRNA-c (MOTS-c), regulates insulin sensitivity and metabolic homeostasis in humans and mice (Lee et al., 2015). In Drosophila, MOTS-c activates the polycistronic polished rice (pri) sORF microproteins, 11–32 amino acids long, which play a vital role in epithelial morphogenesis (Kondo et al., 2007).
The sequence of one nsORF was similar to the iron stress-repressed RNA (IsrR), a cis-encoded antisense RNA essential in regulating the expression of the photosynthetic protein isiA (Dühring et al., 2006). Another nsORF was similar to the plant signal recognition particle RNA (Plant-SRP), which controls the movement of proteins within the cell and helps them bind to transmembrane pores, allowing protein localisation (Ullu and Tschudi, 1984). Eight additional nsORFs matched group I and II introns, which are large, mobile, self-splicing ribozymes that catalyse their own excision from mRNA, tRNA and rRNA (Nielsen and Johansen, 2009) and can be used in bacterial genetic manipulation. For example, group II introns were used in E. coli to disrupt several chromosomal genes, including lacZ, trpE, dadA, and proA (Karberg et al., 2001; Lambowitz and Zimmerly, 2011). Self-splicing group I introns are rare in plant genomes but common in many plant-pathogenic fungal genomes and can be considered an attractive binding site for RNA-related molecules to increase plant resistance against different pathogenic fungal species (Malbert et al., 2023).
Proteins identified in the UniProt database were filtered to retain effectors that interact directly with cucumber; we therefore searched for apoplastic proteins, since apoplastic localisation is solid evidence that a protein directly influences the plant cell during parasitism.
SignalP-5.0 and the NLS databases failed to identify any signal peptides in the sequences of the four predicted sORF proteins showing PPIs with the nematode and powdery mildew proteins. This is plausible because not all proteins need their own signal peptide to be localised: some are co-imported with other proteins that carry such signals via the “piggyback mechanism” (Tessier et al., 2020), and others are translocated without any signal peptide.
The putative cucumber protein A (NC_026656.2_38_12847896:12847750(−)), which is localised to mitochondria and interacted with a nematode protein, has the predicted function of a structural constituent of ribosomes. Ribosomes carry out translation, a highly energy-consuming process, and can selectively translate stress-responsive proteins (Petibon et al., 2021). It has been shown experimentally that the expression of several ribosomal proteins is affected by different stresses, such as cold stress in Arabidopsis (Bae et al., 2003) and infection with Phytophthora capsici in tomato (Howden et al., 2017). Some ribosomal proteins (RPs) also belong to the moonlighting proteins (MLPs). For example, the amphioxus RP L30 was shown to have antimicrobial activity (Chen et al., 2021). Other examples are the RP-coding genes NbRPSaA, NbRPS5A, and NbRPS24A in Nicotiana benthamiana, whose silencing decreased the expression of defence genes such as those encoding antioxidant enzymes (Fakih et al., 2023). It has been demonstrated that many microbial effectors target host mitochondria to regulate immunological responses and plant cell death; however, only a small subset of these effectors have received adequate research attention (Nandi et al., 2021).
The putative protein B (NC_026659.2_39_13483406:13483188(−)), localised to the cytoplasm, has a predicted molecular function as a ribonucleoside-binding protein. Protein B is expected to interact with three pathogen proteins: (1) the nematode voltage-dependent anion-selective channel protein 1 (VDAC-1) (Uozumi et al., 2015), whose knockdown in C. elegans amphid wing C neurons was found to cause chemotaxis defects; (2) the nematode nuclear pore localisation protein NPL4, which forms a chaperone-like complex with other proteins and has a crucial role in S phase progression of mitotic cells and in DNA replication (Mouysset et al., 2008); and (3) mildew cytochrome b, which has received considerable attention because mutations in it confer fungal resistance to quinol oxidation inhibitor (QoI) fungicides (Fernández-Ortuño et al., 2008). Generally, RNA-binding proteins (RBPs) facilitate the regulation of gene expression at the posttranscriptional level. In addition, RBPs enable the mRNA maturation steps of polyadenylation, 5′ capping and splicing (Fedoroff, 2002). RBPs have been shown to be vital in plant responses to biotic and abiotic stresses. For example, AGO1, AGO2 and AGO7 are RBP genes that play a role in the Arabidopsis response to pathogens as components of the RISC complex, stimulating pathogen-induced gene silencing (Ellendorff et al., 2008; Zhang et al., 2011).
The putative protein D (NC_026656.2_26_21441295:21441188(−)) was predicted to have cytokine activity and to interact with powdery mildew cytochrome b. Cytokines are vital hormones that modulate plant growth, development and physiology, and growing evidence confirms their role in enhancing plant resistance against different pathogens. For example, cytokines play vital roles against P. syringae infection in Arabidopsis (Großkinsky et al., 2016) and tobacco (Großkinsky et al., 2011), Erysiphe graminis in wheat (Babosha, 2009) and Magnaporthe oryzae in rice (Akagi et al., 2014).
The putative protein C, with predicted transporter activity, is endoplasmic; it may be a transporter protein in the endoplasmic reticulum or mitochondria and is expected to be expressed in response to mildew cytochrome b. Transporter proteins are a large and diverse group of integral membrane proteins with multiple transmembrane domains and helices, functioning in nearly all biological processes inside plant cells. Sugar transporters are one of the main groups of transporter proteins; the Sugars Will Eventually be Exported Transporters (SWEET) and the ATP-binding cassette (ABC) transporters are common transporter families. SWEET genes are negative regulators of plant disease resistance and are used by pathogens to extract sugars from plant cells. Various SWEET genes were upregulated in Arabidopsis, grapevine and sweet potato after pathogen infection (Breia et al., 2021). ABC transporters are vital in transporting secondary metabolites and play crucial roles in plant defence mechanisms against pathogens (Erb and Kliebenstein, 2020). More studies are needed to decipher the relationships and mechanisms of interaction between the putative cucumber proteins and the nematode and mildew proteins.
Identifying nsORFs in the intergenic regions of the cucumber genome is crucial to better understanding gene regulation and protein diversity. Standard approaches to sORF identification may produce false negatives because of the length threshold used by most gene identification tools and software. A vast and unexplored landscape of nsORFs in the intergenic regions of the cucumber genome can be discovered using combined genomics, transcriptomics and proteomics approaches. In our in silico analysis, we separated the NLSs into nuclear import and export signals, which will help future research interpret the functions of the predicted proteins more accurately based on their localisations. At this point, the presence of signal peptides provides compelling evidence of the discovered sORFs' capacity for expression. However, more work remains to predict sORF functions based on their localisations. Further confirmation through PPIs, gene expression and co-expression network studies, and analysis of null mutants is required to validate their candidacy.
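The effect of a length threshold on sORF detection can be illustrated with a minimal Python sketch. It assumes a simplified scan for ATG-to-stop open reading frames on the forward strand only, in all three frames; the function name `find_orfs`, the toy sequence and the cutoff of 100 codons are illustrative assumptions, not the pipeline or parameters used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=0):
    """Return (start, end) positions of ATG..stop ORFs on the forward strand.

    min_codons counts the codons from ATG up to, but not including, the
    stop codon. The end coordinate is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (made up): one ORF of three codons followed by a stop codon.
toy = "ATGGCTAAATGA"
without_threshold = find_orfs(toy)               # the short ORF is reported
with_threshold = find_orfs(toy, min_codons=100)  # the same ORF is discarded
```

With no threshold the three-codon ORF is returned, whereas an illustrative cutoff of 100 codons removes it, which is how a fixed minimum length produces false negatives for sORFs.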
ACKNOWLEDGEMENTS
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
CONFLICT OF INTEREST STATEMENT
The authors have no relevant financial or nonfinancial interests to disclose.
Open Research
DATA AVAILABILITY STATEMENT
This published article and its Supplementary Data sets include all data generated or analysed during this study. The genome of Cucumis sativus analysed during the current study is available in the NCBI repository under the following link: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000004075.3/. The protein databases of cucumber, Meloidogyne incognita and Podosphaera fusca are available under the following links: https://www.uniprot.org/proteomes/UP000029981. https://www.uniprot.org/uniprotkb?query=meloidogyne+incognita. https://www.uniprot.org/uniprotkb?query=podosphaera+fusca%2C. The RNAseq data analysed are deposited at SRA under the accessions PRJNA419665 (nematode), PRJNA321023 (powdery mildew) and PRJNA438923 (cold stress).