The availability of high-throughput sequencing technologies has increased our understanding of different genomes. However, the genomes of all living organisms still have many unidentified coding sequences. Many small open reading frames (sORFs) are missed because of the length threshold used in most gene identification tools; this holds in the genic and, more importantly and surprisingly, in the intergenic regions. Scanning the intergenic regions of the cucumber genome revealed 420 723 sORFs. We excluded 3850 sORFs with similarities to annotated cucumber proteins. To assess the potential functionality of the remaining 416 873 sORFs, we calculated their codon adaptation index (CAI). We found 398 937 novel sORFs (nsORFs) with CAI ≥ 0.7, which were used for downstream analysis. Searching against the Rfam database revealed 109 nsORFs similar to multiple RNA families. Using SignalP-5.0 and NLSdb, we identified 11 592 signal peptides. Five predicted proteins interacting with Meloidogyne incognita and powdery mildew proteins were selected using published transcriptome data of host-pathogen interactions. Gene ontology enrichment was used to interpret the function of those proteins, illustrating that nsORF expression could contribute to the cucumber's response to biotic and abiotic stresses. This research highlights the importance of previously overlooked nsORFs in the cucumber genome and provides novel insights into their potential functions.

1 INTRODUCTION

Before the genomics era, estimating the number of genes per genome was uncertain, depending on the complexity of the genome under study. The human genome, for example, was expected to contain over 100 000 genes (Schuler et al., 1996). Surprisingly, after genome sequencing and annotation, this figure dropped to approximately a quarter of its original estimate (Stein, 2004). Therefore, the question that arouses the most interest in higher species is how an organism with a relatively small number of genes can perform a wide range of functions. Some of the answers to this question were the epistatic interactions, alternative splicing machinery, non-protein coding genes (Taft et al., 2007), and, most importantly, the dozens of coding genes yet to be annotated (Cheng et al., 2017).

Most software and tools used for gene annotation apply a length threshold below which any coding sequences are ignored. This approach results in many false negatives among small open reading frames (sORFs) shorter than 300 nucleotides. However, several sORFs of 100–300 nucleotides, encoding 30–100 amino acids (a.a.), are functional in many taxa, such as S. cerevisiae (Kastenmayer et al., 2006) and D. melanogaster (Magny et al., 2013). Numerous coding sORFs have already been identified in the intergenic regions of various species, including humans (Jain et al., 2023; Slavoff et al., 2013), Drosophila (Ladoukakis et al., 2011), C. elegans (Casimiro-Soriguer et al., 2020), rice (Stolc et al., 2005), and maize and Arabidopsis (Wang et al., 2020). Therefore, investigating intergenic regions should add valuable knowledge about how many functions are controlled.

In Arabidopsis, 7159 sORFs with probable coding sequences were discovered (Hanada et al., 2007), with 2996 coding sORFs presumably expressed in at least one experimental condition. The same authors discovered 3241 coding sORFs in the Arabidopsis genome that show indications of transcription or purifying selection, suggesting that they likely represent new genes. The pls mutant of the Arabidopsis POLARIS gene, which encodes a predicted polypeptide of 36 a.a., exhibited a short root phenotype and impaired leaf vascularisation (Casson et al., 2002). C-terminally encoded peptide (CEP), a 15-amino acid posttranslationally modified peptide discovered in Arabidopsis, is a long-distance root-to-shoot signalling molecule under N-starvation conditions and is essential for lateral root development and nodulation (Aggarwal et al., 2020; Taleski et al., 2018). A transcriptome study of the rice response to iron excess or deficiency revealed three and 90 upregulated sORFs in roots and 1076 and 50 upregulated sORFs in shoots, respectively (Bashir et al., 2014).

sORFs can be categorised into five groups: (1) intergenic sORFs (found between genes), (2) sORFs found in the upstream 5′ untranslated region (UTR) of an mRNA (uORFs), (3) sORFs found in long noncoding RNAs (lncORFs), (4) short coding sequences (short CDS) and (5) short isoforms resulting from alternative splicing of other coding genes (Ong et al., 2022). Recently, an additional type of sORF has been identified in the primary hairpin structure of microRNAs (pri-miRNAs) in several plant species and fewer animal species (Erokhina et al., 2023). The small peptides encoded by pri-miRNAs are called miPEPs (Dong et al., 2023).

Gene expression and gene prediction are significantly influenced by codon content (Powell and Moriyama, 1997; Salamov and Solovyev, 2000). Although synonymous codons encode for the same amino acid, they are used at different frequencies in most genes and organisms, a phenomenon called codon usage bias (CUB). Increased CUB in a genomic region indicates that it is being selected for characteristics that influence its expression levels, including translation, transcription, mRNA stability, and co-translational protein folding. Numerous, easy, and computationally efficient CUB indices reflect the expression pattern by analysing the sequenced genomes of different organisms. Such indices provide information on factors that cannot be measured experimentally (Bahiri-Elitzur and Tuller, 2021). Of those indices, relative synonymous codon usage (RSCU) determines whether a specific codon is used more frequently than anticipated (Khandia et al., 2019). Another index, the codon adaptation index (CAI), is based on codon frequency in a reference set of genes (Sharp and Li, 1987) and significantly correlates with protein abundance and mRNA levels across the genome (Zhou et al., 2016).

The primary determinant of a protein's function is its subcellular location. In silico localisation can be achieved by sequence homology to verified and localised proteins (Adelfio et al., 2013). Two more widely used techniques look for GO features related to specific localisations (Mei, 2012) or for the existence of motifs recognised by the receptors of the protein transport machinery (Lin et al., 2011). Recent in silico tools include SignalP-5.0 (Almagro Armenteros et al., 2019), the nuclear localisation signal database NLSdb (Nair, 2003), and MULocDeep (Jiang et al., 2021, 2023).

Synthesised proteins are translocated to the endoplasmic reticulum and its derivatives, such as the Golgi complex, vacuoles and storage protein bodies, or to the nucleus, mitochondria and plastids (Rozov and Deineko, 2022). Signal peptides (SPs) that facilitate localisation through the first route differ in structure but generally consist of 20–30 aa with three domains: an N-terminal positively charged domain (1–7 aa), a hydrophobic central domain (7–15 aa) and a C-terminal polar domain (3–7 aa) (von Heijne, 1990). SPs responsible for protein translocation to the nucleus, mitochondria or chloroplast can be identified using SignalP-5.0. Two pathways mediate protein translocation across the plasma membrane and endoplasmic reticulum: the general secretory pathway (Sec) and the twin-arginine translocation pathway (Tat). During protein translocation, SPs are removed by signal peptidase I (SPaseI) (Voulhoux, 2001).

Proteins do not function in isolation in the cell; they interact with other proteins to perform all the complex biological processes inside the cell. Protein-protein interactions (PPIs) take place either within the same species (intra-species) or between host and pathogen (inter-species), the latter known as host-pathogen interactions (HPIs). Experimental proteomics alone cannot give a complete understanding of such interactions; therefore, multiple bioinformatics algorithms have been used to decipher such mechanisms (Kaundal et al., 2022; Loaiza and Kaundal, 2021; Loaiza et al., 2020). HPIs are drawing more attention as they can point to therapeutics for multiple plant diseases (Han et al., 2023).

The current study aimed to improve the annotation of the cucumber genome by finding all the expressed sORFs in the intergenic regions and trying to interpret their functions. Out of 420 723 intergenic sORFs, we excluded 3850 with sequences similar to annotated cucumber proteins. The remaining 416 873 novel intergenic sORFs (nsORFs) were further investigated for their possible transcription and translation.

2 MATERIALS AND METHODS

2.1 In silico identification of sORFs in intergenic regions

Our analysis is mainly based on the 9930v3 reference genome. The third version of the genome assembly of the Chinese long inbred line 9930 (Li et al., 2019) was obtained from NCBI. We extracted the intergenic regions from the cucumber genome (seven nuclear, one chloroplast, and three mitochondrial chromosomes) using a customised script and BEDTools (Quinlan and Hall, 2010). We then used EMBOSS (Rice et al., 2000) to identify sORFs in the intergenic regions of the cucumber genome using 30 and 100 amino acids as minimum and maximum length thresholds, respectively (Hanada et al., 2007). For comparison, we used the Gy14v2.1 reference genome.
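The length filter described above can be illustrated with a minimal Python sketch. It is not the EMBOSS procedure used in the study: it assumes that an ORF runs from an ATG to the first in-frame stop codon, that length is counted in amino acids excluding the stop, and the function names (`scan_strand`, `find_sorfs`) are hypothetical.

```python
# Illustrative sketch only: a simple ATG-to-stop ORF scan on both strands.
# The study used EMBOSS; the ORF definition and all names here are assumptions.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_aa, max_aa):
    """Yield (start, end) of ORFs on one strand, 0-based, end exclusive.

    An ORF starts at the first ATG in a frame and ends at the next in-frame
    stop codon. Length is counted in amino acids, excluding the stop codon.
    """
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_aa = (pos - start) // 3
                if min_aa <= n_aa <= max_aa:
                    yield start, pos + 3
                start = None


def find_sorfs(seq, min_aa=30, max_aa=100):
    """Return sORFs as (strand, start, end) in strand-local coordinates."""
    seq = seq.upper()
    hits = [("+", s, e) for s, e in scan_strand(seq, min_aa, max_aa)]
    rc = reverse_complement(seq)
    hits += [("-", s, e) for s, e in scan_strand(rc, min_aa, max_aa)]
    return hits
```

Coordinates on the reverse strand are reported relative to the reverse-complemented sequence; converting them back to forward-strand positions is left out to keep the sketch short.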

2.2 Identification of novel sORFs (nsORFs)

The newly discovered intergenic sORFs were filtered to identify nsORFs that could be translated into novel proteins not curated in the database, that is, those with no sequence similarity to any recognised cucumber protein. In the filtration step, the nucleotide sequences of the intergenic sORFs were used as the query against the cucumber protein database acquired from UniProtKB (https://www.uniprot.org/proteomes) (Coudert et al., 2022) using a command line BLASTX search and an e-value threshold of 1e-20.
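As a rough illustration of this filtration step, the sketch below separates sORF IDs into novel and similar sets from BLAST tabular output. It assumes the search was written in the default tabular format (`-outfmt 6`), where the first column is the query ID and the eleventh is the e-value; the function name and the in-memory input are hypothetical and not taken from the study.

```python
# Illustrative sketch: split sORFs into "novel" and "similar" sets from BLAST
# tabular output (-outfmt 6). Names and input handling are assumptions.


def split_by_blast_hits(all_ids, blast_lines, evalue_cutoff=1e-20):
    """Return (novel_ids, similar_ids).

    blast_lines: iterable of tab-separated lines in the default outfmt 6
    column order (column 1 = query ID, column 11 = e-value).
    A query is "similar" if at least one hit has e-value <= evalue_cutoff.
    """
    similar = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= evalue_cutoff:
            similar.add(query_id)
    novel = [i for i in all_ids if i not in similar]
    return novel, sorted(similar)
```

Applied to the counts reported in this study, the two returned lists would correspond to the nsORFs kept for downstream analysis and the sORFs excluded for similarity to annotated proteins.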

2.3 Potential expression of nsORFs

As this study aims to identify potentially expressed intergenic sORFs, several steps of downstream analysis were performed on the nsORFs as follows.

2.3.1 Codon usage analysis

The EMBOSS package (Rice et al., 2000) was used to calculate the CAI of the nsORFs using the following equation (Sharp and Li, 1987):

$$\mathrm{CAI} = \left(\prod_{i=1}^{L} w_i\right)^{\frac{1}{L}},$$

where L is the number of codons in the gene and w_i, the relative adaptiveness of the ith codon, was calculated using the following equation

$$w_i = \frac{f_i}{\max(f_j)},$$

where f_i is the frequency of the ith codon, and max(f_j) is the frequency of the synonymous codon most often used to encode the same amino acid in a set of highly expressed genes of the genome under study. The maximum CAI value is 1.

We used a set of cucumber genes coding for ribosomal proteins (RPs) as a highly expressed gene reference set; the IDs of these genes are provided in Additional file 1. To enrich for nsORFs that are likely to be expressed, we used nsORFs with CAI ≥ 0.7 for the downstream steps.
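The CAI definition above can be sketched in a few lines of Python. This is not the EMBOSS implementation used in the study. It assumes the standard genetic code, skips stop codons and the single-codon amino acids (Met, Trp), replaces zero counts in the reference set with a pseudocount of 0.5, and takes L as the number of scored codons; all function names are hypothetical.

```python
# Illustrative CAI sketch (not the EMBOSS implementation).
# Assumptions: standard genetic code; Met, Trp and stop codons are skipped;
# zero counts in the reference set are replaced by a pseudocount.
import math
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a DNA string into complete codons, dropping a trailing remainder."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Compute w_i = f_i / max(f_j) over each synonymous codon family."""
    counts = Counter()
    for seq in reference_seqs:
        counts.update(c for c in codons(seq) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, syn in families.items():
        if aa == "*" or len(syn) == 1:
            continue  # stop codons and single-codon amino acids carry no bias
        freqs = {c: counts[c] or pseudocount for c in syn}
        top = max(freqs.values())
        for c in syn:
            weights[c] = freqs[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of w_i over the scored codons of seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(logs))
```

With weights built from the ribosomal protein reference set, a threshold such as `cai(seq, weights) >= 0.7` would reproduce the kind of filter described here, although the exact values may differ from those reported by EMBOSS.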

RSCU values were calculated for the nsORFs and the reference gene set using the R package coRdon (Elek et al., 2019). RSCU is calculated using the following equation, dividing the observed frequency of a specific codon by the expected frequency of that codon, under the assumption that the different synonymous codons of an amino acid are used equally (Sharp and Li, 1987). A codon is used less frequently than expected when its RSCU value is less than one. A codon is deemed moderately frequent when its RSCU equals one and more frequent than expected when its RSCU exceeds one (Anwar et al., 2021).

$$\mathrm{RSCU}_{ij} = \frac{x_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}},$$

where x_ij is the number of occurrences of the jth codon for the ith amino acid, and n_i is the number (from 1 to 6) of alternative codons for the ith amino acid.
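The RSCU formula can be illustrated with a short Python sketch. The study used the R package coRdon; the code below is an independent, simplified re-implementation that assumes the standard genetic code, pools codon counts over all input sequences, and leaves RSCU undefined for amino acids that do not occur. The function name `rscu` is hypothetical.

```python
# Illustrative RSCU sketch (the study itself used the R package coRdon).
# Assumptions: standard genetic code; counts are pooled over all sequences.
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seqs):
    """Return {codon: RSCU} for every codon of each amino acid observed."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, syn in families.items():
        total = sum(counts[c] for c in syn)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        expected = total / len(syn)  # (1 / n_i) * sum_j x_ij
        for c in syn:
            values[c] = counts[c] / expected
    return values
```

Codons with a value above one in the returned dictionary are those used more often than expected under equal synonymous usage.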

2.3.2 The similarity of nsORF sequences to noncoding and structural RNAs

Using the Infernal-1.1.2 programme and EBI cmscan search (Nawrocki and Eddy, 2013), the nsORFs with CAI ≥ 0.7 were queried against the Rfam open-source database (Kalvari et al., 2018), which contains information about noncoding and structural RNAs to identify any similarity between the nsORFs and the RNAs in the database.

2.3.3 Identifying signal peptides

The presence of SPs in any nsORFs is strong evidence of their expression. SignalP-5.0 (Almagro Armenteros et al., 2019) was used to predict the existence of SPs in amino acid sequences of the nsORFs (CAI ≥ 0.7) and their cleavage site. Nuclear localisation signals (NLS) and nuclear export signals (NES) were predicted using the NLSdb (https://rostlab.org/services/nlsdb/) (Nair, 2003).

2.3.4 Mining published cucumber transcriptome data

RNAseq datasets generated from mRNAs represent all expressed fragments in the genome, whether from annotated genes or not. In a typical RNAseq analysis pipeline, only the reads mapped to annotated genes are retained after the mapping step, while reads mapped elsewhere, for example to intergenic regions, are neglected. As our study aims to identify sORFs in those intergenic regions and to provide evidence of their expression, we searched already published RNAseq data for sequencing reads that map to our nsORFs with CAI ≥ 0.7. RNAseq data from four experiments studying the cucumber response to biotic (powdery mildew and nematodes) and abiotic (cold and salt) stresses were downloaded from the SRA database (Table 1). Single-end raw reads from the cold stress study and paired-end raw reads from the remaining three studies were filtered to obtain clean reads for further analysis. The quality of the reads was checked using FastQC (Andrews, 2010), and then Trimmomatic (Bolger et al., 2014) was used to eliminate low-quality reads and adaptor sequences. Finally, we used Bowtie2 (Langmead et al., 2009) to align the reads to the nsORFs, which served as an indexed reference.
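Once reads have been aligned against the nsORF sequences, deciding which nsORFs are represented in a dataset amounts to counting aligned reads per reference name. The sketch below does this for plain-text SAM records. The threshold `min_reads` and the function names are assumptions for illustration; the source does not state the criterion used to call an nsORF expressed.

```python
# Illustrative sketch: count primary alignments per reference (nsORF) in SAM
# text. The expression threshold (min_reads) is an assumption, not the
# criterion used in the study.
from collections import Counter

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_mapped_reads(sam_lines):
    """Return a Counter of primary mapped reads per reference name."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, ref_name = int(fields[1]), fields[2]
        if flag & UNMAPPED or ref_name == "*":
            continue
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, at its primary alignment
        counts[ref_name] += 1
    return counts


def expressed_nsorfs(counts, min_reads=1):
    """Return the set of references supported by at least min_reads reads."""
    return {name for name, n in counts.items() if n >= min_reads}
```

For paired-end libraries, each mate is counted as a separate read in this sketch.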

Table 1. List of the published transcriptome data reanalysed in our study.

Study	Cucurbit genome atlas IDs	Library Layout	Treatments	SRA IDs	Reference
P. mildew	PRJNA321023	Paired-end	0 dpi	SRR3488595	Xu et al. (2017)
P. mildew	PRJNA321023	Paired-end	2 dpi	SRR3488598	Xu et al. (2017)
Nematode	PRJNA419665	Paired-end	0 dpi	SRR6324165	Wang et al. (2018)
			1 dpi	SRR6324160
			2 dpi	SRR6324158
			3 dpi	SRR6324176
Cold stress	PRJNA438923	Single-end	0 h	SRR6854681	Nanda et al. (2023)
			2 h	SRR6854687
			6 h	SRR6854693
			12 h	SRR6854699
Salt stress	PRJNA437579	Paired-end	0 dpi	SRR6821841	Huang et al. (2019)
Salt stress	PRJNA437579	Paired-end	1 dpi	SRR6821835	Huang et al. (2019)

Note: Cucurbit genome atlas IDs for each project and SRA accession numbers for each treatment are listed along with the references of each study.

2.4 Function prediction of putative proteins

In an attempt to de novo annotate the putative proteins predicted from the translation of nsORFs represented in the RNAseq data of the biotic stresses, we analysed the sequences of those proteins following two different routes: (1) studying the HPI by searching for any PPI between predicted cucumber proteins and nematode or powdery mildew proteins, and (2) predicting protein function using gene ontology.

2.4.1 Predicting host-pathogen protein-protein interactions (PredHPI)

For HPIs, the PredHPI tool (Loaiza and Kaundal, 2021), specifically the Interolog module, was used to analyse the PPIs between the proteins of the two pathogens and the putative cucumber proteins translated from the previously identified sORFs expressed in the transcriptome data of the infected treatments. Protein sequences of the pathogens, Meloidogyne incognita and Podosphaera fusca, were downloaded from the UniProtKB database (https://www.uniprot.org/proteomes/) on the 5th of November 2023. We used the unreviewed (TrEMBL) UniProtKB protein entries, which include 43 718 proteins for M. incognita and 38 proteins for P. fusca.

To filter the 43 718 nematode proteins for further HPI analysis, we used the ApoplastP tool (Sperschneider et al., 2018) to check which nematode proteins are apoplastic. On the other hand, as the number of P. fusca proteins on the UniProtKB database is very low (38 proteins), we used all of them for the HPI analysis. ApoplastP is a machine-learning tool that predicts the localisation of pathogen effectors and plant proteins to the plant apoplast.

2.4.2 Gene ontology (GO terms): FFPred3

The feature-based function prediction for all GO domains (FFPred 3) (Cozzetto et al., 2016), a machine learning approach, was used for protein function prediction. FFPred facilitates the assignment of GO classes (cellular components) to the queried putative proteins using SVM classifiers. The putative proteins expressed from sORFs showing HPIs in the previous step were submitted to FFPred3 to assign their GO terms.

2.4.3 Cellular localisation of the predicted cucumber proteins

To localise the five cucumber proteins showing PPIs with the nematodes and powdery mildew proteins, we used MULocDeep, a web service for protein localisation prediction at the subcellular level, which is based on a species-specific model with improved performance compared to other tools (Jiang et al., 2021, 2023).

3 RESULTS

3.1 In silico identification of sORFs in intergenic regions

The total size of the intergenic regions is ~116 Mbp, accounting for 51.60% of the entire chromosomes. Using EMBOSS, we identified 420 723 intergenic sORFs (Additional file 2) with a total length of 25 243 380 bp, representing 21.58% and 11.13% of the intergenic regions and cucumber chromosome sizes, respectively. Of the 420 723 identified sORFs, 413 537 were placed on the seven nuclear chromosomes, of which 206 729 and 206 808 were on the forward and reverse strands, respectively. The base compositions were A = 32.40%, C = 16.49%, G = 16.94%, T = 34.17% and GC = 33.42%. The cucumber mitochondrial genome consists of three chromosomes, in whose intergenic regions we identified 7039 sORFs; 3508 and 3531 sORFs were on the forward and reverse strands, respectively. The intergenic sequences of the chloroplast chromosome contained 147 sORFs, of which 75 and 72 were on the forward and reverse strands, respectively (Table 2). In contrast, the Gy14v2.1 reference genome was assembled into seven nuclear chromosomes, chromosome zero and 758 scaffolds without annotated mitochondrial or chloroplast chromosomes. Therefore, we considered the Gy14v2.1 reference genome insufficiently annotated and unsuitable for our study, and only 9930v3.0 was used for further analysis.

Table 2. The distribution of intergenic sORFs on the forward and reverse strands of nuclear, mitochondrial and chloroplast chromosomes.

Genome	Chr. Accession numbers	Forward strand	Reverse strand	Total no.
Nuclear	Chr. 1 (NC_026655.2)	32876	33290	66166
	Chr. 2 (NC_026656.2)	23791	23617	47408
	Chr. 3 (NC_026657.2)	38082	38213	76295
	Chr. 4 (NC_026658.2)	27764	27571	55335
	Chr. 5 (NC_026659.2)	33232	33148	66380
	Chr. 6 (NC_026660.2)	28999	28580	57579
	Chr. 7 (NC_026661.2)	21985	22389	44374
	Total	206729	206808	413537
Mitochondrion	Chr. 1 (NC_016005.1)	3222	3258	6480
	Chr. 2 (NC_016004.1)	186	177	363
	Chr. 3 (NC_016006.1)	100	96	196
	Total	3508	3531	7039
Chloroplast	Pltd (NC_007144.1)	75	72	147
Total		210312	210411	420723

Note: Chr. refers to chromosomes, and Pltd refers to the plastids' genome.

3.2 Identification of novel sORFs (nsORFs)

The BLASTX search of the identified intergenic sORFs against the UniProtKB cucumber protein database revealed 416 873 nsORFs with no sequence similarity to any annotated cucumber protein (Additional file 3). In contrast, 3850 sORFs showed sequence similarities with annotated cucumber proteins and were not used in our analysis (Figure 1). Additional file 4 contains the FASTA sequences of the 3850 sORFs, and Additional file 5 lists their IDs together with the corresponding cucumber proteins having similar sequences.

Figure 1. The search pipeline followed to identify intergenic sORFs in the cucumber genome. Numbers indicate sORFs determined at each step.

3.3 Potential expression of nsORFs

3.3.1 Codon usage analysis

Analysis with the EMBOSS package revealed 8920 nsORFs with CAI ≥ 0.9, 239 177 nsORFs with CAI ≥ 0.8 and 398 937 nsORFs with CAI ≥ 0.7 (Additional file 6). The relative synonymous codon usage (RSCU) values for the reference gene set and the identified nsORFs with CAI ≥ 0.7 are shown in Additional file 7. We found 13 and 19 codons with RSCU values ≥ 1 in the ribosomal protein (RP) genes and the nsORFs, respectively. Of these codons, nine are shared between the two datasets.

3.3.2 nsORFs sequence similarity with noncoding and structural RNAs

Searching the Rfam database with Infernal and the EBI cmscan search for sequence similarities between the nsORFs with CAI ≥ 0.7 and RNA families revealed 109 nsORFs with sequence similarities to different RNA families (Table 3). Fifty nsORFs showed sequence similarity with the large subunit ribosomal RNA (LSU_rRNA_eukarya), while twenty-two nsORFs showed similarity with the small subunit rRNA (SSU_rRNA_eukarya). Seventeen nsORFs were similar to different microRNAs. Eight nsORFs were similar to the group I and II introns, which are self-splicing ribozymes. Five nsORFs showed similarities to tRNA. Three and two nsORFs were similar to LSU_rRNA_bacteria and SSU_rRNA_bacteria, respectively. One nsORF was similar to the iron stress repressed RNA (IsrR), and another showed similarity to the plant signal recognition particle RNA (Plant_SRP) (Additional file 8).

Table 3. Entries in the RNA families database (Rfam) showing sequence similarities with intergenic novel sORFs (nsORFs).

RNA family ID	Rfam accession	No. of nsORFs
Plant_SRP	RF01855	1
IsrR	RF01419	1
tRNA	RF00005	5
MIR169_2	RF00645	1
mir-172	RF00452	3
mir-395	RF00451	1
MIR164	RF00647	2
mir-166	RF00075	4
MIR159	RF00638	1
MIR828	RF01026	1
MIR171_1	RF00643	1
mir-160	RF00247	3
Intron_gpI	RF00028	1
Intron_gpII	RF00029	7
LSU_rRNA_eukarya	RF02543	50
LSU_rRNA_bacteria	RF02541	3
SSU_rRNA_eukarya	RF01960	22
SSU_rRNA_bacteria	RF00177	2

Note: Plant_SRP refers to plant signal recognition particle. IsrR is iron stress-repressed RNA. tRNA and MIR refer to transfer RNA and microRNAs, respectively. Intron_gpI and Intron_gpII mean intron groups 1 and 2, respectively. LSU_rRNA and SSU_rRNA refer to large and small subunit rRNA, respectively.

3.3.3 Identifying signal peptides

SignalP-5.0 revealed 8377 nsORFs with signal peptides embedded in their sequences (Additional file 9). NLSdb revealed the existence of 3256 signals in the amino acid sequences of the identified nsORFs; of those signals, 137 are nuclear export signals (NES), while 3119 are nuclear localisation signals (NLS) (Additional file 10). Comparing the results obtained from SignalP-5.0 and NLSdb revealed 41 overlapping nsORFs with identified SPs (Additional file 11).

3.3.4 Mining published cucumber transcriptome data

Exploring the RNAseq data from cucumber interactions with powdery mildew, nematodes, cold, and salt stresses supported the expression of 19 264, 10 828, 20 941 and 7732 nsORFs, respectively. Comparing the 58 765 expressed nsORFs in the four studies revealed 592 and 272 nsORFs commonly expressed under biotic and abiotic stresses, respectively. In addition, 159 nsORFs were commonly expressed under all stresses, of which 71 and 88 were on the forward and reverse strands, respectively (Table 4).

Table 4. Number of intergenic nsORFs found in earlier transcriptome studies from powdery mildew, nematode infections, and cold and salt stresses.

Stress		Treatments	No. of expressed sORFs per treatment	No. of unique sORFs	No. of common sORFs	Common sORFs between biotic stresses	Common sORFs between stresses
Biotic	Powdery mildew	0 dpi	9362			592	159
	Powdery mildew	2 dpi	9902	9902
	Nematode	0 dpi	1922
		1 dpi	2798	6182	815
		2 dpi	2876
		3 dpi	3232
Abiotic	Cold stress	0 h	4710			272
		2 h	4961	10650	1802
		6 h	4849
		12 h	6421
	Salt stress	0 dpi	1242
	Salt stress	1 dpi	6490	6490

Note: The number of unique nsORFs represents those expressed at least once in each treatment. In the biotic stresses, dpi refers to days postinfection.
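The comparison across studies reported in this section can be thought of as set intersections over nsORF IDs. The sketch below shows the idea with made-up IDs. It assumes that "common" means present in every study of a group; the helper name `common_across` is hypothetical and the toy sets are not data from the study.

```python
# Illustrative sketch: nsORFs shared between studies as set intersections.
# The IDs below are toy placeholders, not results from the study.


def common_across(groups):
    """Return the IDs present in every collection in groups."""
    sets = [set(g) for g in groups]
    return set.intersection(*sets) if sets else set()


# Toy example with placeholder IDs.
mildew = {"orf_a", "orf_b", "orf_c"}
nematode = {"orf_b", "orf_c", "orf_d"}
cold = {"orf_c", "orf_e"}
salt = {"orf_b", "orf_c"}

biotic_common = common_across([mildew, nematode])
abiotic_common = common_across([cold, salt])
all_common = common_across([mildew, nematode, cold, salt])
```

In this toy example, `biotic_common` contains the IDs shared by the two biotic sets, and `all_common` contains only the ID present in all four sets.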

3.4 Function prediction of putative proteins

3.4.1 Predicting host-pathogen interactions (HPIs) using PredHPI

Two sets of proteins, 43 718 and 38, were retrieved from UniProt, representing Meloidogyne incognita (nematode) and powdery mildew, respectively. Looking for apoplastic proteins that could directly interact with cucumber revealed 6476 nematode proteins, which were further analysed with PredHPI against the predicted proteins translated from the nematode-responsive sORFs. In the case of mildew, all 38 proteins were used for PPI analysis against the predicted proteins translated from the mildew-responsive sORFs.

The Interolog module implemented in the PredHPI tool identified two putative cucumber proteins translated from nsORFs, hereafter called A and B, interacting with three nematode proteins. Proteins A and B interact with one and two nematode proteins, respectively. In addition, we found HPIs between three putative cucumber proteins translated from nsORFs, hereafter called B, C and D, and different isoforms of the powdery mildew cytochrome b protein. Additional file 12 includes pathogen protein IDs, names, and enriched GO terms from UniProtKB.

3.4.2 Gene ontology (GO terms) using FFPred3

The four identified cucumber proteins were further analysed to predict their molecular function (MF), biological process (BP) and cellular component (CC). The FFPred analysis yielded 78 and 92 enriched GO terms for the two nematode-interacting proteins A and B (Additional files 13 & 14) and 61, 62 and 92 enriched GO terms for the three mildew-interacting proteins C, D and B (Additional files 15 & 16). The GO terms with the highest scores for the four proteins in each of the three GO classes are listed in Table 5.

Table 5. Predicted molecular function (MF), biological process (BP) and cellular components (CC) of the five cucumber proteins that showed HPI with M. incognita and powdery mildew proteins.

sORF-P ID	GO terms		GO ID	Score
M. incognita responsive proteins
NC_026656.2_38_12847896: 12847750(−) (A)	MF	Structural constituent of ribosome	GO:0003735	0.999
	BP	Cellular macromolecule biosynthetic process	GO:0034645	0.973
	CC	Membrane	GO:0016020	0.937
NC_026659.2_39_13483406: 13483188(−) (B)	MF	Ribonucleoside binding	GO:0032549	0.969
	BP	Protein activation cascade	GO:0072376	0.991
	CC	Intermediate filament	GO:0005882	0.980
Powdery mildew-responsive proteins
NC_026659.2_50_13503053: 13502802(−) (C)	MF	Transporter activity	GO:0005215	0.802
	BP	Regulation of metabolic processes	GO:0019222	0.941
	CC	Integral component of membrane	GO:0016021	1.000
	CC	Intrinsic component of membrane	GO:0031224	1.000
NC_026656.2_26_21441295: 21441188(−) (D)	MF	Cytokine activity	GO:0005125	0.957
	BP	Regulation of metabolic process	GO:0019222	0.896
	CC	Intrinsic component of membrane	GO:0031224	0.999
NC_026659.2_39_13483406: 13483188(−) (B)	MF	Ribonucleoside binding	GO:0032549	0.969
	BP	Protein activation cascade	GO:0072376	0.991
	CC	Intermediate filament	GO:0005882	0.980

Note: The cucumber proteins are the translated amino acid sequences of the nsORFs. The IDs used for cucumber proteins are the same IDs of the corresponding nsORFs from which they are translated. The IDs represent the chromosome ID, followed by the order of the nsORF on that chromosome and the nsORF coordinate. The negative sign refers to the negative strand on which sORF is located. Proteins were called A, B, C, and D to ease recalling them.

3.4.3 Cellular localisation of the predicted cucumber proteins

Using the MULocDeep web service, the A, B, C and D proteins were predicted to be mitochondrial, cytoplasmic, endoplasmic and secreted proteins, respectively (Table 6).

Table 6. Predicted subcellular localisations using MULocDeep of the five cucumber proteins that showed HPI with M. incognita and powdery mildew proteins.

sORF-P ID	Predicted subcellular localisation using MULocDeep
M. incognita responsive proteins
NC_026656.2_38_12847896: 12847750(−) (A)	Mitochondrion
NC_026659.2_39_13483406: 13483188(−) (B)	Cytoplasm
Powdery mildew-responsive proteins
NC_026659.2_50_13503053: 13502802(−) (C)	Endoplasmic
NC_026656.2_26_21441295: 21441188(−) (D)	Secreted
NC_026659.2_39_13483406: 13483188(−) (B)	Cytoplasm

4 DISCUSSION

It is commonly known that reannotating available genomes provides substantially more information about the organism under investigation. For example, the reannotation of the Arabidopsis genome resulted in identifying novel protein-coding genes, transcribed areas, short RNAs, noncoding RNAs, and transcribed intergenic regions (Cheng et al., 2017). Therefore, the current study was meant to reannotate the Cucumis sativus genome for better understanding and finding more potential genes.

The cucumber reference genome 9930v3.0 analysed here was assembled into seven scaffolds representing the seven nuclear chromosomes (~211 Mbp; Li et al., 2019), together with one chloroplast and three mitochondrial chromosomes. Our analysis revealed the total size of the intergenic regions to be approximately 116 Mbp, which were further explored for possible functional coding sORFs.

The alignment step is one of the most challenging steps in any traditional RNAseq analysis pipeline, where the sequenced reads are aligned to a reference genome to help identify which reads are expressed and at which level (Deshpande et al., 2023). Various alignment approaches can be followed, such as mapping the reads against the annotated transcripts in curated databases such as RefSeq (Pruitt et al., 2006); however, this approach is challenged by incomplete transcriptome data for several organisms. To overcome this limitation, the de novo assembly of reads without mapping to a reference genome was proposed; however, this approach is costly in computational resources and time (Grabherr et al., 2011). A recently proposed approach maps novel transcripts, that is, the reads that fail to map to an annotated transcript but do map to unannotated regions of the reference genome, such as the intergenic regions (Deshpande et al., 2023). Novel transcripts (reads) were divided into gene-associated and independent transcription units (TUs). Gene-associated TUs are continuous transcription events of already annotated genes and can be upstream of the gene, downstream of the gene or a linker of genes. Independent TUs are unrelated to annotated genes and are therefore considered novel genes producing noncoding RNAs or functional proteins (Agostini et al., 2021).

Here, we followed a systematic approach to investigate the novel reads in the intergenic regions of the cucumber genome and to identify nsORFs with high expression potential. To ensure that our analysis focuses primarily on independent TUs, we used a BLAST search to exclude sORFs similar to annotated cucumber proteins, as they could be pseudogenes or conserved protein domains. In addition, finding many nsORFs with CAI ≥ 0.7 supports their expression and functionality. Finally, the relative synonymous codon usage (RSCU) analysis identified 19 codons that are biased and more frequently used in the nsORFs.
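The two codon-usage measures mentioned above can be illustrated with a minimal Python sketch. It follows the textbook definitions: RSCU is the observed count of a codon divided by the mean count within its synonymous family, and CAI (Sharp and Li, 1987) is the geometric mean of the relative adaptiveness w of the codons in a sequence, where w is a codon's count divided by that of the most used synonymous codon in a reference set. The reference sequence, the candidate, the pseudocount and all function names below are assumptions for illustration and do not reproduce the software used in this study.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families; stop codons and single-codon amino acids
# (Met, Trp) are skipped because they carry no usage information.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    FAMILIES.setdefault(aa, []).append(codon)
SYNONYMOUS = [fam for aa, fam in FAMILIES.items() if aa != "*" and len(fam) > 1]


def codons(seq):
    """Split a coding sequence into complete, in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_counts(seqs):
    """Count recognised codons over a collection of coding sequences."""
    return Counter(c for s in seqs for c in codons(s) if c in CODON_TABLE)


def rscu(seqs):
    """RSCU = observed codon count / mean count within its synonymous family."""
    counts, out = codon_counts(seqs), {}
    for fam in SYNONYMOUS:
        total = sum(counts[c] for c in fam)
        if total:
            for c in fam:
                out[c] = counts[c] * len(fam) / total
    return out


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """w = codon count / count of the most used synonymous codon.

    The pseudocount is an assumption that keeps unseen codons from giving w = 0.
    """
    counts, w = codon_counts(reference_seqs), {}
    for fam in SYNONYMOUS:
        best = max(counts[c] for c in fam) + pseudocount
        for c in fam:
            w[c] = (counts[c] + pseudocount) / best
    return w


def cai(seq, w):
    """Geometric mean of w over the informative codons of seq."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Hypothetical reference set and candidate, for illustration only.
reference = ["GCTGCTGCTGCC"]
weights = relative_adaptiveness(reference)
candidate = "GCTGCTGCC"
keep = cai(candidate, weights) >= 0.7  # same cutoff as in the text
```

In practice the reference set would be a collection of highly expressed cucumber genes; the choice of that set strongly affects the resulting CAI values.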

Under the control conditions of the four published studies used in our analysis (powdery mildew, nematode, cold and salt stresses), 9362, 1922, 4710 and 1242 nsORFs were expressed, respectively. An additional 592 nsORFs were common to the mildew and nematode studies, making them candidates for coding proteins that respond to biotic stresses. In addition, 272 nsORFs were commonly expressed under cold and salt stresses, suggesting a role in the cucumber response to abiotic stresses. Another 159 nsORFs were common to biotic and abiotic stresses and may represent general stress-responsive genes, such as transcription factors. Transcription factors represent 10% of plant genes (Gonzalez, 2016), and plants can use a single transcription factor to control the expression of several proteins in response to stresses (Sukumari Nath et al., 2019). A larger number of overlapping sORFs under control conditions was expected; however, differences in the experimental conditions can partially explain the observed discrepancies. For example, the plant materials differed: leaves of the PM-resistant segment substitution cucumber line SSL508-28 infected with powdery mildew (Xu et al., 2017) versus root tips of the resistant line IL10–1 infected with the nematode (Wang et al., 2018). Additionally, the two experiments used different temperatures: 25℃ day and 20℃ night in the powdery mildew experiment and 28℃ in the nematode experiment.
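The overlaps described above amount to set operations on the identifiers of the nsORFs expressed in each study. The following sketch uses made-up identifiers; the dictionary keys, the helper `shared` and the decision to report mutually exclusive categories are assumptions for illustration, not a description of how the published counts were derived.

```python
def shared(*groups):
    """Identifiers present in every one of the given collections."""
    return set.intersection(*map(set, groups))


# Hypothetical nsORF identifiers expressed in each study.
expressed = {
    "powdery_mildew": {"nsORF_1", "nsORF_2", "nsORF_3", "nsORF_5"},
    "nematode": {"nsORF_2", "nsORF_3", "nsORF_4"},
    "cold": {"nsORF_3", "nsORF_6", "nsORF_7"},
    "salt": {"nsORF_3", "nsORF_6", "nsORF_8"},
}

biotic = shared(expressed["powdery_mildew"], expressed["nematode"])
abiotic = shared(expressed["cold"], expressed["salt"])

# nsORFs seen under both kinds of stress, then the exclusive remainders.
general = biotic & abiotic
biotic_only = biotic - general
abiotic_only = abiotic - general
```

Reporting exclusive categories avoids counting the same nsORF as both a biotic and a general stress-response candidate.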

We scrutinised the library preparation conditions of each of the four RNAseq studies to gain more information about the nature and functionality of the enriched nsORFs. In the powdery mildew study, the authors used the Illumina TruSeq™ RNA sample preparation kit (Illumina), a non-strand-specific kit that relies on the poly(A) tails of mRNAs, which suggests that the nsORF transcripts detected in that study are polyadenylated. No similar conclusions could be drawn from the other three studies, as they included no clear information about the library preparation steps.

The Rfam database search revealed that around 70% of the sORFs have sequence similarity to the large and small subunit ribosomal RNAs (rRNAs), indicating a possible vital role in the cell. The rRNA itself is not translated into protein; however, it represents about 80% of the total RNA in the cell and about 60% of the ribosome by mass (Alberts, 2004). The remaining 30% could be noncoding functional RNAs.

Additional nsORF sequences were similar to microRNAs. MicroRNAs, lncRNAs, primary miRNAs and circRNAs were thought to lack coding potential; however, several studies (Mat-Sharani and Firdaus-Raih, 2019; Matsumoto et al., 2017; Wu et al., 2022) reported their translation. For example, Pri-miR171d is a primary miRNA in grapevine with three sORFs coding for the small peptides vvi-miPEP171d1, miPEP165a and miPEP171b, which together play a vital role in enhancing adventitious root formation (Chen et al., 2020).

Expressed small peptides may carry signals that direct them to different cellular compartments to fulfil their functions in maturation, development or stress tolerance. To test this hypothesis, SignalP-5.0 was used, and 8532 signal peptide sequences were detected in the nsORFs. Few of the identified nsORFs carry a general signal peptide, either an NLS or a mitochondrial signal peptide, in line with earlier studies illustrating how different microproteins can be secreted and localised without signal peptides. For example, the mitochondrial sORF encoding a 16-amino-acid peptide called mitochondrial open reading frame of the 12S rRNA-c (MOTS-c) regulates insulin sensitivity and metabolic homeostasis in humans and mice (Lee et al., 2015). In Drosophila, MOTS-c activates the polycistronic polished rice (pri) sORF microproteins of 11-32 amino acids, which play a vital role in epithelial morphogenesis (Kondo et al., 2007).

The sequence of one nsORF was similar to the iron stress-repressed RNA (IsrR), a cis-encoded antisense RNA essential in regulating the expression of the photosynthetic protein isiA (Dühring et al., 2006). Another nsORF was similar to the plant signal recognition particle RNA (Plant-SRP), which controls the movement of proteins within the cell and helps them bind to transmembrane pores, allowing protein localisation (Ullu and Tschudi, 1984). An additional eight nsORFs were similar to group I and II introns, which are large, mobile, self-splicing ribozymes that catalyse their own excision from mRNA, tRNA and rRNA (Nielsen and Johansen, 2009) and can be used in bacterial genetic manipulation. Group II introns, for example, were used in E. coli to disrupt several chromosomal genes, including lacZ, trpE, dadA, and proA (Karberg et al., 2001; Lambowitz and Zimmerly, 2011). Self-splicing group I introns are rare in plant genomes but common in many plant-pathogenic fungal genomes and can be considered an attractive binding site for RNA-related molecules to increase plant resistance against different pathogenic fungal species (Malbert et al., 2023).

Proteins identified in the UniProt database were filtered to retain effectors that interact directly with cucumber; we therefore searched for apoplastic proteins, whose localisation supports a direct influence on the plant cell during parasitism.

SignalP-5.0 and the NLS databases failed to identify any signal peptides in the sequences of the four proteins predicted from sORFs that showed PPIs with the nematode and powdery mildew proteins. This is not unexpected, because not all proteins need signal peptides to be localised: some are co-imported with other proteins that carry particular signals via the “piggyback mechanism” (Tessier et al., 2020), and others are translocated without any signal peptide.

The putative cucumber protein A (NC_026656.2_38_12847896:12847750(−)), localised to mitochondria, interacted with a nematode protein and is predicted to function as a structural constituent of ribosomes. Ribosomes translate mRNAs, a highly energy-consuming process, and can selectively translate stress-responsive proteins (Petibon et al., 2021). It has been shown experimentally that the expression of several ribosomal proteins is affected by different stresses, such as cold stress in Arabidopsis (Bae et al., 2003) and infection with Phytophthora capsici in tomato (Howden et al., 2017). Some ribosomal proteins (RPs) also belong to the moonlighting proteins (MLPs). For example, the amphioxus RP L30 was shown to have antimicrobial activity (Chen et al., 2021). Other RP-coding genes are NbRPSaA, NbRPS5A, and NbRPS24A in Nicotiana benthamiana; silencing them decreased the expression of defence genes such as those encoding antioxidant enzymes (Fakih et al., 2023). Many microbial effectors have been shown to target host mitochondria to regulate immunological responses and plant cell death; however, only a small subset of these effectors has received adequate research attention (Nandi et al., 2021).

The putative protein B (NC_026659.2_39_13483406:13483188(−)), localised to the cytoplasm, has a predicted molecular function as a ribonucleoside-binding protein. Protein B is expected to interact with three pathogen proteins: (1) the nematode voltage-dependent anion-selective channel protein 1 (VDAC-1) (Uozumi et al., 2015), whose knockdown in C. elegans amphid wing C neurons was found to cause chemotaxis defects; (2) the nematode nuclear pore localisation protein NPL4, which forms a chaperone-like complex with other proteins and has a crucial role in S phase progression of mitotic cells and DNA replication (Mouysset et al., 2008); and (3) mildew cytochrome b, which has gained great attention because mutations in it can confer fungal resistance to quinol oxidation inhibitor (QoI) fungicides (Fernández-Ortuño et al., 2008). Generally, RNA-binding proteins (RBPs) facilitate the regulation of gene expression at the posttranscriptional level. In addition, RBPs enable the mRNA maturation steps of polyadenylation, 5′ capping and splicing (Fedoroff, 2002). RBPs have been shown to be vital in plant responses to biotic and abiotic stresses. For example, AGO1, AGO2 and AGO7 are RBP genes that play a role in the Arabidopsis response to pathogens as components of the RISC complex, which stimulates pathogen-induced gene silencing (Ellendorff et al., 2008; Zhang et al., 2011).

The putative protein D (NC_026656.2_26_21441295:21441188(−)) was predicted to have cytokine activity and to interact with powdery mildew cytochrome b. Cytokinins are vital hormones that modulate plant growth, development and physiology, and growing evidence confirms their role in enhancing plant resistance against different pathogens. For example, cytokinins play vital roles against P. syringae infection in Arabidopsis (Großkinsky et al., 2016) and tobacco (Großkinsky et al., 2011), Erysiphe graminis in wheat (Babosha, 2009) and Magnaporthe oryzae in rice (Akagi et al., 2014).

The putative protein C, with predicted transporter activity, is localised to the endoplasmic reticulum, can be a transporter protein in the endoplasmic reticulum or mitochondria, and is expected to be expressed in response to mildew cytochrome b. Transporter proteins form an enormous and diverse group of integral membrane proteins with multiple transmembrane domains and helices, functioning in nearly all biological processes inside plant cells. Among the main groups, the sugars will eventually be exported transporters (SWEETs) and the ATP-binding cassette (ABC) transporters are common. SWEET genes are negative regulators of plant disease resistance, exploited by pathogens to extract sugars from plant cells. Various SWEET genes were upregulated in Arabidopsis, grapevine and sweet potato after pathogen infection (Breia et al., 2021). ABC transporters are vital in transporting secondary metabolites and play crucial roles in plant defence mechanisms against pathogens (Erb and Kliebenstein, 2020). More studies are needed to decipher the relationships and mechanisms of interaction between the putative cucumber proteins and the nematode and mildew proteins.

Identifying nsORFs in the intergenic regions of the cucumber genome is crucial to better understanding gene regulation and protein diversity. Standard sORF identification approaches may produce false negatives because of the length threshold used by most gene identification tools and software. A vast and unexplored landscape of nsORFs in the intergenic regions of the cucumber genome can be discovered by combining genomics, transcriptomics and proteomics approaches. In our in silico analysis, we separated the NLSs into nuclear import and export signals, which will aid future research in interpreting the functions of the predicted proteins more accurately based on their localisations. At this point, the presence of signal peptides supports the capacity of the discovered sORFs for expression. However, more work remains to predict sORF functions based on their localisations. Further confirmation by PPIs, gene expression and co-expression network studies, and analysis of null mutants is required to validate their candidacy.
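The effect of a length threshold on sORF discovery can be seen in a toy ORF scan. The sketch below searches both strands in all three frames for ATG-to-stop ORFs and filters them by nucleotide length; the example sequence, the 300-nt cutoff and the function names are assumptions for illustration and are not the scanning tool or parameters used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=0, max_len=None):
    """Yield (strand, start, end, orf) for every ATG...stop ORF.

    Lengths are in nucleotides and include the stop codon. Coordinates on
    the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len and (max_len is None or length <= max_len):
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None                   # look for the next ORF


# Hypothetical intergenic fragment containing one 12-nt ORF.
fragment = "CCATGAAATTTTAGCC"
all_orfs = list(find_orfs(fragment))
long_only = list(find_orfs(fragment, min_len=300))  # conventional-style cutoff
```

With the cutoff applied, the short ORF disappears from the output, which is the false-negative behaviour described above.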

ACKNOWLEDGEMENTS

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

CONFLICT OF INTEREST STATEMENT

The authors have no relevant financial or nonfinancial interests to disclose.

Open Research

DATA AVAILABILITY STATEMENT

This published article and its Supplementary Data sets include all data generated or analysed during this study. The genomes of Cucumis sativus analysed during the current study are available in the NCBI repository under the following link: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000004075.3/. The protein databases of cucumber, Meloidogyne incognita and Podosphaera fusca are available under the following links, respectively: https://www.uniprot.org/proteomes/UP000029981, https://www.uniprot.org/uniprotkb?query=meloidogyne+incognita and https://www.uniprot.org/uniprotkb?query=podosphaera+fusca%2C. The RNAseq data analysed are deposited at SRA under the accessions PRJNA419665 (nematode), PRJNA321023 (powdery mildew) and PRJNA438923 (cold stress).

REFERENCES

Adelfio, A., Volpato, V. & Pollastri, G. (2013) SCLpredT: Ab initio and homology-based prediction of subcellular localization by N-to-1 neural networks. SpringerPlus, 2(1), 502.
10.1186/2193-1801-2-502
Aggarwal, S., Kumar, A., Jain, M., Sudan, J., Singh, K., Kumari, S. et al. (2020) C-terminally encoded peptides (CEPs) are potential mediators of abiotic stress response in plants. Physiology and Molecular Biology of Plants, 26(10), 2019–2033.
10.1007/s12298-020-00881-4
Agostini, F., Zagalak, J., Attig, J., Ule, J. & Luscombe, N.M. (2021) Intergenic RNA mainly derives from nascent transcripts of known genes. Genome Biology, 22(1), 136.
10.1186/s13059-021-02350-x
Akagi, A., Fukushima, S., Okada, K., Jiang, C.-J., Yoshida, R., Nakayama, A. et al. (2014) WRKY45-dependent priming of diterpenoid phytoalexin biosynthesis in rice and the role of cytokinin in triggering the reaction. Plant Molecular Biology, 86(1), 171–183.
10.1007/s11103-014-0221-x
Alberts, B. (2004) Molecular biology of the cell, Cells and genomes. Garland: National Center for Biotechnology Information.
Almagro Armenteros, J.J., Tsirigos, K.D., Sønderby, C.K., Petersen, T.N., Winther, O., Brunak, S. et al. (2019) SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature Biotechnology, 37(4), 420–423.
10.1038/s41587-019-0036-z
Andrews, S. (2010) FastQC: a quality control tool for high throughput sequence data, Babraham bioinformatics. Cambridge, United Kingdom: Babraham Institute.
Anwar, A.M., Aljabri, M. & El-Soda, M. (2021) Patterns of genome-wide codon usage bias in tobacco, tomato and potato. Biotechnology & Biotechnological Equipment, 35(1), 657–664.
10.1080/13102818.2021.1911684
Babosha, A.V. (2009) Regulation of resistance and susceptibility in wheat–powdery mildew pathosystem with exogenous cytokinins. Journal of Plant Physiology, 166(17), 1892–1903.
10.1016/j.jplph.2009.05.014
Bae, M.S., Cho, E.J., Choi, E.-Y. & Park, O.K. (2003) Analysis of the arabidopsis nuclear proteome and its response to cold stress. The Plant Journal, 36(5), 652–663.
10.1046/j.1365-313X.2003.01907.x
Bahiri-Elitzur, S. & Tuller, T. (2021) Codon-based indices for modeling gene expression and transcript evolution. Computational and Structural Biotechnology Journal, 19, 2646–2663.
10.1016/j.csbj.2021.04.042
Bashir, K., Hanada, K., Shimizu, M., Seki, M., Nakanishi, H. & Nishizawa, N.K. (2014) Transcriptomic analysis of rice in response to iron deficiency and excess. Rice, 7(1), 18.
10.1186/s12284-014-0018-1
Bolger, A.M., Lohse, M. & Usadel, B. (2014) Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics, 30(15), 2114–2120.
10.1093/bioinformatics/btu170
Breia, R., Conde, A., Badim, H., Fortes, A.M., Gerós, H. & Granell, A. (2021) Plant SWEETs: from sugar transport to plant–pathogen interaction and more unexpected physiological roles. Plant Physiology, 186(2), 836–852.
10.1093/plphys/kiab127
Casimiro-Soriguer, C.S., Rigual, M.M., Brokate-Llanos, A.M., Muñoz, M.J., Garzón, A., Pérez-Pulido, A.J. et al. (2020) Using AnABlast for intergenic sORF prediction in the Caenorhabditis elegans genome. Bioinformatics, 36(19), 4827–4832.
10.1093/bioinformatics/btaa608
Casson, S.A., Chilley, P.M., Topping, J.F., Evans, I.M., Souter, M.A. & Lindsey, K. (2002) The POLARIS gene of arabidopsis encodes a predicted peptide required for correct root growth and leaf vascular patterning. The Plant Cell, 14(8), 1705–1721.
10.1105/tpc.002618
Chen, Q., Deng, B., Gao, J., Zhao, Z., Chen, Z., Song, S. et al. (2020) A miRNA-encoded small peptide, vvi-miPEP171d1, regulates adventitious root formation. Plant Physiology, 183(2), 656–670.
10.1104/pp.20.00197
Chen, Y., Yao, L., Wang, Y., Ji, X., Gao, Z., Zhang, S. et al. (2021) Identification of ribosomal protein L30 as an uncharacterized antimicrobial protein. Developmental and Comparative Immunology, 120, 104067.
10.1016/j.dci.2021.104067
Cheng, C.-Y., Krishnakumar, V., Chan, A.P., Thibaud-Nissen, F., Schobel, S. & Town, C.D. (2017) Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. The Plant Journal, 89(4), 789–804.
10.1111/tpj.13415
Coudert, E., Gehant, S., de Castro, E., Pozzato, M., Baratin, D., Neto, T. et al. (2022) Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics, 39(1), btac793. https://doi.org/10.1093/bioinformatics/btac793
Cozzetto, D., Minneci, F., Currant, H. & Jones, D.T. (2016) FFPred 3: feature-based function prediction for all gene ontology domains. Scientific Reports, 6(1), 31865.
10.1038/srep31865
Deshpande, D., Chhugani, K., Chang, Y., Karlsberg, A., Loeffler, C., Zhang, J. et al. (2023) RNA-seq data science: from raw data to effective interpretation. Frontiers in Genetics, 14, 997383. https://doi.org/10.3389/fgene.2023.997383
Dong, X., Zhang, K., Xun, C., Chu, T., Liang, S., Zeng, Y. et al. (2023) Small open reading frame-encoded micro-peptides: an emerging protein world. International Journal of Molecular Sciences, 24(13), 10562.
10.3390/ijms241310562
Dühring, U., Axmann, I.M., Hess, W.R. & Wilde, A. (2006) An internal antisense RNA regulates expression of the photosynthesis gene isiA. Proceedings of the National Academy of Sciences, 103(18), 7054–7058.
10.1073/pnas.0600927103
Elek, A., Kuzman, M. & Vlahoviček, K. (2019) Cordon: codon usage analysis and prediction of gene expressivity. Bioconductor, 3, 8.
Ellendorff, U., Fradin, E.F., de Jonge, R. & Thomma, B.P.H.J. (2008) RNA silencing is required for arabidopsis defence against verticillium wilt disease. Journal of Experimental Botany, 60(2), 591–602.
10.1093/jxb/ern306
Erb, M. & Kliebenstein, D.J. (2020) Plant secondary metabolites as defenses, regulators, and primary metabolites: the blurred functional trichotomy. Plant Physiology, 184(1), 39–52.
10.1104/pp.20.00433
Erokhina, T.N., Ryazantsev, D.Y., Zavriev, S.K. & Morozov, S.Y. (2023) Regulatory miPEP open reading frames contained in the primary transcripts of microRNAs. International Journal of Molecular Sciences, 24(3), 2114.
10.3390/ijms24032114
Fakih, Z., Plourde, M.B. & Germain, H. (2023) Differential participation of plant ribosomal proteins from the small ribosomal subunit in protein translation under stress. Biomolecules, 13(7), 1160.
10.3390/biom13071160
Fedoroff, N. (2002) RNA-binding proteins in plants: the tip of an iceberg? Current Opinion in Plant Biology, 5(5), 452–459.
10.1016/S1369-5266(02)00280-7
Fernández-Ortuño, D., Torés, J.A., de Vicente, A. & Pérez-García, A. (2008) Field resistance to QoI fungicides in podosphaera fusca is not supported by typical mutations in the mitochondrial cytochrome b gene. Pest Management Science, 64(7), 694–702.
10.1002/ps.1544
Gonzalez, D.H. (2016) Introduction to transcription factor structure and function. In: DH Gonzalez, editor. Plant transcription factors. Boston: Academic Press. pp. 3–11.
10.1016/B978-0-12-800854-6.00001-4
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644–652.
10.1038/nbt.1883
Großkinsky, D.K., Naseem, M., Abdelmohsen, U.R., Plickert, N., Engelke, T., Griebel, T. et al. (2011) Cytokinins mediate resistance against Pseudomonas syringae in tobacco through increased antimicrobial phytoalexin synthesis independent of salicylic acid signaling. Plant Physiology, 157(2), 815–830.
10.1104/pp.111.182931
Großkinsky, D.K., Tafner, R., Moreno, M.V., Stenglein, S.A., García de Salamone, I.E., Nelson, L.M. et al. (2016) Cytokinin production by pseudomonas fluorescens G20-18 determines biocontrol activity against pseudomonas syringae in arabidopsis. Scientific Reports, 6(1), 23310.
10.1038/srep23310
Han, Z., Xiong, D., Schneiter, R. & Tian, C. (2023) The function of plant PR1 and other members of the CAP protein superfamily in plant–pathogen interactions. Molecular Plant Pathology, 24(6), 651–668.
10.1111/mpp.13320
Hanada, K., Zhang, X., Borevitz, J.O., Li, W.-H. & Shiu, S.-H. (2007) A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Research, 17(5), 632–640.
10.1101/gr.5836207
von Heijne, G. (1990) The signal peptide. The Journal of Membrane Biology, 115(3), 195–201.
10.1007/BF01868635
Howden, A.J.M., Stam, R., Martinez Heredia, V., Motion, G.B., ten Have, S., Hodge, K. et al. (2017) Quantitative analysis of the tomato nuclear proteome during phytophthora capsici infection unveils regulators of immunity. New Phytologist, 215(1), 309–322.
10.1111/nph.14540
Huang, Y., Cao, H., Yang, L., Chen, C., Shabala, L., Xiong, M. et al. (2019) Tissue-specific respiratory burst oxidase homolog-dependent H2O2 signaling to the plasma membrane H+-ATPase confers potassium uptake and salinity tolerance in cucurbitaceae. Journal of Experimental Botany, 70(20), 5879–5893.
10.1093/jxb/erz328
Jain, N., Richter, F., Adzhubei, I., Sharp, A.J. & Gelb, B.D. (2023) Small open reading frames: a comparative genetics approach to validation. BMC Genomics, 24(1), 226.
10.1186/s12864-023-09311-7
Jiang, Y., Jiang, L., Akhil, C.S., Wang, D., Zhang, Z., Zhang, W. et al. (2023) MULocDeep web service for protein localization prediction and visualization at subcellular and suborganellar levels. Nucleic Acids Research, 51(W1), W343–W349.
10.1093/nar/gkad374
Jiang, Y., Wang, D., Yao, Y., Eubel, H., Künzler, P., Møller, I.M. et al. (2021) MULocDeep: a deep-learning framework for protein subcellular and suborganellar localization prediction with residue-level interpretation. Computational and Structural Biotechnology Journal, 19, 4825–4839.
10.1016/j.csbj.2021.08.027
Kalvari, I., Nawrocki, E.P., Argasinska, J., Quinones-Olvera, N., Finn, R.D., Bateman, A. et al. (2018) Non-coding RNA analysis using the Rfam database. Current Protocols in Bioinformatics, 62(1), e51.
10.1002/cpbi.51
Karberg, M., Guo, H., Zhong, J., Coon, R., Perutka, J. & Lambowitz, A.M. (2001) Group II introns as controllable gene targeting vectors for genetic manipulation of bacteria. Nature Biotechnology, 19(12), 1162–1167.
10.1038/nbt1201-1162
Kastenmayer, J.P., Ni, L., Chu, A., Kitchen, L.E., Au, W.-C., Yang, H. et al. (2006) Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Research, 16(3), 365–373.
10.1101/gr.4355406
Kaundal, R., Loaiza, C.D., Duhan, N. & Flann, N. (2022) deepHPI: a comprehensive deep learning platform for accurate prediction and visualization of host–pathogen protein–protein interactions. Briefings in Bioinformatics, 23(3), bbac125. https://doi.org/10.1093/bib/bbac125
Khandia, R., Singhal, S., Kumar, U., Ansari, A., Tiwari, R., Dhama, K. et al. (2019) Analysis of Nipah virus codon usage and adaptation to hosts. Frontiers in Microbiology, 10, 886. https://doi.org/10.3389/fmicb.2019.00886
Kondo, T., Hashimoto, Y., Kato, K., Inagaki, S., Hayashi, S. & Kageyama, Y. (2007) Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA. Nature Cell Biology, 9(6), 660–665.
10.1038/ncb1595
Ladoukakis, E., Pereira, V., Magny, E.G., Eyre-Walker, A. & Couso, J. (2011) Hundreds of putatively functional small open reading frames in drosophila. Genome Biology, 12(11), R118.
10.1186/gb-2011-12-11-r118
Lambowitz, A.M. & Zimmerly, S. (2011) Group II introns: mobile ribozymes that invade DNA. Cold Spring Harbor Perspectives in Biology, 3(8), a003616.
10.1101/cshperspect.a003616
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
10.1186/gb-2009-10-3-r25
Lee, C., Zeng, J., Drew, B.G., Sallam, T., Martin-Montalvo, A., Wan, J. et al. (2015) The mitochondrial-derived peptide MOTS-c promotes metabolic homeostasis and reduces obesity and insulin resistance. Cell Metabolism, 21(3), 443–454.
10.1016/j.cmet.2015.02.009
Li, Q., Li, H., Huang, W., Xu, Y., Zhou, Q., Wang, S. et al. (2019) A chromosome-scale genome assembly of cucumber (Cucumis sativus L.). GigaScience, 8(6), giz072. https://doi.org/10.1093/gigascience/giz072
Lin, T.h, Murphy, R.F. & Bar-Joseph, Z. (2011) Discriminative motif finding for predicting protein subcellular localization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(2), 441–451.
10.1109/TCBB.2009.82
Loaiza, C.D., Duhan, N., Lister, M. & Kaundal, R. (2020) In silico prediction of host–pathogen protein interactions in melioidosis pathogen Burkholderia pseudomallei and human reveals novel virulence factors and their targets. Briefings in Bioinformatics, 22(3), bbz162. https://doi.org/10.1093/bib/bbz162
Loaiza, C.D. & Kaundal, R. (2021) PredHPI: an integrated web server platform for the detection and visualization of host–pathogen interactions using sequence-based methods. Bioinformatics, 37(5), 622–624.
10.1093/bioinformatics/btaa862
Magny, E.G., Pueyo, J.I., Pearl, F.M.G., Cespedes, M.A., Niven, J.E., Bishop, S.A. et al. (2013) Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames. Science, 341(6150), 1116–1120.
10.1126/science.1238802
Malbert, B., Labaurie, V., Dorme, C. & Paget, E. (2023) Group I intron as a potential target for antifungal compounds: development of a trans-splicing high-throughput screening strategy. Molecules, 28(11), 4460.
10.3390/molecules28114460
Mat-Sharani, S. & Firdaus-Raih, M. (2019) Computational discovery and annotation of conserved small open reading frames in fungal genomes. BMC Bioinformatics, 19(13), 551.
10.1186/s12859-018-2550-2
Matsumoto, A., Pasut, A., Matsumoto, M., Yamashita, R., Fung, J., Monteleone, E. et al. (2017) mTORC1 and muscle regeneration are regulated by the LINC00961-encoded SPAR polypeptide. Nature, 541(7636), 228–232.
10.1038/nature21034
Mei, S. (2012) Multi-label multi-kernel transfer learning for human protein subcellular localization. PLoS One, 7(6), e37716.
10.1371/journal.pone.0037716
Mouysset, J., Deichsel, A., Moser, S., Hoege, C., Hyman, A.A. & Gartner, A. et al. (2008) Cell cycle progression requires the CDC-48^UFD-1/NPL-4 complex for efficient DNA replication. Proceedings of the National Academy of Sciences, 105(35), 12879–12884.
10.1073/pnas.0805944105
Nair, R. (2003) NLSdb: database of nuclear localization signals. Nucleic Acids Research, 31(1), 397–399.
10.1093/nar/gkg001
Nanda, S., Rout, P., Ullah, I., Nag, S.R., Reddy, V.V., Kumar, G. et al. (2023) Genome-wide identification and molecular characterization of CRK gene family in cucumber (Cucumis sativus L.) under cold stress and sclerotium rolfsii infection. BMC Genomics, 24(1), 219.
10.1186/s12864-023-09319-z
Nandi, I., Aroeti, L., Ramachandran, R.P., Kassa, E.G., Zlotkin-Rivkin, E. & Aroeti, B. (2021) Type III secreted effectors that target mitochondria. Cellular Microbiology, 23(9), e13352.
10.1111/cmi.13352
Nawrocki, E.P. & Eddy, S.R. (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22), 2933–2935.
10.1093/bioinformatics/btt509
Nielsen, H. & Johansen, S.D. (2009) Group I introns: moving in new directions. RNA Biology, 6(4), 375–383.
10.4161/rna.6.4.9334
Ong, S.N., Tan, B.C., Al-Idrus, A. & Teo, C.H. (2022) Small open reading frames in plant research: from prediction to functional characterization. 3 Biotech, 12(3), 76.
10.1007/s13205-022-03147-w
Petibon, C., Malik Ghulam, M., Catala, M. & Abou Elela, S. (2021) Regulation of ribosomal protein genes: an ordered anarchy. WIREs RNA, 12(3), e1632.
10.1002/wrna.1632
Powell, J.R. & Moriyama, E.N. (1997) Evolution of codon usage bias in Drosophila. Proceedings of the National Academy of Sciences, 94(15), 7784–7790.
10.1073/pnas.94.15.7784
Pruitt, K.D., Tatusova, T. & Maglott, D.R. (2006) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35(suppl_1), D61–D65.
Quinlan, A.R. & Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842.
10.1093/bioinformatics/btq033
Rice, P., Longden, I. & Bleasby, A. (2000) EMBOSS: the european molecular biology open software suite. Trends in Genetics, 16(6), 276–277.
10.1016/S0168-9525(00)02024-2
Rozov, S. & Deineko, E. (2022) Increasing the efficiency of the accumulation of recombinant proteins in plant cells: the role of transport signal peptides. Plants, 11(19), 2561.
10.3390/plants11192561
Salamov, A.A. & Solovyev, V.V. (2000) Ab initio gene finding in drosophila genomic DNA. Genome Research, 10(4), 516–522.
10.1101/gr.10.4.516
Schuler, G.D., Boguski, M.S., Stewart, E.A., Stein, L.D., Gyapay, G., Rice, K. et al. (1996) A gene map of the human genome. Science, 274(5287), 540–546.
10.1126/science.274.5287.540
Sharp, P.M. & Li, W.-H. (1987) The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281–1295.
10.1093/nar/15.3.1281
Slavoff, S.A., Mitchell, A.J., Schwaid, A.G., Cabili, M.N., Ma, J., Levin, J.Z. et al. (2013) Peptidomic discovery of short open reading frame–encoded peptides in human cells. Nature Chemical Biology, 9(1), 59–64.
10.1038/nchembio.1120
Sperschneider, J., Dodds, P.N., Singh, K.B. & Taylor, J.M. (2018) ApoplastP: prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist, 217(4), 1764–1778.
10.1111/nph.14946
Stein, L.D. (2004) End of the beginning. Nature, 431(7011), 915–916.
10.1038/431915a
Stolc, V., Li, L., Wang, X., Li, X., Su, N., Tongprasit, W. et al. (2005) A pilot study of transcription unit analysis in rice using oligonucleotide tiling-path microarray. Plant Molecular Biology, 59(1), 137–149.
10.1007/s11103-005-6164-5
Sukumari Nath, V., Kumar Mishra, A., Kumar, A., Matoušek, J. & Jakše, J. (2019) Revisiting the role of transcription factors in coordinating the defense response against citrus bark cracking viroid infection in commercial hop (Humulus lupulus L.). Viruses, 11(5), 419.
10.3390/v11050419
Taft, R.J., Pheasant, M. & Mattick, J.S. (2007) The relationship between non-protein-coding DNA and eukaryotic complexity. BioEssays, 29(3), 288–299.
10.1002/bies.20544
Taleski, M., Imin, N. & Djordjevic, M.A. (2018) CEP peptide hormones: key players in orchestrating nitrogen-demand signalling, root nodulation, and lateral root development. Journal of Experimental Botany, 69(8), 1829–1836.
10.1093/jxb/ery037
Tessier, T.M., MacNeil, K.M. & Mymryk, J.S. (2020) Piggybacking on classical import and other non-classical mechanisms of nuclear import appear highly prevalent within the human proteome. Biology, 9(8), 188.
10.3390/biology9080188
Ullu, E. & Tschudi, C. (1984) Alu sequences are processed 7SL RNA genes. Nature, 312(5990), 171–172.
10.1038/312171a0
Uozumi, T., Hamakawa, M., Deno, Y., Nakajo, N. & Hirotsu, T. (2015) Voltage-dependent anion channel (VDAC-1) is required for olfactory sensing in Caenorhabditis elegans. Genes to Cells, 20(10), 802–816.
10.1111/gtc.12269
Voulhoux, R. (2001) Involvement of the twin-arginine translocation system in protein secretion via the type II pathway. The EMBO Journal, 20(23), 6735–6741.
10.1093/emboj/20.23.6735
Wang, S., Tian, L., Liu, H., Li, X., Zhang, J., Chen, X. et al. (2020) Large-scale discovery of non-conventional peptides in maize and Arabidopsis through an integrated peptidogenomic pipeline. Molecular Plant, 13(7), 1078–1093.
10.1016/j.molp.2020.05.012
Wang, X., Cheng, C., Zhang, K., Tian, Z., Xu, J., Yang, S. et al. (2018) Comparative transcriptomics reveals suppressed expression of genes related to auxin and the cell cycle contributes to the resistance of cucumber against Meloidogyne incognita. BMC Genomics, 19(1), 583.
10.1186/s12864-018-4979-0
Wu, S., Guo, B., Zhang, L., Zhu, X., Zhao, P., Deng, J. et al. (2022) A micropeptide XBP1SBM encoded by lncRNA promotes angiogenesis and metastasis of TNBC via XBP1s pathway. Oncogene, 41(15), 2163–2172.
10.1038/s41388-022-02229-6
Xu, Q., Xu, X., Shi, Y., Qi, X. & Chen, X. (2017) Elucidation of the molecular responses of a cucumber segment substitution line carrying Pm5.1 and its recurrent parent triggered by powdery mildew by comparative transcriptome profiling. BMC Genomics, 18(1), 21.
10.1186/s12864-016-3438-z
Zhang, X., Zhao, H., Gao, S., Wang, W.-C., Katiyar-Agarwal, S., Huang, H.-D. et al. (2011) Arabidopsis argonaute 2 regulates innate immunity via miRNA393-mediated silencing of a Golgi-localized SNARE gene, MEMB12. Molecular Cell, 42(3), 356–366.
10.1016/j.molcel.2011.04.010
Zhou, Z., Dang, Y., Zhou, M., Li, L., Yu, C.-H., Fu, J. et al. (2016) Codon usage is an important determinant of gene expression levels largely through its effects on transcription. Proceedings of the National Academy of Sciences, 113(41), E6117–E6125.
10.1073/pnas.1606724113

Volume 47, Issue 12

December 2024

Pages 5330-5342

Filename	Size	Description
pce15104-sup-0001-Additional_file_1.txt	3.6 KB	Supporting information.
pce15104-sup-0002-Additional_file_2.txt	72.5 MB	Supporting information.
pce15104-sup-0003-Additional_file_3.txt	71.6 MB	Supporting information.
pce15104-sup-0004-Additional_file_4.txt	1,011.5 KB	Supporting information.
pce15104-sup-0005-Additional_file_5.txt	278.3 KB	Supporting information.
pce15104-sup-0006-Additional_file_6.txt	59.8 MB	Supporting information.
pce15104-sup-0007-Additional_file_7.docx	15 KB	Supporting information.
pce15104-sup-0008-Additional_file_8.xlsx	7.6 KB	Supporting information.
pce15104-sup-0009-Additional_file_9.xlsx	629.1 KB	Supporting information.
pce15104-sup-0010-Additional_file_10.xlsx	372.2 KB	Supporting information.
pce15104-sup-0011-Additional_file_11.xlsx	13.8 KB	Supporting information.
pce15104-sup-0012-Additional_file_12.xlsx	14.1 KB	Supporting information.
pce15104-sup-0013-Additional_file_13.csv	3.8 KB	Supporting information.
pce15104-sup-0014-Additional_file_14.csv	4.6 KB	Supporting information.
pce15104-sup-0015-Additional_file_15.csv	3.1 KB	Supporting information.
pce15104-sup-0016-Additional_file_16.csv	3.1 KB	Supporting information.

In-silico identification of putatively functional intergenic small open reading frames in the cucumber genome and their predicted response to biotic and abiotic stresses