Volume 43, Issue 2 pp. 205-212

Orphan transcripts in Arabidopsis thaliana: identification of several hundred previously unrecognized genes

Diego Mauricio Riaño-Pachón

Department of Molecular Biology, Institute for Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 25, Haus 20, D-14476 Golm/Potsdam, Germany, and

Cooperative Research Group, Max-Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Golm/Potsdam, Germany

Ingo Dreyer

Department of Molecular Biology, Institute for Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 25, Haus 20, D-14476 Golm/Potsdam, Germany, and

Cooperative Research Group, Max-Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Golm/Potsdam, Germany

Bernd Mueller-Roeber

Corresponding Author

Department of Molecular Biology, Institute for Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 25, Haus 20, D-14476 Golm/Potsdam, Germany, and

Cooperative Research Group, Max-Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Golm/Potsdam, Germany

For correspondence (fax +49 331 977 2512; e-mail: [email protected]).
First published: 10 June 2005

Summary

Expressed sequence tags (ESTs) represent a huge resource for the discovery of previously unknown genetic information and functional genome assignment. In this study we screened a collection of 178 292 ESTs from Arabidopsis thaliana by testing them against previously annotated genes of the Arabidopsis genome. We identified several hundred new transcripts that match the Arabidopsis genome at previously unassigned loci. The transcriptional activity of these loci was independently confirmed by comparison with the Salk Whole Genome Array Data. To a large extent, the newly identified transcriptionally active genomic regions do not encode ‘classic’ proteins, but instead generate non-coding RNAs and/or small peptide-coding RNAs of presently unknown biological function. More than 560 transcripts identified in this study are not represented by the Affymetrix GeneChip arrays currently widely used for expression profiling in A. thaliana. Our data strongly support the hypothesis that numerous previously unknown genes exist in the Arabidopsis genome.

Introduction

Life relies on the activity of thousands of genes and their gene products (proteins, rRNAs, tRNAs, microRNAs and other non-coding RNAs) that coordinately and dynamically interact at the cellular and whole-organism level to establish developmental and physiological processes. Today, the genome composition of an appreciable number of organisms is known, including more than 150 bacteria and archaea, as well as several uni- and multicellular eukaryotes. A full understanding of the genome's complexity and activity requires that all gene products of a genome are identified and assigned with respect to their biological function. Therefore, the large-scale identification of expressed genes has been a major task for researchers in recent years. For more than a decade (Adams et al., 1991) expressed sequence tags (ESTs) have been obtained from a large number of organisms and tissues, stored in public and private databases, and often employed as a source for gene identification. Because ESTs are derived from cDNA (and hence mRNA) populations, the vast majority of them code for proteins. In addition, it is now also evident that mRNA species exist that encode relatively short peptides that may exert important biological functions (Cock and McCormick, 2001; Olsen et al., 2002; Wen et al., 2004). One can therefore expect that a certain fraction of available ESTs code for such small peptides. Because of experimental design, cDNA collections may also contain molecules that are not derived from mRNAs, but rather from non-coding RNAs, including rRNAs, tRNAs and others.

Recently, a DNA microarray-based technology has made it possible to measure the transcriptional activity of a plant's complete genome (Whole Genome Array, Yamada et al., 2003). The data obtained with such a genome tiling array provide an unbiased measurement of the transcriptional activity of each individual genomic region, even if it is not yet annotated as a functional gene. However, whole-genome tiling arrays are currently not commercially available, which precludes their broad use in most research laboratories. Besides whole-genome tiling arrays, the analysis of existing EST data may assist in the identification of previously unrecognized transcribed genomic regions. Because genome annotation often relies on the presence of relatively long open reading frames (and hence protein-coding sequences) derived from cDNA/EST sequences, non-coding gene products and peptide-coding genes are often overlooked. Here, we specifically screened for ESTs that could be assigned to their genomic counterparts but for which no underlying genes have been described before. Prior to this study, different groups employed EST data to refine the annotation of the Arabidopsis thaliana genome (Wortman et al., 2003; Zhu et al., 2003). These groups concentrated their efforts on the annotation of protein-coding genes, and did not analyse non-annotated transcribed regions to a large extent. Here, we describe a systematic search for transcribed regions of the Arabidopsis genome which have not been annotated before. Information on these ‘orphan’ transcripts is provided in an easily browsable database.

Results and discussion

Transcriptional activity of new locations of the Arabidopsis genome

Expressed sequence tags are generally derived from transcribed mRNA and, hence, represent the fraction of the genome that is transcriptionally active. To extract and analyse ESTs for which no annotation in the Arabidopsis genome has been reported before, we employed the following step-wise screening protocol (Figure 1).

Figure 1. Flow scheme to illustrate the identification of orphan expressed sequence tags (ESTs). Computational processes are represented by shadowed rectangles, sequences by round boxes. Details are given in the text.

The complete A. thaliana EST collection was downloaded from The Arabidopsis Information Resource (TAIR, http://www.arabidopsis.org; accessed on 15 February 2003; 178 292 sequences). A first inspection revealed that 195 of these sequences were duplicated entries; consequently they were removed from the data set (Figure 1(i)). The remaining sequences (178 097 ESTs) were clustered by means of the stackPACK™ system (Miller et al., 1999). Clustering groups and aligns overlapping ESTs into contigs that are represented as consensus sequences. At the end of this step 148 666 ESTs were clustered in 20 717 contigs, while 29 431 ESTs remained as single sequences (Figure 1(ii)). To identify among these 50 148 sequences those ESTs and contigs that are represented by known and already annotated genes, a series of blast searches was performed against several databases using various search criteria (Table 1). These combined approaches linked 46 205 of the 50 148 sequences to known annotated genes (Figure 1(iii)). These sequences were discarded from the data set. The remaining 3943 sequences were analysed further in a two-step process in order to find their chromosomal locations within the Arabidopsis genome. First, the preliminary chromosomal coordinates of each sequence were determined by performing a blast search against chromosomal sequences downloaded from The Institute for Genomic Research website (TIGR, http://www.tigr.org, release: January 2004). An alignment of at least 50% of the length of the query sequence against a chromosomal region with an e-value smaller than 10⁻⁴ was arbitrarily chosen as the cut-off criterion for locating a sequence on the chromosome. Subsequently, for positive hits, the genomic region reported by blast was extended by 2000 bases on each side.
The expanded region was extracted from the chromosome-sequence database using the program ‘extractseq’ from the EMBOSS package (Rice et al., 2000), and the extracted region was then aligned to the corresponding EST using the program ‘est2genome’. This program aligns spliced DNA (e.g. EST) to non-spliced DNA, allowing the recognition of introns of arbitrary length and the detection of splice sites (Mott, 1997). This strategy, however, failed to locate all sequences on the five Arabidopsis chromosomes. Therefore, to pinpoint the location of the remaining sequences on the chromosomes, an alternative three-step strategy was applied. In the first step, a modified blast search against the chromosomal sequences was conducted, collecting all hits with an e-value smaller than 10⁻³. In the second step, all hits for a given sequence that were located on both the same chromosome and the same DNA strand were joined, and the maximum and minimum chromosome locations reported by blast were extracted. In the third step, the entire region flanked by these coordinates, expanded by 2000 bases on each side, was extracted from the chromosome and used for an alignment with the expressed sequences employing ‘est2genome’ (Figure 1(iv)). Collectively, 2305 previously unmatched sequences were precisely located on the chromosomes. The remaining 1638 sequences that could not be localized accurately were excluded from further analyses.
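
The second and third steps of the alternative strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record layout, the function name `candidate_regions` and the clamping to chromosome boundaries are hypothetical and are not taken from the original Perl scripts. Only the e-value cut-off of 10⁻³ and the 2000-base flank come from the text.

```python
from collections import defaultdict
from typing import NamedTuple

FLANK = 2000  # bases added on each side, as described in the text


class Hit(NamedTuple):
    """One blast hit of a query against a chromosome (hypothetical record layout)."""
    query: str
    chrom: str
    strand: str   # "+" or "-"
    start: int    # 1-based genomic coordinate, start <= end
    end: int
    evalue: float


def candidate_regions(hits, chrom_lengths, max_evalue=1e-3, flank=FLANK):
    """Join hits per (query, chromosome, strand) and return padded regions."""
    grouped = defaultdict(list)
    for h in hits:
        if h.evalue < max_evalue:
            grouped[(h.query, h.chrom, h.strand)].append(h)

    regions = {}
    for key, group in grouped.items():
        chrom = key[1]
        lowest = min(h.start for h in group)
        highest = max(h.end for h in group)
        # Pad the joined span; clamping to the chromosome ends is an assumption.
        regions[key] = (max(1, lowest - flank),
                        min(chrom_lengths[chrom], highest + flank))
    return regions
```

Each returned region would then be extracted from the chromosome sequence and aligned to the expressed sequence, which in the study was done with ‘extractseq’ and ‘est2genome’.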

Table 1. Databases and criteria used to perform blast searches. The first and second series of searches were performed as described by the Schroeder lab (http://www-biology.ucsd.edu/labs/schroeder/howandwhy.html), initially used to match Affymetrix probes to the Arabidopsis thaliana genome. Search criteria for series 3–7 were established in-house and employed softer, but still reliable, criteria to identify remaining sequences. The sixth and seventh searches differ in that the former needs an alignment with more than 70% of amino acid identity, while the latter requires at least 70% conserved amino acids; conservation is evaluated using the substitution matrix BLOSUM62.
Series | Databases searched against | Criteria to consider a hit as identified | Identified | Remaining, not identified
1 | Predicted coding sequences from TIGRᵃ; cDNA Salk Collection; Gene sequences, including introns and UTRsᵃ | Bases aligned ≥100; Identity >98% | 37258 | 12890
2 | Predicted coding sequences from TIGR; Gene sequences, including introns and UTRs | Bases aligned ≥50; Identity = 100% | 1122 | 11768
3 | Same as in 1 | Bases aligned ≥300; Identity >96% | 1992 | 9776
4 | Same as in 1 | Aligned ≥50% of sequence; Identity >70%; e-value <0.001 | 5267 | 4509
5 | Predicted coding sequences from chloroplasts; Predicted coding sequences from mitochondria | Same as in 4 | 70 | 4439
6 | UniProtᵇ | Same as in 4 | 362 | 4077
7 | Same as in 6 | Aligned ≥50% of sequence; Conservation >70%; e-value <0.001 | 134 | 3943
Total | | | 46205 |
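
The nucleotide-level thresholds of series 1–4 in Table 1 can be read as an ordered cascade in which a sequence counts as identified at the first series whose criteria it meets. The following Python sketch illustrates that reading. It is a simplification: in the study each series was a separate blast run against its own databases, whereas this sketch evaluates one alignment summary against all four threshold sets. The `Alignment` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Summary of the best hit for one query (hypothetical fields)."""
    bases_aligned: int
    identity: float      # percent identity
    query_length: int
    evalue: float


# Thresholds for series 1-4 as listed in Table 1, applied in order.
SERIES = [
    (1, lambda a: a.bases_aligned >= 100 and a.identity > 98),
    (2, lambda a: a.bases_aligned >= 50 and a.identity == 100),
    (3, lambda a: a.bases_aligned >= 300 and a.identity > 96),
    (4, lambda a: a.bases_aligned >= 0.5 * a.query_length
                  and a.identity > 70 and a.evalue < 0.001),
]


def first_matching_series(alignment):
    """Return the number of the first series whose criteria hold, else None."""
    for number, accepts in SERIES:
        if accepts(alignment):
            return number
    return None
```

A sequence for which the function returns None would be carried forward to the next round of searches, in the same way that the ‘Remaining, not identified’ column shrinks from series to series.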

The genomic counterparts of the 2305 sequences were extracted, and used for further studies. The coordinates of each sequence were compared against chromosome tables obtained from XML files from the TIGR website (http://www.tigr.org, release: January 2004), which contain the chromosomal coordinates of each annotated gene. This comparison identified regions with more than 10% overlap between expressed sequences and known genes in 1707 of the 2305 cases (Figure 1(v)). These sequences were considered as identified transcripts of known genes and consequently excluded from further analysis. The set of the remaining 598 sequences is redundant in the sense that some sequences are subsequences of others, and that there are overlapping pairs which represent the 3′ and 5′ ends of one single transcript. After merging (Figure 1(vi)), 558 transcripts remained. Considering that some of these sequences represent more than one EST (because some ESTs were joined into clusters by the stackPACK™ system), 879 original ESTs remained, representing 0.49% of the original set of EST sequences (178 292).
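
The overlap test against annotated genes can be sketched as follows. The text does not say what the 10% refers to, so this sketch assumes it is the fraction of the expressed sequence's genomic span that is covered by a gene. The function names and the tuple layouts are hypothetical.

```python
def overlap_fraction(seq_start, seq_end, gene_start, gene_end):
    """Fraction of the expressed sequence's span covered by the gene.

    Coordinates are 1-based and inclusive, with start <= end.
    """
    shared = min(seq_end, gene_end) - max(seq_start, gene_start) + 1
    return max(0, shared) / (seq_end - seq_start + 1)


def overlaps_known_gene(seq, genes, threshold=0.10):
    """True if any annotated gene on the same chromosome exceeds the threshold.

    seq is (chromosome, start, end); genes is an iterable of the same layout.
    """
    chrom, start, end = seq
    return any(
        g_chrom == chrom
        and overlap_fraction(start, end, g_start, g_end) > threshold
        for g_chrom, g_start, g_end in genes
    )
```

Under this reading, a sequence for which `overlaps_known_gene` is true would be treated as a transcript of a known gene and excluded.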

Thirty of the 558 sequences were mapped to more than one location of the Arabidopsis genome: 28 sequences were assigned to two genomic loci, one sequence was assigned to three loci, and one sequence was assigned to four loci. With one of these locations the similarity was always close to 100%, while it was more divergent (78 ± 12% identity) when compared with the other loci, strongly suggesting that these secondary assignments correspond to duplicated genes (for an example see Figure 2). In total, 591 loci were discovered (Figure 1(vii)), matching 558 previously unassigned transcripts. In the following, we concentrated our analysis on this set of genomic regions, and called them orphan transcripts, abbreviated as At_oRNA_xxx (with xxx representing any number between 001 and 591). We established a database, AtoRNADB, available via the World Wide Web at http://atornadb.bio.uni-potsdam.de/, to display sequences and sequence features of the orphan transcripts, alignments of original ESTs with their genomic source counterparts and chromosomal positions of the orphan transcripts. Additional information describing features of the orphan transcripts is provided in the Supplementary Material S1–S10, which is available in the online version of the journal.

Figure 2. The sequence GI:9784288 was assigned to two different genomic loci. Polydotplot of EST GI:9784288 and its corresponding genomic loci (At_oRNA_390 and At_oRNA_391). A line represents the region of similarity between the two compared sequences.

The percentage of orphan transcripts represented by a single EST was 69.5% of the whole pool. Orphan transcripts detected by two, three, four and five ESTs accounted for 18.1, 6.8, 2.2 and 0.8% of the pool, respectively. The remaining orphan transcripts (2.6%) were represented by more than five (up to 27) ESTs. Therefore, most of the identified transcripts arose from loci that are expressed at a relatively weak to moderate level. Evidence for transcription was also obtained by comparison with the data generated by the Arabidopsis Massively Parallel Signature Sequencing (MPSS) program (http://mpss.udel.edu/at; downloaded November 2004). MPSS generates short sequence signatures for a defined position within an mRNA. The abundance of these signatures in a given cDNA library indicates the expression level of that gene. Only 35 orphan transcripts produced hits against the collection of 17-bp long MPSS tags, consistent with a relatively low expression level of the identified loci, or their cell-specific expression patterns. However, it cannot be excluded that some of the orphan transcripts represent only a fragment of the transcriptional unit they originated from. Therefore, we extended the sequence of each orphan transcript over 500 bp at its 3′-end, and compared this enlarged unit with the MPSS signatures. This approach allowed us to assign an MPSS signature to 341 orphan transcripts. Among these 341 MPSS signatures 164 (48.1%) have not been assigned to any known gene in A. thaliana. In 130 additional cases the orphan transcript and the associated annotated gene carrying the MPSS signature were on opposite strands, suggesting that the nearest MPSS signature did not belong to the orphan transcript. In 10 cases the MPSS signatures found were assigned to an annotated gene on a different chromosome. In this context it should be noted that in a related recent study MPSS data were used to identify unknown genes in Arabidopsis (Meyers et al., 2004). The analyses by Meyers et al. (2004) and the data presented in our study clearly indicate the presence of orphan transcripts and therefore the existence of several overlooked genes in the Arabidopsis genome. The lack of a complete overlap of the orphan transcripts detected in the present study with the MPSS data set illustrates the complementary character of the two different approaches (using EST and MPSS data).
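
The 3′ extension followed by the signature comparison can be illustrated with a short Python sketch. It assumes exact string matching of 17-bp signatures against the extended region, taken in the transcript's own orientation. The function names and the coordinate convention (1-based, inclusive) are hypothetical, and the handling of the minus strand is an assumption about how such an extension would be done.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extended_transcript(chrom_seq, start, end, strand, extension=500):
    """Return the transcribed region extended at its 3' end.

    The result is given in the orientation of the transcript.
    """
    if strand == "+":
        return chrom_seq[start - 1 : min(len(chrom_seq), end + extension)]
    # On the minus strand the 3' end lies at the lower genomic coordinate.
    region = chrom_seq[max(0, start - 1 - extension) : end]
    return reverse_complement(region)


def matching_signatures(transcript_seq, signatures):
    """Signatures that occur exactly in the (extended) transcript sequence."""
    return [s for s in signatures if s in transcript_seq]
```

Calling `matching_signatures` first on the unextended region and then on the extended one mirrors the two comparisons described above.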

The 591 sequences are evenly distributed along the five Arabidopsis chromosomes, with the exception of the centromeric regions and a high peak on chromosome 2, at the short arm close to the centromere. The alignment of 121 of the 591 sequences required the prediction of an intron to obtain a reliable alignment with the genome, showing that these transcripts were processed after transcription. Additionally, only nine orphan transcripts are highly conserved between A. thaliana and Oryza sativa (rice), including At_oRNA_141 (see below). Thirty-two additional transcripts are weakly conserved between the two species (Supplementary Material S10).

Experimental confirmation of transcriptional activity

Inspection of the Salk Whole Genome Array data (WGA; Yamada et al., 2003) allowed us to independently verify the transcriptional activity of 587 of the 591 identified genomic locations. The WGA repository contains information about the transcriptional activity of the whole Arabidopsis genome in five different experimental conditions: (i) light-grown seedlings, (ii) anthers, (iii) flowers (mixed stages), (iv) roots and (v) suspension cell culture. Some of the orphan transcripts (e.g. At_oRNA_115 and At_oRNA_534) showed a relatively high expression level in several experiments assayed with the WGA. Some others were preferentially expressed only in some ‘experiments’ (i.e. tissues): At_oRNA_311, for example, showed a higher hybridization signal in light-grown seedlings and roots, than in suspension cell culture, anthers and flowers (mixed stages). Four orphan transcripts lacked any hybridization signal in the WGA data. However, for two of them (At_oRNA_466 and At_oRNA_565) other supporting evidence for transcriptional activity (cDNA or MPSS data) was obtained. Of the remaining two orphan transcripts one (At_oRNA_467) is a predicted protein-coding RNA (see below), and the other (At_oRNA_566) corresponds to a known bacterial vector sequence. Surprisingly, our analysis also identified two further vector sequences in the Arabidopsis genome (At_oRNA_141 and At_oRNA_413), which, according to WGA data, are transcribed.

Thus, for at least 99.6% of the 591 unassigned genomic loci further independent experimental evidence for transcriptional activity was obtained.

A subset of sequences represents known ncRNAs

To further characterize the orphan transcripts, a series of blast searches against different databases was conducted. Details of the results are provided in Supplementary Material S1. In the following we present a short summary of the main results (Tables 2 and 3): (i) As mentioned above, three sequences had perfect matches with bacterial vector sequences which might be regarded as an error in the assembled Arabidopsis genomic sequence. (ii) Seven sequences were assigned to ribosomal RNAs in A. thaliana. (iii) One sequence was found to represent a tRNA, and (iv) 21 sequences were correlated with other known non-coding RNAs (ncRNAs). Among them were sequences which were similar to known small nucleolar RNAs (snoRNAs; Brown et al., 2003b); and some of these carried more than one snoRNA (Figure 3; Supplementary Material S2) reflecting the fact that some snoRNAs are processed from polycistronic pre-snoRNAs (Brown et al., 2003a; Leader et al., 1997).

Table 2. Orphan transcripts identified as ncRNAs through blast searches. Summary of the results obtained with blast searches; number of sequences per category and identifiers in AtoRNADB. For details see Supplementary Material S1 and http://atornadb.bio.uni-potsdam.de
Category | Number of seqs | Identifier
Vectors | 3 | At_oRNA_141, At_oRNA_413, At_oRNA_566
rRNA | 7 | At_oRNA_092, At_oRNA_093, At_oRNA_372, At_oRNA_422, At_oRNA_423, At_oRNA_567, At_oRNA_568
tRNA | 1(1)ᵃ | At_oRNA_426, (At_oRNA_425)ᵃ
Other ncRNAs | 21 | At_oRNA_017, At_oRNA_119, At_oRNA_125, At_oRNA_134, At_oRNA_139, At_oRNA_140, At_oRNA_290, At_oRNA_401, At_oRNA_439, At_oRNA_485, At_oRNA_486, At_oRNA_500, At_oRNA_510, At_oRNA_513, At_oRNA_522, At_oRNA_542, At_oRNA_548, At_oRNA_549, At_oRNA_555, At_oRNA_557, At_oRNA_587
  • ᵃThe sequence At_oRNA_425 matched a tRNA in a blast search, but the result could not be confirmed by the program tRNAscan-SE (Lowe and Eddy, 1997). At_oRNA_425 and At_oRNA_426 are two genomic assignments for the same transcript. Therefore, it is possible that At_oRNA_425 is a non-functional version of the gene represented by At_oRNA_426.

Table 3. Orphan transcripts identified as ncRNAs through Rfam searches. The data set of orphan transcripts was searched against a collection of covariance models (Rfam; http://www.sanger.ac.uk/Software/Rfam/). The cut-off parameters used to decide a significant match were those employed to build the corresponding RNA family, as reported for each model in Rfam
Sequence | Model name | Model number | Rfam 6.0 score
At_oRNA_119 | U14 | RF00016 | 66.08
At_oRNA_290ᵃ | Intron_gpII | RF00029 | 45.34
At_oRNA_401 | U25 | RF00054 | 49.76
At_oRNA_422 | SSU_rRNA_5 | RF00177 | 447.29
At_oRNA_423 | SSU_rRNA_5 | RF00177 | 448.15
At_oRNA_426 | tRNA | RF00005 | 57.32
At_oRNA_439 | U14 | RF00016 | 62.67
At_oRNA_500 | snoR37 | RF00213 | 41.97
At_oRNA_513 | snoZ7 | RF00268 | 37.61
At_oRNA_548 | snoZ223 | RF00135 | 59.85
At_oRNA_549 | snoZ223 | RF00135 | 59.85
At_oRNA_557 | snoZ105 | RF00145 | 34.41
At_oRNA_557 | U15 | RF00067 | 39.42
At_oRNA_587 | snoZ195 | RF00133 | 9.98
  • Rfam searches missed the sequences representing the 28S rRNA, and most of the snoRNAs.
  • ᵃThe sequence At_oRNA_290 was not found with a blast search.

Figure 3. The transcript At_oRNA_587 is a polycistronic pre-snoRNA. It carries four different snoRNAs, all belonging to the class Box C/D according to Brown et al. (2003b).

Besides sequence similarities, the secondary structure of the RNA may also provide useful information to assign orphan transcripts to existing RNA families. Therefore, we compared our data set with structural models for RNA families (Griffiths-Jones et al., 2003), deposited in the Rfam database (Table 3). Rfam confirmed the results obtained by blast searches for 12 of the orphan transcripts. Additionally, one orphan transcript was assigned to an RNA family that had not been assigned before with blast searches. Conversely, some ncRNAs that were identified through blast were not identified by Rfam. Therefore, blast and Rfam searches complemented each other.

The Arabidopsis Small RNA Project (http://cgrb.orst.edu/smallRNA/db/) collects experimental information about small RNAs in A. thaliana. Eleven of the orphan transcripts identified in this work were similar to small RNAs (Supplementary Material S1), indicating that either they are themselves small RNAs, or alternatively that they are targeted by small RNAs. Experimental evidence for small RNAs is increasing rapidly. Therefore, we expect that the number of orphan transcripts matching small RNAs will rise in the near future.

Search for known protein motifs in orphan transcript-encoded polypeptides

In the final step we tested whether any of the orphan transcripts identified in our screen would encode a protein with known protein motifs. Therefore, sequences of orphan transcripts were translated into all six reading frames and the deduced peptides/proteins were scanned by means of the InterProScan program (Zdobnov and Apweiler, 2001) against several pattern and profile databases (see Experimental procedures). Only 42 sequences matched to known protein motifs (Supplementary Material S3). We therefore applied the three gene prediction programs Unveil (Majoros et al., 2003), Genscan (Burge and Karlin, 1997) and GlimmerM (Majoros et al., 2003) to analyse the 591-sequence data set for transcripts exhibiting a sequence bias characteristic of protein-coding genes. We found that 192 of 591 sequences had exons predicted by at least two of the programs. When comparing the results obtained from the protein motif search and the gene prediction programs, all sequences with predicted motifs/domains were found to be present in the set of predicted exons. Additionally, two sequences, for which exons were predicted, were similar to snoRNAs. In one case (At_oRNA_125) the snoRNA appeared to match a sequence predicted to be an intron by the gene prediction programs. In the other case (At_oRNA_513) the snoRNA overlapped with one predicted exon. Both cases are rare in A. thaliana because intronic snoRNAs are not common in this genome (Brown et al., 2003a).
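
The rule that a sequence is counted when exons are predicted by at least two of the three programs amounts to a simple vote. The following Python sketch shows one way to express it, assuming the output of each program has already been reduced to a set of transcript identifiers. The function name and the input layout are hypothetical.

```python
def supported_by_at_least(predictions, minimum=2):
    """Return transcript IDs with predicted exons in at least `minimum` programs.

    predictions maps a program name to the set of transcript IDs for which
    that program predicted at least one exon.
    """
    counts = {}
    for ids in predictions.values():
        for transcript_id in ids:
            counts[transcript_id] = counts.get(transcript_id, 0) + 1
    return {t for t, n in counts.items() if n >= minimum}
```

With three input sets, one each for Unveil, Genscan and GlimmerM, the default `minimum=2` reproduces the ‘at least two of the programs’ criterion.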

Representation of orphan sequences on the Affymetrix chip array

Transcript profiling is often used to study simultaneously the activity status of thousands of genes. A prominent method employs Affymetrix GeneChip arrays. The Arabidopsis AtGenome1 Array was the first-generation Arabidopsis array which measured the relative transcript levels of approximately 8300 genes. The more recent ATH1 Arabidopsis Genome Array contains more than 22 500 so-called probe sets representing approximately 24 000 Arabidopsis genes (Liu et al., 2003). We were interested to know how many of the orphan transcripts identified here are represented by the arrays. Our analysis found that only 18 of 591 sequences (3%) are represented by chip AtGenome1, 10 of them matching five or more probes (Supplementary Material S4), and only five of the 591 sequences (<1%) are represented by the ATH1 Genome Array; none by more than three probes. Thus, more than 580 transcribed genomic regions are not yet served by the Affymetrix ATH1 Genome Array.

Conclusions

The Arabidopsis genome contains expressed loci which have been almost totally overlooked so far. In this study, 591 expressed, unrecognized genomic loci were identified. Twenty-nine of these produce non-coding RNAs of different classes, and 192 transcripts have characteristics typical of protein-coding genes. The remaining 369 transcripts do not have matches to any known gene, neither protein-coding nor non-coding. The many overlooked Arabidopsis genes that we discovered here may provide a fertile ground for further experimental analyses. In this context, the set of Arabidopsis non-coding RNAs may be of particular interest. Our data analysis provides independent proof for the presence of a large number of transcribed regions of the Arabidopsis genome and confirms recent experimental results obtained by testing the Arabidopsis transcriptional activity on a genome-wide scale, using whole-genome tiling arrays (Yamada et al., 2003). In addition, the EST data provide information about the minimal length of the transcribed regions, which apparently is more difficult to extract from tiling arrays alone. They also deliver information about exon–intron structures and the precise fusion of exons in the mature transcripts. Importantly, the vast majority of transcriptionally active genes discovered here on the basis of EST data do not overlap with the transcriptionally active genomic loci identified using MPSS (Meyers et al., 2004), and most likely do not include micro-RNA genes. Hence, the transcriptionally active regions of the Arabidopsis genome discovered so far almost certainly do not cover the complete set of active genes in the plant. Future work has to include (i) the analysis of the transcriptional activity in highly differentiated (but often under-represented) cell types, and (ii) the search for transcripts that typically do not harbour poly-(A)-tails at their 3′-ends to include those originating from atypical transcriptional activities.

Recent data indicate that large portions of human chromosomes 21 and 22 are transcribed into non-coding RNAs (Kampa et al., 2004; Kapranov et al., 2002), and Sémon and Duret (2004) provided evidence that functional transcription units cover at least half of the human genome. Also, analysis of the mouse transcriptome, based on the functional annotation of more than 60 770 cDNAs, revealed a large number of non-coding messages (Okazaki et al., 2002). The functional role of the vast majority of the transcripts remains almost totally unexplored at the current stage, but interestingly, non-coding RNAs are emerging as a rapidly growing class of regulatory transcripts important for a number of biological processes, including translational control, abiotic and biotic signalling, differentiation and others (Erdmann et al., 2001; Hüttenhofer et al., 2002).

Experimental procedures

All procedures were run on an Intel Pentium 4 computer running SuSE Linux 8.1. The stackPACK™ system (Miller et al., 1999) was employed to cluster sequences based on overlapping regions and to obtain consensus sequences (for a description of the clustering procedure see the stackPACK™ Manual). blast databases were created and formatted locally, and standalone blast programs were run. Results were filtered by means of Perl scripts specially written for this purpose, using BioPerl modules (http://www.bioperl.org). Rfam database files (Griffiths-Jones et al., 2003), release 6.0, were downloaded from http://www.sanger.ac.uk/Software/Rfam/ftp.shtml. Rfam is a collection of Stochastic Context Free Grammars (SCFG) of RNA families. Before the SCFG search, a blast search for each orphan transcript against the sequences belonging to each SCFG was carried out. This allowed us to perform the SCFG search only with those models which had hits with orphan transcripts. The SCFG search was performed employing the program INFERNAL (Eddy, 2002), using as thresholds the values found in the threshold file from the Rfam database. InterProScan (Zdobnov and Apweiler, 2001) was employed to find known protein motifs in orphan transcripts computationally translated in all possible reading frames. Six-frame translated orphan transcripts were scanned against all InterPro member databases (http://www.ebi.ac.uk/InterProScan). Orphan transcripts were analysed with three different gene prediction programs, Unveil (Majoros et al., 2003), Genscan (Burge and Karlin, 1997) and GlimmerM (Majoros et al., 2003). All three programs were prepared to detect Arabidopsis genes, with the training sets provided by the authors of the programs.
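
The blast pre-filter that precedes the SCFG search can be sketched as a two-stage selection. This Python fragment is an illustration only: it does not call blast or INFERNAL, the function names and data layouts are hypothetical, and it assumes that a match is significant when its score reaches the family-specific threshold.

```python
def plan_model_searches(blast_hits_by_family):
    """List the (transcript, family) pairs worth a covariance-model search.

    blast_hits_by_family maps a family identifier to the transcript IDs that
    had a blast hit against member sequences of that family.
    """
    return sorted(
        {(t, family) for family, ts in blast_hits_by_family.items() for t in ts}
    )


def significant_matches(scores, thresholds):
    """Keep pairs whose model score reaches the threshold of their family.

    scores maps (transcript, family) to a score; thresholds maps family to
    its cut-off value.
    """
    return {pair: s for pair, s in scores.items() if s >= thresholds[pair[1]]}
```

Restricting the model search to the pairs returned by `plan_model_searches` is what avoids running every model against every orphan transcript.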

Acknowledgements

Financial support was provided by the ‘Ministerium für Wissenschaft, Forschung und Kultur’ (MWFK) of the State Brandenburg, Germany, and the Interdisciplinary Research Centre ‘Advanced Protein Technologies’ (IZ-APT) of the University of Potsdam. Bernd Mueller-Roeber thanks the Fonds der Chemischen Industrie for financial support (no. 0164389). The authors are grateful to Dr Roy O'Mahony for comments on the manuscript and to three anonymous reviewers for helpful comments.
