Overlapping antisense transcription in the human genome
Abstract
Accumulating evidence indicates an important role for non-coding RNA molecules in eukaryotic cell regulation. A small number of coding and non-coding overlapping antisense transcripts (OATs) in eukaryotes have been reported, some of which regulate expression of the corresponding sense transcript. The prevalence of this phenomenon is unknown, but there may be an enrichment of such transcripts at imprinted gene loci. Taking a bioinformatics approach, we systematically searched a human mRNA database (RefSeq) for complementary regions that might facilitate pairing with other transcripts. We report 56 pairs of overlapping transcripts, in which both members of the pair are transcribed from the same locus. From this we estimate that the human genome contains a minimum of approximately 1000 such transcript pairs. This is a surprisingly large number of overlapping gene pairs and, clearly, some of the overlaps may not be functionally significant. Nonetheless, this may indicate an important general role for overlapping antisense control in gene regulation. EST databases were also investigated to assess how often imprinted genes have associated non-coding overlapping antisense transcripts. However, EST databases were found to be unsuitable for this purpose. Copyright © 2002 John Wiley & Sons, Ltd.
Introduction
There is accumulating evidence that a large number of structurally and functionally diverse non-coding RNA (ncRNA) molecules are produced in the eukaryotic cell (Eddy, 2001; Mattick, 2001). The hitherto unsuspected complexity of RNA-based gene regulatory mechanisms presents a considerable technical challenge to both bioinformaticists and molecular biologists. Most of the transcripts do not currently have well-defined structural features and may not be represented, or may be dismissed as cloning artifacts, in gene expression libraries. For example, functional ncRNA occurs in sizes ranging from 21 nucleotides for ‘small temporal RNA’ (stRNA) and double-stranded small interfering RNA (siRNA) (Pasquinelli et al., 2000; Harborth et al., 2001), to greater than 40 kb for overlapping antisense or intergenic transcripts associated with chromatin remodeling at imprinted gene loci such as IGF2R and XIST (Wutz et al., 1997; Lee et al., 1999), and at developmentally regulated, non-imprinted loci such as α globin (Gribnau et al., 2000). Moreover, it is unclear whether a low level of transcription of certain ncRNAs is functionally significant, or whether it merely represents ‘illegitimate’ or ‘leaky’ transcription from cryptic promoters.
These difficulties notwithstanding, evidence that ncRNAs make a significant contribution to eukaryotic cell function comes from a variety of sources. Established work indicates that, in addition to functional intronic RNA, ribosomal RNA (rRNA), transfer RNA (tRNA), and the 5′ and 3′ untranslated regions (UTR) of messenger RNA (mRNA), short ncRNAs are also integral components of major nuclear catalytic complexes, for example, the small nuclear RNAs (snRNAs) of the spliceosome (Valadkhan and Manley, 2001), and telomerase RNA (Lukowiak et al., 2001). In addition, a wide variety of transcriptional and translational regulatory mechanisms have been either described or proposed that involve the base-pairing of complementary RNA molecules produced either in cis or in trans. These include the small nucleolar RNAs (snoRNAs), which modify rRNA and snRNAs (Kiss, 2001), the ‘microRNAs’ (miRNAs) including stRNA and siRNA (Eddy, 2001), and overlapping antisense transcripts (OATs) produced in cis at protein-coding gene loci in mammals (Kumar and Carmichael, 1998; Vanhée-Brossollet and Vaquero, 1998).
Imprinted genes, which are expressed from only one of the parental alleles during mammalian development, comprise a functionally diverse family of developmentally regulated genes with unusual genomic features, such as associated tandem repeats (Neumann et al., 1995) and reduced intronic content (McVean et al., 1996). In addition, there may be an enrichment of OATs at imprinted loci (Moore, 2001). Such imprinted, antisense transcripts may be functionally significant because many are expressed at high levels and are associated with genomic regions implicated in regulating the imprinting mechanism (Moore et al., 1997; Sleutels et al., 2000). However, there are also examples of OATs at non-imprinted loci (Vanhée-Brossollet and Vaquero, 1998). Moreover, the apparent enrichment of such transcripts at imprinted loci may reflect an ascertainment bias, because of the intensive study of the genomic organization and allele-specific expression patterns of these genes relative to non-imprinted genes.
In order to address the question of the functional significance of OATs in the human genome, we sought to estimate the frequency of their occurrence and to delineate their genomic structures through bioinformatics, by using BLASTN to search for sequence complementarities between transcribed gene sequences in the public databases. We were able to place a lower bound of approximately one thousand on the number of OAT pairs in the human genome.
Materials and methods
The RefSeq database (Pruitt et al., 2001), which contains annotated mRNA sequences for 11 015 different human genes (as of January 2001), was used. These are high-quality gene predictions based on a combination of the scientific literature, expressed sequence tag (EST) sequences and automatic predictions of the locations of introns and exons. We downloaded the complete processed mRNA sequences for all genes (ftp://ftp.ncbi.nlm.nih.gov/refseq/). These sequences include the coding regions as well as the 5′ and 3′ untranslated regions (UTRs) of each gene. The BLASTN program (version 2.0.13, Altschul et al., 1990) was then used to compare each sequence in RefSeq to the complementary strand of all the remaining genes. This locates pairs of genes that have, in principle, the ability to form stretches of double-stranded RNA. The threshold E-value was set to 10^-8 to exclude weak matches. This yielded a collection of 1221 high-scoring segment pairs (HSPs). These included matches due to the presence of repeated sequences (e.g. Alu repeats in the UTRs), which were filtered manually using RepeatMasker (A.F.A. Smith and P. Green, RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html). The remaining pairs of sequences were then checked using the LocusLink records from RefSeq and the UCSC human genome browser (http://genome.ucsc.edu/) to locate pairs of overlapping genes that map to the same chromosomal location.
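The filtering step described above can be illustrated with a minimal Python sketch. It assumes that the BLASTN results are available in the standard 12-column tabular format, in which a match to the minus strand of the subject is reported with a subject start coordinate greater than the subject end coordinate; the output format actually used in this study is not stated. The transcript identifiers (`tx_A` and so on) and the function name `antisense_hits` are hypothetical and serve only as an illustration.

```python
from typing import Iterable, List, Tuple

E_VALUE_CUTOFF = 1e-8  # threshold E-value stated in the text


def antisense_hits(lines: Iterable[str],
                   cutoff: float = E_VALUE_CUTOFF) -> List[Tuple[str, str, float]]:
    """Return (query, subject, evalue) for minus-strand hits between different transcripts.

    Assumes 12 tab-separated columns: query id, subject id, % identity, length,
    mismatches, gap opens, q. start, q. end, s. start, s. end, e-value, bit score.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        s_start, s_end = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if query == subject:
            continue  # a transcript matched against itself
        if s_start <= s_end:
            continue  # same orientation, so not an antisense match
        if evalue > cutoff:
            continue  # weaker than the chosen threshold
        hits.append((query, subject, evalue))
    return hits


# Hypothetical example records (not data from this study).
example = [
    "tx_A\ttx_B\t100.00\t177\t0\t0\t2900\t3076\t1500\t1324\t1e-95\t327",
    "tx_A\ttx_C\t98.00\t150\t3\t0\t10\t159\t40\t189\t2e-70\t250",
    "tx_A\ttx_D\t90.00\t30\t3\t0\t5\t34\t80\t51\t0.002\t40",
]
print(antisense_hits(example))  # [('tx_A', 'tx_B', 1e-95)]
```

Only the first record survives: the second is a same-strand match and the third is above the E-value threshold. Repeat filtering and confirmation that both transcripts map to the same locus would still be required, as described in the text.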
In a second series of experiments, the sequences of known imprinted genes from human and mouse were examined for complementary matches against corresponding databases of EST sequences using the Gene2est server at http://www.woody.embl-heidelberg.de/gene2est (Gemund et al., 2001). The list of mouse and human imprinted genes was taken from the Genomic Imprinting Website (http://www.geneimprint.com/). The Gene2est server produces a BLAST output, which was imported into Artemis for visualization of results (Rutherford et al., 2000). In order to check the validity of EST ‘hits’ to the complementary strand, mouse RefSeq was searched with BLAST against mouse EST sequences from GenBank release 124 (June 2001).
Results
The initial 1221 HSPs from the BLASTN searches were reduced to 56 pairs of overlapping genes as described in Materials and Methods (Table 1). As expected under an assumption of random distribution, a large proportion of the transcripts map to the two largest chromosomes, 1 and 2. The majority of overlaps are between the 3′ UTRs of the transcripts (Table 2), with a smaller number located in the 5′ UTRs, or between the 5′ UTR of one transcript and the 3′ UTR of another. The overlaps typically extend over 50–200 nucleotides, in some cases involving the coding region of one transcript (Table 1 and Figure 1).
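The overlap categories used in Tables 1 and 2 (5′, cds, 3′ and their combinations) can be derived from the position of the matched region relative to the coding sequence of each transcript. The following Python sketch shows one way to do this, assuming 1-based inclusive coordinates on the processed mRNA; the function name and the example coordinates are hypothetical and are not taken from the RefSeq records analysed here.

```python
def regions_covered(start: int, end: int, cds_start: int, cds_end: int) -> str:
    """Label which parts of a transcript the matched region [start, end] touches.

    All coordinates are 1-based and inclusive, measured on the processed mRNA.
    """
    labels = []
    if start < cds_start:
        labels.append("5'")   # match extends into the 5' UTR
    if start <= cds_end and end >= cds_start:
        labels.append("cds")  # match intersects the coding sequence
    if end > cds_end:
        labels.append("3'")   # match extends into the 3' UTR
    return " + ".join(labels)


# Hypothetical transcript with its coding sequence at positions 100..900.
print(regions_covered(850, 1026, 100, 900))  # cds + 3'
print(regions_covered(950, 1000, 100, 900))  # 3'
print(regions_covered(10, 50, 100, 900))     # 5'
```

Applying the function to both members of a pair would give a combined label of the kind shown in the Overlap column, for example `cds + 3'` for one transcript and `3'` for the other.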

Figure 1. Schematic diagrams of some overlapping transcripts found in our search. Lines indicate introns, and boxes exons. a) Overlap between FMO1 and a predicted RefSeq entry on chromosome 1. b) Overlap between the last exon and 3′ UTR of MSH6 and the 3′ UTR of VIT1 on the short arm of chromosome 2. c) Overlap between the 3′ UTR of FGF2 and the coding region and 3′ UTR of its previously reported antisense on the long arm of chromosome 4. d) Overlap between two reviewed RefSeq sequence entries, DOM3Z and STK19, which map to 6p21.

Name | Map Position | Overlap | Size of Overlap in Nucleotides | Comment |
---|---|---|---|---|
MUF1 *RAD54L | 1p33 | 3′ | 72 | –leucine repeat –role in DNA recombination and repair |
FLJ20580 **CPT2 | 1p32.3 | 3′ | 94 | –function unknown –enzyme of long-chain fatty acid oxidation |
ATP1B1 [1] *NME7 | 1q24.2 | 3′ | 184 | –beta 1 subunit of Na+/K+-ATPase –kinase, may be involved in the synthesis of CTP, GTP, and UTP |
**FMO1 PRO1257 | 1q24.2 | cds/5′ | 92 | –found in fetal liver, NADPH-dependent flavoenzymes –function unknown |
*TPR **PRG4 | 1q31.1 | 3′ | 199 | –provisional, translocated promoter region (to activated MET oncogene) –reviewed, megakaryocyte stimulating factor secreted proteoglycan |
LOC51611 FLJ20139 | 1p21.1 | cds + 3′/5′ + cds | 429 | –function unknown –function unknown |
*OPN3 *KMO | 1q43 | cds + 3′/3′ | 1095 | –extra-retinal opsin, G-protein linked receptor –catalyzes the hydroxylation of kynurenine to 3-hydroxykynurenine |
KIAA0764 [2] *FLJ10624 | 2p23.2 | 5′ | 112 | –function unknown –solute carrier family 4 (anion exchanger), member 1, adapter protein |
CRIPT LOC51088 [3] | 2p21 | all/3′ | 1063 | –postsynaptic protein –lymphocyte activation-associated protein |
*MSH6 VIT1 | 2p16 | cds + 3′/3′ | 177 | –G/T mismatch-binding protein –function unknown |
*VRK2 FLJ10335 | 2p16.3 | cds + 3′/cds + 3′ | 457 | –serine/threonine kinase –function unknown |
**MTHFD2 *NBC4 | 2p12 | cds + 3′/3′ | 92/92 + 76/76 | –intracellular pH regulation and sodium bicarbonate transport –involved in initiation of mitochondrial protein synthesis |
*SSB [4] HSPC133 | 2q31.1 | cds/3′ | 63 | –binds and stabilizes histone mRNA, acts in maturation of tRNAs –function unknown |
PRO2900 *HDLBP | 2q37.3 | 3′ | 221 | –mediates cholesterol removal –function unknown |
**RAF1 [5] HSPC070 | 3p25.2 | 3′ | 105 | –kinase - phosphorylates substrates involved in regulating apoptosis –function unknown |
**OGG1vi [6] **CAMK1 | 3p25.3 | 3′/3′ | 56/56 | –involved in base excision DNA repair and removal of 8-oxyguanine –involved in Ca(2+)-regulated processes; member of kinase family |
HEMK *LOC51161 | 3p21.31 | 5′ | 64 | –function unknown –g20 protein |
**FGF2 *NUDT6 | 4q26 | 3′/cds + 3′ | 583/587 + 56/56 | –mitogenic, angiogenic, and neurotrophic factor –FGF-AS |
**POLR2B *IGFBP7 | 4q12 | 3′ | 91 | –subunit of RNA polymerase –may bind to and modulate insulin-like growth factor activity |
*ARTS-1 *CAST | 5q15 | 3′ | 132 | –type 1 tumor necrosis factor receptor –an inhibitor protein of calpain |
*LOC51306 *PKD2L2 | 5q31.1 | cds + 3′/cds + 3′ | 263 | –potential nuclear protein - GAP-like protein –member of the polycystin family; expressed in testis |
**DOM-3 **STK19 | 6p21.32 | 5′ + cds/5′ | 124 | –function unknown, ubiquitous expression and conservation in eukaryotes –serine/threonine kinase - function unknown; possibly involved in transcriptional regulation |
*PPP2R5D *MEA | 6p21.1 | 3′ | 111 | –regulatory subunit B of protein phosphatase 2, delta isoform –male-enhanced antigen; may play a role in mammalian spermatogenesis and/or testis development |
LOC51106 *TIAM2 | 6q25.3 | cds/3′ | 300 201 | –function unknown –guanine nucleotide exchange factor |
*ASK *LOC55972 | 7q21.12 | 5′ | 102 | –activator of CDC7 S phase kinase; required for the G1/S transition –mitochondrial carrier family protein |
**AP4M1 *TAF2E [7] | 7q22.1 | 3′ | 101 | –involved in the recognition and sorting of cargo from the trans-golgi network to the endosomal-lysosomal system –component of the TFIID complex; interacts with general transcription factors |
**KDELR2 FLJ20306 | 7p22.1 | 5′ + cds/cds | 105 | –retrieves proteins from the Golgi for retrograde transport to the ER –function unknown |
*MEST *COPG2 | 7q22.1 | 3′ | 48 | –mesoderm specific protein –protein related to gamma-COP; may play a role in cellular vesicle traffic |
FLJ20530 *CCNE2 | 8q22.1 | 3′ | 48 | –function unknown –cyclin E2 - functions specifically during the G1-phase of the cell cycle |
**NPR2 *SMP-1 | 9p13.3 | 3′ | 309 | –guanylate cyclase activity –sperm associated antigen 8 |
**TESK1 *CD72 | 9p13.3 | 3′ | 54 | –serine/threonine kinase - testicular germ cell-specific expression and developmental pattern of expression in mouse –cell surface protein expressed exclusively on B cells; may be involved in control of B cell proliferation |
HT009 **IDI1 | 10p15.3 | 3′/5′ + cds | 173 | –uncharacterized hypothalamus protein –involved in formation of cholesterol |
*RIG *DKK3 | 11p15.2 | 3′ | 295 | –regulated in glioma –dickkopf (Xenopus laevis) homolog 3; related to proteins that antagonize Wnt signaling |
FLJ20539 *PHT2 | 11q12.3 | 3′ | 72 | –function unknown –member of the proton-dependent oligopeptide transport family |
*RAD9 *PPPICA | 11q13.3 | 3′ | 205 | –may function as a cell cycle checkpoint protein –regulates mitosis, putative tumor suppressor |
*IL18BP *NUMA1 | 11q13.4 | 3′ | 668 | –inhibitor of the early Th1 cytokine response –structural component of the nucleus; predicted role in nuclear reassembly |
*PAF65A FLJ11136 [8] | 11q13.1 | cds / 3′ cds 5′ | 780/781 134/134 | –regulates transcription, cell cycle progression, and differentiation –function unknown |
**APAF1 LOC56899 | 12q23.1 | 3′ | 57 | –functions in the mitochondrial apoptotic pathway that leads to caspase 9 dependent activation of caspase 3 –putative 47 kDa protein |
**LRMP PRO1438 | 12p12.1 | 5′/3′ | 249 | –protein is expressed in a developmentally regulated manner in lymphoid cell lines and tissues –function unknown |
KIAA0670 FLJ20671 | 14q11.2 | 5′ | 66 | –function unknown, putative DNA binding motif –function unknown |
*CIDEB *BLTR2 | 14q11.2 | 5′/cds | 743 | –contains the regulatory CIDE-N domain also in apoptotic pathway proteins CAD and ICAD –G protein-coupled receptor, inhibits adenylyl-cyclase, modulates intracellular calcium flux and chemotaxis |
**ARG2 *VTI2 | 14q24.1 | 3′/ cds 3′ | 330 | –catalyzes the hydrolysis of arginine to ornithine and urea - located in the mitochondria and expressed in extra-hepatic tissues, especially kidney –(v-SNARE); functions in vesicle transport pathways |
*TK2 FLJ20006 | 16q22.1 | 5′ + cds/cds | 39 | –generates thymidylate for DNA synthesis –function unknown |
MDDX28 FLJ20399 | 16q22.1 | 5′ | 56 | –mitochondrial DEAD-box polypeptide 28 –contains a double-stranded RNA binding domain |
LOC51031 FLJ10581 | 17p13.3 | 5′ + cds | 587 | –function unknown –member of the RNA methyltransferase family |
FLJ10534 *SRR | 17p13.3 | cds + 3′/3′ | 379/397 142/142 80/80 | –function unknown –catalyses the synthesis of D-serine from L-serine |
*HUMGT198A *MLX | 17q21.2 | 3′ | 319 | –TBP-1 interacting protein –interacts with Mad and represses transcription |
FLJ10055 KIAA1001 | 17q24.2 | 3′ | 539 | –function unknown, contains 2 WD domains –function unknown |
*MIC1 **NPC1 | 18q11.2 | 3′ | 151 | –colon cancer associated protein –integral membrane protein, possible role in cholesterol transport |
*ERCC-1 *ASE-1 | 19q13.3 | 3′ | 112 | –endonuclease –may function in rDNA transcription |
*COL9A3 *TCFL5 | 20q13.33 | 3′ | 42 | –alpha 3 subunit of type IX collagen; may connect fibrils to other matrix elements –may regulate transcription associated with growth and differentiation |
**PPGB *PLTP | 20q13.11 | 3′ | 58 | –encodes a glycoprotein - forms a protective complex with beta-galactosidase and neuraminidase for stability and activity –phospholipid transfer protein; has roles in phospholipid transport |
FLJ10508 *MCM3AP | 21q22.3 | 3′ + cds/cds | 194/194 159/159 | –function unknown –protein binds to the replication protein MCM3 |
*BK126B4.1 CGI-96 | 22q13.2 | cds + 3′/3′ | 435/445 172/178 | –kraken-like, alpha/beta hydrolase fold –function unknown, RNA recognition motif |
*P2RXL1 SLC7A4 | 22q11.21 | 3′ | 112 | –member of P2X family of ion channels –NCBI annotation, cationic amino acid transporter |
*TR **COMT | 22q11.21 | 5′ | 112 | –mitochondrial thioredoxin reductase –involved in the degradation of catecholamine neurotransmitters and catechol drugs; membrane-bound form |

- 1 The 5′ UTR of NME7 overlaps with the 5′ UTR of BLZF1 according to the April freeze of the human draft sequence.
- 2 The 3′ end of KIAA0764 overlaps with a putative ATP binding protein (NTPBP).
- 3 According to the April freeze of the human draft sequence, a full overlap exists with matching exons.
- 4 SSB originates and terminates within an intron of sarcosin, a provisional sequence.
- 5 According to the April freeze of the human draft sequence, HSPC070 maps within an intron of RAF-1 along with two other genes: PPARG, which regulates adipocyte and macrophage gene expression and differentiation, and MGC2776, which encodes a hypothetical protein.
- 6 There are eight alternative transcripts of OGG1, all of which overlap CAMK1. Three of the transcripts overlap the 3′ UTR. The coding regions of the other five transcripts overlap with the CAMK1 coding sequence.
- 7 AP4M1 appears to originate and terminate within an intron of FLJ10925, a predicted RefSeq sequence. The coding sequence of TAF2E is also complementary to an intron of FLJ10925.
- 8 PAF65A, according to the April freeze of the human draft sequence, originates and terminates within an intron of a reviewed entry, STX5A.

Type of overlap | Number of overlaps |
---|---|
3′/3′ | 37 |
5′/5′ | 11 |
3′/5′ | 5 |

The transcripts identified in our search encode proteins with heterogeneous functions in DNA synthesis, cell cycle control and developmental regulation. This diversity might suggest that the occurrence of a DNA sequence overlap between pairs of protein-coding genes is incidental to their genomic location and structure, and of no mechanistic significance. However, some of the overlaps detected by our search have previously been reported in the literature, in either human or other species, and these reports include functional studies that support their mechanistic significance. For example, a 1.5 kb OAT to basic fibroblast growth factor (bFGF, FGF2) has been reported in the oocytes of Xenopus laevis, and the human homologue has been cloned and mapped to the long arm of chromosome 4. In X. laevis, the region of complementarity extends through both the coding region and the 3′ UTR of FGF2, whereas, in the human and rat homologues, complementarity extends only to the 3′ UTR (Figure 1). Expression levels of both the sense and antisense transcripts have been studied to investigate the possibility of antisense regulation of the sense transcript (Li et al., 1996). The developmental pattern of expression of the OAT was found to be inversely correlated with that of the sense transcript in developing rat brain. Expression was also found to be age-dependent, with sense expression increasing postnatally and antisense expression decreasing (Li et al., 1996). Subsequently, it was shown that FGF2 protein levels are directly influenced by the level of the OAT in mammalian cells (Li et al., 2000), suggesting post-transcriptional regulation of FGF2 by the OAT. It has also been shown that this OAT encodes a functional protein with MutT-related enzymatic activity in the rat, and it was noted that the human homologue also contains an open reading frame (Li et al., 1997).
Intron 3 of the mouse thymidine kinase (tk) gene has been reported to contain an antisense promoter, and the associated OAT is thought to regulate expression of the TK protein-encoding sense transcript in mouse fibroblasts (Sutterluety et al., 1998). This salvage pathway enzyme is expressed at low levels in resting mammalian cells, but levels increase dramatically when cells enter S phase. A well-characterized transcriptional mechanism is involved, and a post-transcriptional mechanism is also suspected. The correlation of TK protein repression with OAT expression supports a role for the OAT in regulating TK expression. The 5′ UTR and part of the coding sequence of the human TK homologue found in RefSeq are complementary to a predicted gene of unknown function, indicating the existence of a human homologue of the mouse OAT.
We also found an overlap of 177 nucleotides between the 3′ UTRs of MSH6 and VIT1 at 2p16 (Figure 1), as previously reported. It was suggested that the overlap allows regulation of MSH6 by VIT1 (Le Poole et al., 2000).
Imprinted gene transcripts coding for functional proteins are found in the RefSeq database, but non-coding OATs associated with them are not. For example, the human COPG2 gene has a non-coding OAT at the 3′ end, which was not found in our search. However, COPG2 also overlaps with the imprinted, protein-coding MEST gene over 52 nucleotides at their 3′ ends, as previously reported (Blagitko et al., 1999), and as successfully identified by our search.
In an attempt to identify novel non-coding OATs at imprinted gene loci, a second set of experiments involving a BLASTN search of all known mouse imprinted genes against mouse EST databases found that the majority of imprinted genes had ESTs aligned to both strands. The reverse complement of all gene transcripts in the mouse RefSeq database was also used in a BLASTN search against the same database of mouse ESTs to determine whether the high number of ‘hits’ at imprinted loci occurred as an artefact of the EST database, due to submission of DNA sequence from both strands of cloned, double-stranded cDNA. Out of 7340 entries in mouse RefSeq, 6489 transcripts received hits to the complementary strand. The correct transcriptional orientation of the ESTs aligned to their respective genomic regions could not be assigned unambiguously and therefore the use of EST databases to search for non-coding OATs is unreliable. However, it may be possible in the future to use EST databases consisting exclusively of directionally cloned and sequenced cDNAs to produce an accurate estimate of the frequency of non-coding OATs.
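The scale of the problem is easier to appreciate as a proportion. The short Python calculation below uses only the two counts given above; the variable names are chosen here for illustration.

```python
# Counts reported in the text for the mouse RefSeq versus mouse EST search.
total_transcripts = 7340
with_antisense_hits = 6489

fraction = with_antisense_hits / total_transcripts
print(f"{fraction:.1%}")  # 88.4%
```

Nearly nine in ten transcripts receive at least one EST hit on the complementary strand, far more than the proportion of overlapping pairs found among RefSeq transcripts, which is consistent with the interpretation that most such hits reflect ambiguous EST orientation rather than genuine antisense transcription.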
In a further experiment, we assessed the representation of previously confirmed OATs at imprinted mouse and human gene loci in the public EST databases. We found, unsurprisingly, that the databases are biased towards highly expressed transcripts, which is problematic because some OATs are expressed at low levels and may be tissue-specific (Moore et al., 1997). The mouse insulin-like growth factor 2 (Igf2) gene is an extensively studied imprinted gene with an OAT (Igf2as) at the 5′ end (Moore et al., 1997; Okutsu et al., 2000). This OAT was not detected in our searches, probably due to its low expression level (Figure 2). Moreover, the 5′ ends of genes are underrepresented in EST databases because reverse transcription of mRNA is frequently initiated from the 3′ end using a poly(T) primer. Therefore, in BLASTN searches against ESTs, more ‘hits’ are expected at the 3′ end of the gene (Figure 2). It is also evident that there are many ‘hits’ on the opposite DNA strand to that predicted from the structure of the Igf2as gene, further undermining the reliability of such EST-based searches.

Figure 2. Schematic of mouse Igf2 transcripts. Exons are shown as boxes and ESTs aligning to both strands are marked by arrows. EST datasets are biased towards 3′ ends of genes and their orientation with respect to the genomic locus is uncertain. Such bias inhibited the unambiguous validation of non-coding antisense transcripts. Igf2 is one of the most extensively studied imprinted genes with a well-characterised antisense transcript at the 5′ end which was not detected by a search of EST databases. More than 300 ESTs matched both strands at the 3′ end of the gene, as indicated by the thickness of the arrows, with slightly more aligning to the top strand (coding for Igf2) than the lower strand. Relatively few aligned to the 5′ end of the gene.
Discussion
We found 56 pairs of overlapping transcripts among the 11 015 protein-coding transcripts in RefSeq. On the conservative assumption that RefSeq contains one quarter of all protein-coding transcripts, a given pair is detected only if both of its members are present in the database, so the chance of detection scales as the square of the sampled fraction. We can therefore estimate that there are 4 × 4 × 56 = 896 OAT pairs in the human genome. However, this is likely to be an underestimate because RefSeq does not contain non-coding transcripts, which occur frequently at imprinted loci, and also at non-imprinted loci, but at an unknown frequency. In this study, we show that EST data are unsuitable for investigating non-coding OATs due to the biased nature of the current databases. An accurate estimation of OAT pairs consisting of one or two non-coding transcripts will require either laboratory-based approaches or customized gene expression databases that circumvent the problems associated with the current EST databases.
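The extrapolation can be written out as a short Python sketch. It rests on one reading of the 4 × 4 factor, namely that each member of a pair has an independent chance of being present in the database equal to the sampled fraction, so that a pair is observed with probability equal to that fraction squared. The function name is hypothetical.

```python
def estimate_total_pairs(observed_pairs: int, sampled_fraction: float) -> float:
    """Scale an observed pair count up to the whole transcript set.

    Assumes both members of a pair are sampled independently, so a pair is
    observed with probability sampled_fraction ** 2.
    """
    return observed_pairs / sampled_fraction ** 2


print(estimate_total_pairs(56, 0.25))  # 896.0
```

Under the same assumption the estimate is sensitive to the sampled fraction: if RefSeq held one third rather than one quarter of all protein-coding transcripts, the figure would fall to 56 × 9 = 504.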
During the preparation of this manuscript, a list of potential antisense transcripts in the human genome was reported by Lehner et al. (2002). They used RefSeq, as we did, but also used a compilation of vertebrate mRNAs extracted from the EMBL nucleotide sequence database. They reported a total of 87 pairs of genes, 45 of which are in common with our list. Of the 42 gene pairs reported by Lehner et al. (2002) that we did not find, 18 include a sequence from the EMBL compilation that is not represented in RefSeq, and which we did not include in our analysis. These overlapping pairs are of variable and unknown validity but do include some biologically interesting genes. The remaining gene pairs were from a more recent version of RefSeq than that used in our analysis. We report the following 11 unpublished OAT pairs, excluded by Lehner et al. (2002) due to the presence of repeat sequences: TPR/PRG4, LOC51611/FLJ20139, KIAA0764/FLJ10624, CRIPT/LOC51088, VRK2/FLJ10335, HT009/IDI1, APAF1/LOC56899, MDDX28/FLJ20399, LOC51031/FLJ10581, COL9A3/TCFL5, FLJ10508/MCM3AP. However, in all of these pairs, the repeats are not the basis of the complementary pairing between the transcripts. Therefore, as the pairs are transcribed from the same locus, their inclusion is valid.
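The counts in this comparison can be checked against one another with a few lines of Python. The input figures are those stated above; the number of pairs attributed to the newer RefSeq release is not stated in the text and is derived here by subtraction.

```python
# Figures stated in the text.
ours_total = 56      # pairs found in this study
lehner_total = 87    # pairs reported by Lehner et al. (2002)
shared = 45          # pairs common to both lists
from_embl = 18       # Lehner-only pairs involving an EMBL-only sequence

# Derived quantities.
lehner_only = lehner_total - shared            # pairs we did not find
from_newer_refseq = lehner_only - from_embl    # derived, not stated in the text
ours_only = ours_total - shared                # pairs unique to this study

print(lehner_only, from_newer_refseq, ours_only)  # 42 24 11
```

The derived values agree with the 42 pairs not found here and the 11 pairs listed as unique to this study.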
The functional significance of the OATs described herein is largely unknown. However, some of the pairs that we found have been described previously and have been studied functionally (Le Poole et al., 2000; Li et al., 1996, 1997; Sutterluety et al., 1998). Twenty-three of the 56 OAT pairs that we describe involve transcripts containing an open reading frame encoding a protein of unknown function. Further characterization of the transcriptome and proteome is required to test the functionality of such pairs. Expression levels of the two members of an OAT pair might be expected to be inversely related, as is the case for the FGF2 locus. Such further studies may clarify the involvement of such overlapping transcripts in gene regulation. Although we cannot exclude the possibility that some of the overlaps may be incidental and of no functional significance, the existence of double-stranded RNA-specific proteins supports the possibility that OATs constitute part of a significant gene regulatory mechanism. For example, DRADA, a member of the dsRNA-specific adenosine deaminase family of modifying enzymes, is a ubiquitously expressed nuclear enzyme capable of converting adenosine residues in dsRNA molecules to inosines, thereby destabilising the molecule (Kim and Nishikura, 1993). OATs forming dsRNA molecules could also be targets for dsRNA-specific RNases, leading to mRNA degradation.
Functionally significant overlapping antisense transcripts have been reported in prokaryotic cells and are implicated in post-transcriptional regulatory mechanisms (Wagner et al., 2002). Regulatory OATs are also present in eukaryotes, indicating a widespread role for antisense-mediated gene regulation (Vanhée-Brossollet and Vaquero, 1998). With the emergence of complete genome sequence databases, a comparative analysis to test for interspecies conservation of OAT pairs could offer further insights into the prevalence and functional significance of antisense transcription. For example, the structure of the FGF2 gene coding transcript and its corresponding OAT are conserved between human, rat, chicken and frog (Knee and Murphy, 1997). This example provides a starting point upon which to build a comprehensive database of OATs. Moreover, as the annotation of genomes becomes more complete, and methods to detect and characterize non-coding transcripts improve, a more complete database of OATs comprising both coding and validated non-coding OATs may be compiled.
Acknowledgements
This work was carried out as part of the Biopharmaceutical Sciences Network, funded by the Irish Higher Education Authority. The authors wish to thank Ewan Birney and Toby Gibson for advice at an early stage of this project.