A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling
Summary
Arabidopsis thaliana is a long established model species for plant molecular biology, genetics and genomics, and studies of A. thaliana gene function provide the basis for formulating hypotheses and designing experiments involving other plants, including economically important species. A comprehensive understanding of the A. thaliana genome and a detailed and accurate understanding of the expression of its associated genes is therefore of great importance for both fundamental research and practical applications. Such a goal relies on the development of new genetic and genomic resources, involving new methods of data acquisition and analysis. We present here a genome-wide analysis of A. thaliana gene expression profiles across different organs and developmental stages using high-throughput transcriptome sequencing. The expression of 25 706 genes (24 621 of them protein-coding), as well as their stability and their spatiotemporal specificity, was assessed in 79 organs and developmental stages. A search for alternative splicing events identified 37 873 previously unreported splice junctions, approximately 30% of which occurred in intergenic regions. These potentially represent novel spliced genes that are not included in the TAIR10 database. These data are housed in an open-access web-based database, TraVA (Transcriptome Variation Analysis, http://travadb.org/), which allows visualization and analysis of gene expression profiles and differential gene expression between organs and developmental stages.
Introduction
The genome of Arabidopsis thaliana was the first of any plant to be sequenced (Arabidopsis Genome Initiative, 2000), and this important milestone laid the foundation for the Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org), a database that integrates information such as A. thaliana gene functions, mutant phenotypes and expression profiles. In addition, the 1001 genomes project (http://1001genomes.org/) seeks to characterize genetic variation in 1001 A. thaliana accessions. Consequently, the A. thaliana genome has become the gold standard annotated reference plant genome (Berardini et al., 2015), facilitating research in other plants, including those that are commercially important (Rensink and Buell, 2004). However, the A. thaliana genome is far from being fully annotated and characterized, particularly with regard to gene function. Only 9502 out of 33 323 predicted A. thaliana genes have gene names in addition to TAIR identifiers (e.g. LEAFY for AT5G61850, or AGL14 for AT4G11880); a gene name is usually an indication that the gene has a known mutant phenotype or is a member of a known gene family, and that its function has been assessed to some degree. Even automatic gene ontology (GO) annotation leaves ~30% of the A. thaliana genes without annotation. Moreover, only high-level GO terms (such as ‘cellular process’), which do not provide information about the involvement of a gene in a particular process, are available for most of the annotated genes. One useful starting point for functional genomic studies is the characterization of gene expression profiles. In 2005, an expression map for different A. thaliana organs and developmental stages was published (Schmid et al., 2005) and later integrated into an expression browser (Winter et al., 2007).
This map was based on microarray gene expression analysis that, while a breakthrough technology at the time, suffers from several limitations, such as the need for large amounts of RNA (or its amplification) and a low dynamic range (i.e. genes that are expressed at very high or very low levels cannot be reliably quantified) (Shendure, 2008). This is a particular issue with transcription factor encoding genes, whose expression is often tissue specific and at low levels, such as the WUS gene, which regulates stem cell maintenance in the shoot apical meristem (SAM). The expression profile of WUS in an atlas based on microarrays (Schmid et al., 2005) (http://bar.utoronto.ca/efp2/Arabidopsis/Arabidopsis_eFPBrowser2.html) is clearly evident in anthers, but is close to background levels in the SAM. However, in situ hybridization and reporter gene data indicate expression in both the SAM and anthers (Deyhle et al., 2007). An alternative approach to evaluation of gene expression, RNA-seq, does not have these limitations as it allows the analysis of very low amounts of RNA and has a very broad dynamic range: the ratio between the maximum and minimum expression level is 9500 for RNA-seq and 44 for microarrays (Wang et al., 2009). Furthermore, RNA-seq is an open-architecture platform and so is not confined to the analysis of known transcript variants, and allows the identification of new splicing events and new genes.
Detailed transcriptomic maps based on RNA-seq data have been constructed for the model animal species Drosophila melanogaster (Graveley et al., 2011), mouse (Mus musculus) (Pervouchine et al., 2015) and rat (Rattus norvegicus) (Yu et al., 2014), as well as for human (Homo sapiens) (Mele et al., 2015). In the case of plants, only certain developmental stages, organs or conditions have been well characterized (Li et al., 2010; Wuest et al., 2010; Loraine et al., 2013), and a global high resolution transcriptome analysis is lacking. Here, we report the analysis of gene expression levels in 79 A. thaliana organs and developmental stages. The samples were selected in order to maximize the representation of different organs and stages, and to provide insights into the dynamics of gene expression in the most important processes in the life of the plant (transition to flowering, flower development and ovule development), with a special focus on organs and stages not sampled in the microarray-based transcriptome map (Schmid et al., 2005), for example, a detailed shoot apical/inflorescence meristem series and a leaf development series. All samples described were also studied using scanning electron and light microscopy (Table S1 and Data S1). The total dataset includes ~4.3 billion reads, thus giving better resolution and depth than previous studies. This allowed an accurate estimation of parameters such as the number of expressed genes at every stage and in every organ, the identification of the most stably expressed groups of genes and of those with restricted expression patterns, and the characterization of previously undescribed splicing events. The results are summarized in a database, TraVA (Transcriptome Variation Analysis): http://travadb.org/.
This database includes a number of tools for visualization of absolute and relative gene expression and the analysis of differential gene expression between stages and organs, using two of the most reliable and widely accepted statistical approaches, DESeq/DESeq2 and BaySeq (Rapaport et al., 2013; Soneson and Delorenzi, 2013). This database allows researchers to identify differences in gene expression and the statistical significance of those differences, a feature that is not available in existing A. thaliana databases.
Results and discussion
Study design
To construct a comprehensive, high resolution transcriptome map, 79 samples were collected in two biological replicates each from A. thaliana organs at different developmental stages. Samples included parts of the roots, leaves, floral organs and whole flowers, seeds, siliques and stems. Flowers, seeds and leaves were organized into a time series. All samples are described in Table S1 and Data S1 and referred to hereafter as the ‘Map dataset’. To explore the comprehensiveness of the data sets, we also collected samples of leaves from plants exposed to different abiotic stresses as a control dataset (referred to as the ‘Stress dataset’). These included a time course of a cold treatment (1, 3, 6, 12, and 24 h at 4°C), a heat treatment (1, 3, 6, 12, and 24 h at 42°C) and wounding (1, 3, 6, 12, 24 and 48 h after wounding), each of which was performed with two biological replicates.
Transcriptome sequencing
For the Map dataset, 22.7 million uniquely mapped high-quality reads were obtained on average for each sample, giving a total of 3.6 billion (Table S2). For the Stress dataset, 18.9 million uniquely mapped high-quality reads were obtained on average for each sample, giving a total of 606 million reads. Since a PolyA+ selection protocol was used, the main data analyses were conducted on polyadenylated mRNAs and noncoding RNAs. Pearson r² correlation values for all replicates were between 0.83 and 1.0, with a mean value of 0.97 (median 0.98) (Table S3), and a clustering tree of the replicates also indicated consistency of the data (Figure S1).
A hierarchical clustering tree of the samples reflected an organ-specific and age-specific structure, as the different sample series organized into distinct clades and the parts of the plant that contain meristems of various types also clustered together, as did the green parts of the plant. The most divergent samples were the mature pollen and senescent organs (Figure 1).

Clustering of samples.
Hierarchical clustering of samples as represented by a clustering tree. Distance between samples is measured as 1 − Pearson squared correlation coefficient. Groups of similar samples are indicated.
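The distance used for the clustering tree can be sketched as follows. This is a minimal illustration of the 1 − r² measure named in the caption, not the pipeline used in the study; the sample labels and expression vectors are hypothetical, and the linkage method used to build the tree is not stated here, so it is left out.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def sample_distance(x, y):
    """Distance between two expression profiles: 1 - r^2."""
    r = pearson_r(x, y)
    return 1.0 - r * r

# Hypothetical expression vectors (one value per gene) for three samples.
profiles = {
    "leaf_a": [10.0, 200.0, 35.0, 0.0, 80.0],
    "leaf_b": [12.0, 190.0, 40.0, 1.0, 75.0],
    "pollen": [300.0, 5.0, 0.0, 150.0, 2.0],
}

# Pairwise distance table that a hierarchical clustering routine would consume.
names = sorted(profiles)
distances = {
    (a, b): sample_distance(profiles[a], profiles[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

Because the coefficient is squared, perfectly anticorrelated profiles also have a distance of 0 under this measure.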
Expressed genes
The annotation of the A. thaliana genome (TAIR10 https://www.arabidopsis.org/) contains 33 323 genes, of which 27 201 are defined as protein coding. Across all samples we identified 25 706 expressed genes in total, of which 24 621 were protein-coding genes expressed in at least one sample (Table S4). A gene was considered expressed in a sample if it had at least 16 normalized read counts in each of the two replicates of that sample, a threshold chosen to ensure strong support (Su et al., 2014). In total, 10 738 genes (10 654 protein-coding) were expressed in all samples (Table S5), and the lowest number of expressed genes (15 525) was observed in the M1 sample (SAM at vegetative stage) while the greatest number (19 613) was in the F3 sample (flower) (Table S6).
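The expression call described above can be sketched as a simple rule. The threshold of 16 and the requirement that both replicates pass come from the text; the function names and the example counts are hypothetical, and treating the threshold as inclusive (≥16) is an assumption.

```python
MIN_NORMALIZED_COUNT = 16  # threshold stated in the text

def is_expressed(rep1, rep2, threshold=MIN_NORMALIZED_COUNT):
    """A gene is called expressed in a sample only if BOTH replicates
    reach the threshold (assumed inclusive)."""
    return rep1 >= threshold and rep2 >= threshold

def expressed_samples(gene_counts):
    """gene_counts maps sample name -> (replicate 1, replicate 2)
    normalized read counts for one gene; returns the set of samples
    in which the gene is called expressed."""
    return {
        sample
        for sample, (rep1, rep2) in gene_counts.items()
        if is_expressed(rep1, rep2)
    }

# Hypothetical normalized counts for a single gene in three samples.
counts = {"M1": (20.0, 15.9), "F3": (16.0, 40.2), "SD.d": (0.0, 0.0)}
called = expressed_samples(counts)
```

In this example only F3 passes, because one M1 replicate falls just below the threshold.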
To determine whether the Map dataset represented the majority of the expressed genes, we also evaluated gene expression following exposure of the plant to various abiotic stress conditions. Only 96 genes (26 protein-coding) were expressed in the Stress dataset but not in the Map dataset, suggesting that the latter contained almost all expressed genes (Table S7). The expression of 7617 genes was detected in neither the Map nor the Stress dataset, and 2580 of these are annotated as mRNA-coding (Table S8). GO enrichment analysis, as well as overrepresentation analysis of other terms from different databases (INTERPRO, KEGG, SMART, https://www.ebi.ac.uk/interpro/, http://www.genome.jp/kegg/, http://smart.embl-heidelberg.de) was performed for all the non-expressed genes and separately for the protein-coding genes. The enriched terms in the list of mRNA-coding genes that were not expressed included ‘defense response to fungus’, ‘killing of cells of another organism’ and ‘RNA-directed DNA polymerase activity’ (Table S9). We note that our control Stress dataset did not contain expression data associated with imposed biotic stress, which likely explains the absence of expression of genes specific for these conditions. Compared with a previous high-throughput study of gene expression based on microarrays (Schmid et al., 2005), 94% of the 21 150 genes in that study are expressed in our dataset, and for 5877 genes expression is observed in our dataset but not in the microarray-based dataset (Figure S2).
We then assessed the number of samples in which each gene was expressed (Figure 2a). Most of the protein-coding genes tended to be expressed in all or almost all samples (15 296 genes were expressed in 65 or more samples), while some genes were expressed in few samples (4920 genes in 1–15 samples), and fewer genes were expressed in more than 15 and less than 65 samples (4405). We also investigated whether there was a correlation between the number of samples in which a gene was expressed and the expression level (Figure 2b). Mean and median expression levels across all samples were found to be higher for more widely expressed genes (i.e. those expressed in more samples). For maximum and minimum expression levels, the most widely expressed genes also had a greater expression level but the trend was not as prominent for these genes. Similar patterns have previously been observed in a microarray-based transcriptome analysis (Schmid et al., 2005).

Overall expression characteristics.
(a) Distribution of genes by number of samples in which each gene was expressed.
(b) Relationship between gene expression level and number of samples in which a gene is expressed for minimum, mean, median and maximum expression levels of each gene.
(c) Distribution of differences from mean expression level (Z-score) for each gene for selected samples.
RNA-seq analysis based on current A. thaliana gene models (TAIR10) has a known limitation in the cases where two neighboring genes have identical nucleotide sequences (such as AT3G30385 and AT3G30387, AT3G28290 and AT3G28300, or AT5G50580 and AT5G50680) since such genes will be absent from the list of uniquely mapped reads. In order to take these genes into account, we also analyzed the expression data allowing non-unique mapping, which revealed an additional 334 (213 protein-coding) genes expressed in at least one sample. Of these, 159 (143 protein-coding) were expressed in all samples (Table S10 and Figure S3).
Uniformity of gene expression across samples
The Z-score represents the difference between the total read count for a gene in a sample and the mean total read count for that gene across all samples normalized to the standard deviation of the gene's total read counts. This evaluation allows for a comparison of the distribution of gene expression levels in certain samples with an overall ‘mean’ distribution. We transformed normalized total read counts into Z-score values and we presented the Z-score value distribution in histograms, which showed differences in their shapes between samples (Figures 2c and S4). For example, some samples were shifted to the right side (young seeds 1, ovules from 6th and 7th flowers before pollination), and others to the left (opened anthers, mature anthers before opening), see Figure S4. This result primarily reflects high expression levels of genes that are specific for these organs, such as ABCG20 or ATEXP24 in the anthers of mature flowers before opening, or a low overall expression level in a particular part of the plant, as was the case with mature yellow seeds. Despite this variation, in general most samples showed a similar distribution in their expression levels.
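The Z-score transformation defined above can be written out as a short sketch. It assumes the sample standard deviation (n − 1 denominator); the text does not say which estimator was used, and the input values are hypothetical.

```python
from statistics import mean, stdev

def z_scores(counts):
    """Z-score of one gene's normalized read count in each sample:
    (count - mean across samples) / SD across samples.
    Uses the sample SD (n - 1), which is an assumption."""
    mu = mean(counts)
    sd = stdev(counts)
    if sd == 0:
        # A gene with identical counts everywhere has no variation to scale by.
        return [0.0 for _ in counts]
    return [(c - mu) / sd for c in counts]

# Hypothetical normalized counts for one gene across eight samples.
z = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Collecting the values of one sample across all genes gives the per-sample distribution that is plotted as a histogram.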
We next evaluated the ratio of GO terms in each sample. A GO ‘slim’ annotation (i.e. one using more broadly defined terms) was judged not to be appropriate for our analysis due to the breadth of the terms and the relatively few genes in each category, so genes were annotated using the third level of the GO term tree. All samples had a similar ratio of GO terms in the ‘biological processes’, ‘molecular function’ and ‘cellular components’ categories.
The mean expression level and the mean number of samples in which each gene was expressed were also calculated for GO categories at the third level of the GO term tree (Table S11). GO terms containing more than 100 genes were present in almost all samples and had high mean levels of expression (GO:0048856 – anatomical structure development, GO:0044424 – intracellular part), while other terms, such as GO:0070505 – pollen coat, appeared in only a few samples yet had high expression levels, as would be expected for tissue-specific genes.
Differentially expressed (DE) genes
We analyzed differential gene expression between all possible pairs of samples (3081 comparisons in total) using DESeq (Anders and Huber, 2010). The most similar samples were M1 and M2 (meristems before transition to flowering; 14 DE genes), and the most dissimilar were F3 and SD.d (third flower at the stage of anthesis of first flower and dormant seeds; 15 149 DE genes) (Tables S12 and S13). The number of DE genes was also used as a measure of the distance between samples in a hierarchical clustering analysis. As expected, the grouping of samples based on DE genes is concordant with the structure of the plant: mature and old leaves, flowers, young leaves, siliques, meristems are grouped together (Figure S5).
Differentially expressed scores, defined as the number of paired comparisons in which a gene was DE, were calculated to identify genes that are likely to be involved in organ-specific processes, and those that were ubiquitously expressed (Figure 3a). The highest observed DE score was 2533, out of a maximum possible value of 3081, and only 331 genes had a DE score >2300. Notably, most of these genes have gene names (64%, compared with 30% for the whole genome), indicating that they have been the subject of detailed functional analysis, or are members of a known gene family; however, it is not unexpected that such genes are prevalent among the organ-specific subset. Most categories enriched in the gene list with a DE score >2300 were related to photosynthesis and associated metabolism (e.g. GO:0034357 – photosynthetic membrane and GO:0019757 – glucosinolate metabolic process). In contrast, genes with the lowest DE scores (<100) were enriched in GO categories such as ‘RNA processing’, ‘cellular protein localization’ and ‘transport and membrane coat’. The GO enrichment in all the DE gene lists was then analyzed, considering down- and upregulated genes separately. We identified 1528 GO categories that were enriched more than two-fold. We also calculated the enrichment of categories from other databases (see above) and found 1531 categories (Table S14). Among the categories that were enriched in many of the paired comparisons were those related to photosynthesis, reflecting a difference in expression profiles between the green and non-green parts of the plant, as well as terms associated with chromatin and cell division (IPR007125:Histone core, IPR001752:Kinesin, motor region, IPR004367:Cyclin, C-terminal, GO:0000785 – chromatin, GO:0051276 – chromosome organization, GO:0022403 – cell cycle phase), which might reflect differences in expression profiles between rapidly developing and mature organs.
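The DE score and its upper bound can be illustrated with a short sketch. The number of samples (79) and the definition of the score come from the text; the gene identifiers and the DE calls below are hypothetical stand-ins for the per-comparison output of a tool such as DESeq.

```python
from math import comb

N_SAMPLES = 79
# Number of unordered sample pairs, i.e. the maximum possible DE score.
N_COMPARISONS = comb(N_SAMPLES, 2)

def de_scores(de_calls):
    """de_calls maps a sample pair -> set of gene ids called DE in that
    comparison. Returns gene id -> number of comparisons in which it was DE."""
    scores = {}
    for genes in de_calls.values():
        for gene in genes:
            scores[gene] = scores.get(gene, 0) + 1
    return scores

# Hypothetical DE calls for three samples (three pairwise comparisons).
de_calls = {
    ("M1", "M2"): {"geneA"},
    ("M1", "F3"): {"geneA", "geneB"},
    ("M2", "F3"): {"geneA", "geneB"},
}
scores = de_scores(de_calls)
```

With 79 samples, `N_COMPARISONS` evaluates to 3081, matching the number of comparisons reported above.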

Stability of expression.
(a) Distribution of the differentially expressed (DE) Score (number of pair comparisons in which a gene was DE).
(b) Histogram showing the measure of expression width (Shannon entropy).
(c) Venn diagrams showing the intersections of lists of stably expressed genes in the Map and Map&Stress datasets for three thresholds of gene stability.
Stability and specificity of gene expression
To measure the tissue specificity of gene expression, Shannon entropy values were calculated for the expressed genes in the Map dataset (Schug et al., 2005; Lin et al., 2014). Values ranged from 0.0 to 4.57, with low values indicating a narrow pattern of expression and high values denoting ubiquitous expression. Most genes had high entropy values, which is consistent with a wide pattern of expression (Figure 3b). Consistent results were obtained by Li et al. (2012), who used kurtosis analysis to identify genes expressed in two or more, but not all, tissues of A. thaliana and other organisms and showed that the majority of genes are expressed in all tissues. We analyzed GO and other term enrichments for the genes with the lowest and highest Shannon entropy (Table S15), and found that genes with values <0.15 were enriched in ‘cell-cell signaling’, ‘pectinesterase activity’, ‘cell wall’ and ‘endomembrane related’ terms (Table S16), while genes with values >4.53 showed enrichment in ‘nucleic acid transport’, ‘RNA transport’, ‘lipoprotein processes’ and ‘membranes’ (Table S17). When the standard deviation (SD) of gene expression in the Map dataset divided by the mean expression (SD/mean) was used as a measure of expression stability, similar results for GO enrichment were obtained. Forty-seven genes with an SD/mean ratio <0.2, 339 genes with an SD/mean ratio <0.25 and 970 genes with an SD/mean ratio <0.3 were identified (Tables S18–S20). Notably, the DE scores for these genes varied from 0 to 477. Based on this analysis, the genes with stable expression were enriched in terms associated with ‘nucleic acid transport’, ‘RNA transport’, ‘lipoprotein’ and ‘membrane-related processes’, which was consistent with processes that are ubiquitous in plant organs and tissues.
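The two measures used in this section can be sketched as follows. The entropy is computed on a gene's relative expression across samples, in the spirit of Schug et al. (2005); the base-2 logarithm and the use of the population SD are assumptions, as neither is specified in the text, and the input profiles are hypothetical.

```python
from math import log2
from statistics import mean, pstdev

def shannon_entropy(levels):
    """Shannon entropy of a gene's expression profile.
    Each level is converted to a fraction of the gene's total expression;
    a profile confined to one sample gives 0, a flat profile gives log2(n)."""
    total = sum(levels)
    if total <= 0:
        raise ValueError("gene has no expression in any sample")
    return -sum((x / total) * log2(x / total) for x in levels if x > 0)

def sd_over_mean(levels):
    """SD/mean ratio (coefficient of variation); lower means more stable."""
    return pstdev(levels) / mean(levels)

# Hypothetical profiles across four samples.
specific = [120.0, 0.0, 0.0, 0.0]     # expressed in a single sample
ubiquitous = [50.0, 50.0, 50.0, 50.0]  # identical level everywhere

h_specific = shannon_entropy(specific)
h_ubiquitous = shannon_entropy(ubiquitous)
cv_ubiquitous = sd_over_mean(ubiquitous)
```

Both measures agree at the extremes: the flat profile has the maximum entropy for four samples and an SD/mean ratio of zero.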
We also analyzed combined data from the Map and Stress datasets. Using a cut-off of 0.2, 34 genes were identified as stable in the combined Map&Stress dataset (Tables S21–S23); with higher cut-offs the number of stable genes increased markedly (274 genes at a cut-off of 0.25 and 792 at 0.3). The intersection of the stable gene lists from the Map and Map&Stress datasets contained 27 genes at the 0.2 cut-off (Figure 3c and Table S24). These genes had an SD/mean expression ratio of 0.15–0.2 and DE scores varying from 0 to 117, indicating very small differences in expression between samples. Shannon entropy values for these genes ranged from 4.56 to 4.57, further indicating the uniformity of expression in all samples (Table S24). These genes are associated with various processes, including flowering, lysosome transport, chromosome condensation and stress tolerance, and provide a set of candidate reference genes for a wide spectrum of expression analyses by quantitative real-time PCR.
Czechowski et al. (2005) previously identified a set of the most constitutively expressed A. thaliana reference genes. One gene identified in that study (AT4G34270) was also present among the 27 most stably expressed genes identified in our analysis, while six other genes had an SD/mean expression ratio <0.3 in the Map&Stress dataset. Among the ‘traditional’ reference genes, only UBC (AT5G25760), the most stable classic reference gene according to Czechowski et al., has an SD/mean expression ratio <0.3; for the other genes this value varies from 0.41 to 0.96 (Table S25).
Specificity of transcription factor (TF) expression
In order to gain insights into the potential function of previously uncharacterized regulatory genes, we evaluated the Shannon entropy values of genes in different classes of TFs and other transcription regulators (Figure S6). The lowest median entropy values were seen in TF classes such as MADS, LOB, LIM and MYB. Low entropy values, indicating a narrow expression pattern confined to certain organs or developmental stages, are consistent with the known functions of these genes, which include participating in the development of floral organs and leaves and the transition to flowering (Ng and Yanofsky, 2001). At the other extreme of the distribution range were SWI/SNF-SWI3, SNF2, CAMTA, DDT and FAR, which participate in universal cellular processes such as chromatin remodeling, DNA repair, signaling, and response to light (Jerzmanowski, 2007; Lin et al., 2007).
MADS-box genes are perhaps the most thoroughly studied family of TFs and a variety of approaches have been used to determine their function and evolution, including the characterization of mutants and transgenic plants, genome-wide sequence and expression analysis and reporter genes (Ng and Yanofsky, 2001). However, even for this well-studied family, our transcriptome map provided new information regarding the expression pattern of several members, such as AGL97 (AT1G46408) and AGL52 (AT4G11250), which are expressed in pollen, and AGL51, which is expressed in petioles and internodes.
Another gene family with low entropy was the LOB domain (LBD) family, which has about 45 members. The most studied of these is AS2, which controls aspects of leaf development such as establishment of leaf boundaries, venation and polarity (Semiarti et al., 2001; Xu et al., 2003), and its action is known to be mediated by the repression of KNOX genes (Lin et al., 2003). AS2 also acts in floral organs, where its function is partially redundant with another LOB gene, ASL1 (AT5G66870) (Chalfun-Junior et al., 2005). Such redundancy was also suggested for other members of the family, although a study of the LBD proteins showed that the LOB domain from other proteins cannot functionally replace the AS2 LOB domain, suggesting that the degree of redundancy between AS2 and other LBD proteins is limited (Matsumura et al., 2009). Consistent with this conclusion, we observed divergent expression patterns amongst the LBD genes, some of which were very broadly expressed (e.g. LBD39, LBD37 and LBD11), while others showed narrower expression patterns (LBD10, LBD2 and LBD20). In particular, we found that AT2G31310 (LBD14) likely acts in roots, while AT3G50510 (LBD28) and AT3G13850 (LBD22) function in pollen. We calculated the intra-class Spearman correlation coefficient distribution for each TF class (Figure S7): high correlation values would be expected if expression of all, or most, genes within a class is coordinated, as might occur if the products of these genes constitute a multiprotein complex, such as ribosomal subunits or components of the photosystem. Some TFs are known to act in a complex, such as ‘floral quartets’, which are complexes of MADS-box proteins that regulate floral organ identity (Honma and Goto, 2001). We observed that most TFs within each family showed no evidence of coordinated expression as they had correlation values close to 0. 
A lack of coordinated expression among TFs belonging to the same gene family was previously observed under stress conditions (Chen et al., 2002).
Splicing analysis
Recent studies suggest that splicing and alternative splicing (AS) can be major driving forces in the regulation of gene expression in plants, as they can influence transcript complexity, abundance and stability (Reddy et al., 2013). Previous studies of AS in A. thaliana (Filichkin et al., 2010) have been based on short-read and/or low coverage data and so have had to apply somewhat relaxed criteria for recognition of splicing events. This can lead to erroneous detection of splice sites as a consequence of mapping artifacts (Grant et al., 2011; Li et al., 2013). The large number of samples from different organs and conditions, each with two biological replicates, as well as the high sequence coverage in the current study, allowed us to apply more stringent criteria. We considered only splice junctions (SJs) that were supported by at least two uniquely mapped spliced reads. As an additional filtering step we used two criteria: FI, where an SJ is taken into consideration if it is observed in at least two samples out of 158, and the more stringent FII criterion, where the SJ must be present in both replicates of a sample. Two protocols for mapping and SJ detection were used: the first was based on STAR (hereafter referred to as map-STAR) and the other on bowtie2 software (map-TopHat2) (Table S26). A total of 133 600 SJs were found by both mapping approaches after applying the FII filter; map-STAR also revealed 17 500 SJs that were not identified by map-TopHat2 while, conversely, map-TopHat2 found 7800 SJs that were not detected by map-STAR. A more detailed examination of these splice junctions showed that of the 7800 map-TopHat2 unique SJs, only 1316 did not correspond to SJs predicted by the map-STAR protocol, while another 1360 were also predicted by map-STAR but did not pass the FII filter.
Moreover, another 5124 SJs overlapped with map-STAR SJs but had alternative 3′- and/or 5′-ends, including examples of exon-skip or alternative acceptor splice sites. As STAR is optimized for spliced read alignment (Dobin et al., 2013) and also identified more SJs than the map-TopHat2 analysis, we used the map-STAR SJs in subsequent analyses. In total, mapping using the map-STAR protocol indicated 348 971 total possible splice junctions, 116 762 of which were already annotated in TAIR10 (hereafter referred to as TAIR10 SJs). However, many of these junctions were poorly supported, so after filtration according to the FI criterion 221 187/115 686 (All SJs/TAIR10 SJs, respectively) remained, while after FII filtering there were 151 209/113 336. We concluded that we had identified 37 873 new SJs after applying the FII filter, but to check the sensitivity of our analytical pipeline to mapping artifacts, the same analysis was performed using a simulated read dataset from the A. thaliana genome. Mapping of 5 billion simulated reads resulted in 4760 predicted SJs; however, only 28 of these were also found in the total SJ set that was derived from our experimental RNA-seq data, suggesting that the contribution of mapping artifacts to newly discovered SJs was negligible. Importantly, two features that distinguish our SJ data set from those resulting from previous studies of splicing in A. thaliana are the higher number of reads and the greater diversity of samples, both of which contributed to the detection of new SJs. A comparison of the frequency of TAIR10/new SJs across samples showed that most TAIR10 SJs were observed in all 79 samples, and that few new SJs were present in all samples (Figure 4a). In contrast, new SJs were prevalent among SJs that were observed in few samples (1–10) (Figure 4b). Furthermore, a comparison of the proportion of TAIR10/new SJs within each sample revealed that only a small fraction (~5%) represented new SJs (Figure S8).
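The FI and FII filters described above can be expressed as two small predicates. This is a sketch under one reading of the criteria: a junction counts as observed in a library only when it has at least two uniquely mapped spliced reads there, FI requires observation in at least two of the 158 libraries, and FII requires observation in both replicates of at least one sample. The data layout and the example junctions are hypothetical.

```python
MIN_SPLICED_READS = 2  # minimum uniquely mapped spliced reads (from the text)

def observed(reads):
    """A junction is treated as observed in a library if it has enough reads."""
    return reads >= MIN_SPLICED_READS

def passes_fi(support):
    """FI: junction observed in at least two libraries.
    support maps (sample, replicate) -> spliced read count."""
    return sum(1 for reads in support.values() if observed(reads)) >= 2

def passes_fii(support):
    """FII: junction observed in both replicates of at least one sample."""
    replicates_by_sample = {}
    for (sample, replicate), reads in support.items():
        if observed(reads):
            replicates_by_sample.setdefault(sample, set()).add(replicate)
    return any(len(reps) >= 2 for reps in replicates_by_sample.values())

# Hypothetical read support for three junctions.
junction_x = {("M1", 1): 3, ("F3", 2): 5}   # two libraries, different samples
junction_y = {("F3", 1): 2, ("F3", 2): 4}   # both replicates of one sample
junction_z = {("M1", 1): 10, ("M1", 2): 1}  # second replicate under-supported
```

Under this reading every junction that passes FII also passes FI, which is consistent with FII being described as the more stringent filter.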

Splice junction (SJ) distribution in samples and discovery rate.
(a, b) Number of splice junctions (y-axis – millions of SJs) detected simultaneously in k samples (x-axis – number of samples) after applying filter FI (a) and filter FII (b).
(c, d) SJ discovery rate as a function of the number of reads taken randomly from the total read set for SJs passing filter FI (c) or filter FII (d). The final three points on each graph show the effect of adding the stress RNA-seq dataset on the new SJ discovery rate. SJs corresponding to TAIR10 annotated introns are shown in blue and new introns are shown in red.
To determine how the number of reads affects the discovery of new SJs, random subsampling of our dataset was used. All reads from the 158 samples were combined into a single dataset and mixed, and then subsets of 50, 100, 250, 500, 750, 1000, 1500, 3000 or 4000 million reads were randomly selected and subjected to splice site identification using the map-STAR protocol. The dependence of SJ discovery on the number of reads was clearly different between TAIR10 SJs and new SJs: while identification of TAIR10 SJs almost reached saturation (~95% of SJs were found) at 100 million reads, reaching 95% detection of new SJs passing the FII filter required 1500 million reads, and 3000 million reads were required for those passing FI (Figure 4c,d). These comparisons demonstrated that the new SJs were rare, confined to specific organs or developmental stages, and that the corresponding transcripts were expressed at low levels, thus requiring high numbers of reads in order to detect them. In order to determine whether our set of new SJs is comprehensive, or whether we might identify more upon targeting another organ or condition, we also analyzed RNA-seq data from plants exposed to three abiotic stress conditions (cold, high temperature and wounding). The additional stress-related expression data yielded relatively few additional SJs (7% and 5% for FI and FII, respectively) (Figure 4c,d). Upon examining the location of the new SJs we observed that 10 536 are in intergenic regions and 27 337 are present in TAIR10 annotated genes, which means that ~72% of the new SJs represent previously unreported splice variants of known genes. In addition, transcripts from 90 annotated transposable elements (TEs) were found to be spliced, providing support for the expression of these TEs.
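The junction counts reported in this section are mutually consistent, as the following arithmetic check shows. All numbers are taken from the text; only the variable names are introduced here.

```python
# Junction counts after FII filtering (from the text).
ALL_SJ_FII = 151_209
TAIR10_SJ_FII = 113_336

# New junctions are those passing FII that are not annotated in TAIR10.
new_sj = ALL_SJ_FII - TAIR10_SJ_FII

# Location of the new junctions (from the text).
INTERGENIC_SJ = 10_536
GENIC_SJ = 27_337

# Share of new junctions inside annotated genes versus intergenic regions.
genic_pct = 100 * GENIC_SJ / new_sj
intergenic_pct = 100 * INTERGENIC_SJ / new_sj

print(f"new SJs: {new_sj}")
print(f"in annotated genes: {genic_pct:.1f}%")
print(f"intergenic: {intergenic_pct:.1f}%")
```

The genic share is the ~72% quoted above, and the intergenic share of about 28% corresponds to the "approximately 30%" given in the Summary.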
To conclude, our data reveal that TAIR10 annotations capture most of the frequently and/or highly expressed gene isoforms, but do not include a large portion of rare, and likely highly tissue-/condition-specific, isoforms. This is consistent with recent studies indicating that AS in A. thaliana can be underestimated (Kwon et al., 2014), specifically showing that the ELF3 (AT2G25930) and TOC1 (AT5G61380) genes undergo extensive AS (also confirmed in the current study), although no corresponding isoforms are indicated in TAIR10.
We have created a high resolution transcriptome map of A. thaliana based on RNA-seq data, which includes 79 samples, each with two biological replicates, corresponding to different developmental stages and parts of roots, leaves, flowers, seeds, siliques and stems. Collectively, the expression data contained most annotated protein-coding genes (24 621 out of 27 201) and it was notable that the addition of an independent expression dataset derived from plants exposed to stress did not substantially increase the number of expressed genes. Amongst the non-expressed genes we found GO enrichment in categories related to defense against biotic stresses, such as responses to fungi and viruses. This reflects the fact that the plants were not exposed to such conditions and highlights the condition-specific roles of these genes. We also determined that the genes that were most constitutively expressed across samples were enriched in GO categories related to nucleic acid transport and membrane processes. In order to identify potential new regulators, we analyzed the Shannon entropy values and expression patterns of TF genes (Figures S6 and S7), and found that TFs that regulate ‘local’ biological processes, such as the transition to flowering, patterning of lateral organs and root development (MADS, LBD, MYB), had the lowest median entropy. Conversely, SWI/SNF-SWI3 and SNF2, which regulate general processes such as chromatin packing, had median Shannon entropy values near the maximum. We examined the expression patterns of several TFs that have not been the subject of previous studies, and these patterns suggested organ-specific regulatory functions, as exemplified by LBD14 in roots and both LBD22 and LBD28 in pollen.
Remarkably, all the samples showed a similar distribution of gene expression levels, with minor variations between samples, as well as similar ratios of GO categories. This underlines the relative uniformity of global gene expression throughout the plant, and suggests that functional specialization in different organs and tissues is associated with changes in the expression of relatively few genes, rather than with wholesale shifts in the transcriptome profile. These data represent a valuable resource for the plant science community and are accessible in the public database TraVA (http://travadb.org/). Extensive sampling and the exceptionally large amount of expression data allowed us to identify a substantial number of new SJs, including those in regions that were annotated as non-coding in the current version of the A. thaliana genome annotation (TAIR10). These regions potentially represent new genes and we expect that this information will assist in more accurate genome annotation.
Experimental procedures
Plant growth and sample collection
Arabidopsis thaliana (ecotype Col-0; accession CS70000) plants were obtained and grown as described in Klepikova et al. (2015). Samples were hand dissected as described in Table S1; each sample contained tissue from 15 individuals in two biological replicates. Harvested samples were collected from 10 to 11 h after dawn and fixed in RNALater (Qiagen, Venlo, Netherlands).
For the Stress dataset, plants were grown in the same way as for the Map dataset except that they were vernalized (kept for 3 days at 4°C). Three weeks after germination, plants were exposed to low or high temperatures or to mechanical wounding. For the low temperature treatment, the temperature in the climate chamber was set to +4°C. The third leaf was collected from 15 plants in two replicates after 1, 3, 6, 12 or 24 h of cold treatment. The high temperature treatment was conducted similarly, with a temperature of +42°C. For the mechanical wounding treatment, the third leaf of 15 plants in two replicates was pierced with a needle and then collected at 1, 3, 6, 12, 24 or 48 h after wounding.
RNA extraction and sequencing
Total RNA was extracted using an RNeasy Plant Kit (Qiagen) following the manufacturer's protocol. Illumina cDNA libraries were constructed with the TruSeq RNA Sample Prep Kits v2 (Illumina, San Diego, CA, USA) following the manufacturer's protocol. Sequencing of the cDNA libraries was performed using an Illumina HiSeq2000 with a 50-bp read length and a sequencing depth of ~20 million uniquely mapped reads.
Sequence trimming, mapping and expression level determination
Reads were trimmed using the CLC Genomics Workbench 6.5.1 (CLC bio, Denmark) with the following parameters: ‘quality scores – 0.005; trim ambiguous nucleotides – 2; remove 5′ terminal nucleotides – 1; remove 3′ terminal nucleotides – 1; discard reads below length 25’. Trimmed reads were mapped using the RNA-seq mapping algorithm implemented in the CLC Genomics Workbench to the reference A. thaliana genome (TAIR10) allowing only unique mapping (length fraction = 1, similarity fraction = 0.95). In order to estimate the influence of non-uniquely mapped reads on gene expression we also mapped reads using the same software and parameters as indicated above, but allowing multiple mapping (up to 10 hits). For each gene, total gene reads (TGR) was determined as the sum of all reads mapped to this gene. To avoid bias due to different library sizes, TGR values were normalized by a size factor as described in Anders and Huber (2010).
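The size-factor normalization step can be illustrated with a minimal sketch. It assumes the median-of-ratios estimator described by Anders and Huber (2010); the function names and the input layout (one list of raw TGR values per sample, in the same gene order) are hypothetical and chosen only for illustration, and the analysis itself relied on the published software rather than on this code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw TGR values,
    with genes in the same order for every sample (assumed layout).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are excluded from the reference.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    if all(m is None for m in log_geo_means):
        raise ValueError("no gene has non-zero counts in every sample")

    # Size factor = median ratio of a sample's counts to the reference.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - m
            for g, m in enumerate(log_geo_means)
            if m is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalize(counts):
    """Divide each sample's TGR values by that sample's size factor."""
    factors = size_factors(counts)
    return {s: [v / factors[s] for v in vals] for s, vals in counts.items()}
```

Because the factor is a median over genes, a small number of very highly expressed genes has little influence on the scaling, unlike normalization by total library size.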
Determination of expressed genes
Genes with a normalized TGR of ≥16 in both replicates of a sample (the threshold recommended in Su et al., 2014) were considered as expressed in that sample. Expressed genes were defined as genes expressed in at least one sample.
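The rule above can be written as a short sketch. The input layout (a nested dictionary of normalized TGR values for the two replicates of each sample) and the function names are assumptions made for illustration only.

```python
THRESHOLD = 16  # normalized TGR cutoff stated in the text


def expressed_in_sample(rep1, rep2, threshold=THRESHOLD):
    """A gene is expressed in a sample if both replicates reach the cutoff."""
    return rep1 >= threshold and rep2 >= threshold


def expressed_genes(norm_tgr, threshold=THRESHOLD):
    """Return the genes expressed in at least one sample.

    norm_tgr: dict gene -> dict sample -> (rep1, rep2) normalized TGR values
    (hypothetical layout).
    """
    return {
        gene
        for gene, samples in norm_tgr.items()
        if any(expressed_in_sample(r1, r2, threshold) for r1, r2 in samples.values())
    }
```

Requiring both replicates to pass means that a gene with a high count in only one replicate of a sample is not counted as expressed in that sample.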
Z-Score determination

Identification of DE genes
Differentially expressed genes were identified using the R package ‘DESeq’ (Anders and Huber, 2010). A false discovery rate (FDR) of 0.05 and a fold change of 2.0 were chosen as the threshold for significantly differential expression.
DE score determination
For the simultaneous analysis of 3081 sample comparisons, an additional correction for multiple testing was applied. All FDR values for the 33 323 genes (as calculated by ‘DESeq’) in the 3081 sample comparisons were pooled, and the ‘p.adjust’ function from the R package ‘stats’ was applied to these values to calculate a corrected FDR. A gene was considered DE in a comparison when this corrected FDR was <0.05 and the fold change was >2. The DE score of a gene was then defined as the number of sample comparisons in which the gene was differentially expressed.
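The DE score calculation can be sketched as follows. Two points are assumptions rather than statements from the text: the pooled values are re-adjusted with the Benjamini-Hochberg procedure (the method passed to ‘p.adjust’ is not specified), and the fold change criterion is applied symmetrically (>2 or <1/2). The function names and the row layout are hypothetical.

```python
def bh_adjust(values):
    """Benjamini-Hochberg adjustment of a list of p-like values."""
    n = len(values)
    # Walk from the largest value down, enforcing monotonicity.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def de_scores(rows, fdr_cutoff=0.05, fc_cutoff=2.0):
    """Count, per gene, the comparisons in which it is called DE.

    rows: list of (gene, comparison, fdr, fold_change) tuples pooled over
    all pairwise sample comparisons (hypothetical layout).
    """
    pooled = bh_adjust([row[2] for row in rows])
    scores = {}
    for (gene, _comparison, _fdr, fold_change), adj in zip(rows, pooled):
        scores.setdefault(gene, 0)
        # Assumption: up- and downregulation are treated symmetrically.
        changed = fold_change > fc_cutoff or fold_change < 1.0 / fc_cutoff
        if adj < fdr_cutoff and changed:
            scores[gene] += 1
    return scores
```

With 79 samples there are 79 × 78 / 2 = 3081 pairwise comparisons, so the DE score of a gene ranges from 0 to 3081.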
GO enrichment analysis
Lists of downregulated and upregulated DE genes were analyzed for enrichment of GO terms and other annotations (such as keywords or protein domains) using the DAVID gene functional annotation tool, with an FDR value of 0.05 and a fold change of category representation of 2.0 as the thresholds of significance (Huang et al., 2009a,b).
GO categories at the third level of the GO tree
To obtain GO categories at the third level of the GO tree, OBO-Edit2 software was used (Day-Richter et al., 2007).
Hierarchical clustering
A hierarchical tree was constructed using the ‘hclust’ function from the R package ‘stats’ (R Core Team, 2013).
Definition of stable genes
For each gene, the mean and SD of the expression levels were calculated for the ‘Map’ and ‘Stress’ datasets and for the combined datasets. Genes with a ratio of SD/mean <0.2, <0.25 or <0.3 were considered to be stable for the respective datasets.
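The stability criterion is the coefficient of variation (SD/mean) compared against a dataset-specific cutoff. The sketch below assumes the sample standard deviation (n − 1 denominator), which the text does not specify; the function names and the input layout are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(levels):
    """SD/mean of a gene's expression levels across samples."""
    m = mean(levels)
    if m == 0:
        return float("inf")  # a gene with no expression is never 'stable'
    return stdev(levels) / m  # assumption: sample SD


def stable_genes(expr, cutoff):
    """Return genes whose SD/mean ratio is below the cutoff.

    expr: dict gene -> list of expression levels (at least two samples).
    cutoff: 0.2, 0.25 or 0.3 depending on the dataset, as given in the text.
    """
    return {
        gene
        for gene, levels in expr.items()
        if coefficient_of_variation(levels) < cutoff
    }
```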
Shannon entropy
To identify genes with narrow or wide patterns of expression, Shannon entropy (H) values were calculated for each gene as in Schug et al. (2005). As several samples were combined in a developmental series but others were not, samples were grouped using hierarchical clustering: samples with a distance <0.3 were grouped (the sample combination is described in Table S15) and gene expression levels were averaged.
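As an illustration, the entropy of a gene can be computed from its relative expression across sample groups, H = −Σ p_t log2 p_t with p_t = w_t / Σ w_t, where w_t is the averaged expression level in group t. The sketch below assumes this standard form; the function names and input layout are hypothetical.

```python
import math


def average_groups(levels, groups):
    """Average a gene's expression within each group of clustered samples.

    levels: dict sample -> expression level
    groups: list of lists of sample names (hypothetical layout)
    """
    return [sum(levels[s] for s in group) / len(group) for group in groups]


def shannon_entropy(levels):
    """Shannon entropy (bits) of a gene's relative expression profile."""
    total = sum(levels)
    if total == 0:
        return 0.0  # convention for genes with no expression
    h = 0.0
    for level in levels:
        if level > 0:
            p = level / total
            h -= p * math.log2(p)
    return h
```

A gene expressed in a single group has H = 0, whereas a gene expressed equally in all N groups reaches the maximum value of log2 N, so low values indicate narrow expression patterns and high values indicate broad ones.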
Scanning electron microscopy (SEM)
After fixation in 70% ethanol, samples were transferred to 80% ethanol for 15 min, 96% ethanol for 15 min, ethanol:acetone (1:1) for 1 h and then fresh acetone three times, each for 30 min. Then samples were dried in a critical-point dryer, mounted on iron stages and coated with platinum and palladium at 10–20 nm thickness. Imaging was carried out using an electron microscope, JSM-6380 (JEOL, Tokyo, Japan), with an acceleration voltage of 15–20 kV. SEM images were processed using Adobe Photoshop.
Discovery of new SJs
Our protocol for SJ detection, referred to as map-STAR, mapped RNA-seq reads to the reference genome (A. thaliana TAIR10 release) with STAR (v.2.4.0) (Dobin et al., 2013), using the following settings: --outFilterMismatchNmax 2, --outSJfilterCountUniqueMin 3 1 1 1, --outSJfilterCountTotalMin 3 1 1 1, --alignIntronMin 15. The resulting alignments were converted into binary format with SAMtools v.018 (Li et al., 2009), and the binary alignment files were processed with bam2hints (with the parameter ‘--intronsonly’) from the AUGUSTUS v.2.7 package (Stanke et al., 2006). SJ sets were obtained for both replicates of the 79 samples (a total of 158 sets). To remove poorly supported SJs and possible artifacts, we used two filters. SJs passed FI if they were found in at least two of the 158 sets, and SJs passed FII if they were found in both replicates of the same sample. FII was stricter, so SJs passing FII automatically passed FI. A similar procedure was performed using TopHat2 (v.2.0.10) (Kim et al., 2013) with bowtie2 (v.10) (Langmead and Salzberg, 2012) and downstream processing with SAMtools and bam2hints. Random simulations of 50, 100, 500 or 5000 million reads from the A. thaliana genome were also performed and these sets were processed using the map-STAR method.
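The two filters can be expressed as simple set operations. In the sketch below, the representation of an SJ as a (chromosome, start, end) tuple, the dictionary keyed by (sample, replicate) and the function name are assumptions made for illustration.

```python
from collections import Counter


def filter_junctions(sj_sets):
    """Apply filters FI and FII to per-replicate SJ sets.

    sj_sets: dict (sample, replicate) -> set of SJs, each SJ being a
    hashable value such as (chromosome, start, end) (hypothetical layout).
    Returns (fi, fii): the sets of SJs passing each filter.
    """
    # FI: the SJ is present in at least two of all the sets.
    occurrences = Counter()
    for junctions in sj_sets.values():
        occurrences.update(junctions)
    fi = {sj for sj, n in occurrences.items() if n >= 2}

    # FII: the SJ is present in both replicates of the same sample.
    by_sample = {}
    for (sample, _replicate), junctions in sj_sets.items():
        by_sample.setdefault(sample, []).append(junctions)
    fii = set()
    for replicates in by_sample.values():
        if len(replicates) >= 2:
            fii |= set.intersection(*replicates)

    return fi, fii
```

Any SJ found in both replicates of a sample is necessarily present in at least two sets, which is why FII is a subset of FI.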
Random saturation test
All reads from the 158 samples (79 points with two replicates each) were pooled into one set. Sets of 50, 100, 250, 500, 750, 1000, 1500, 3000 and 4000 million reads were randomly selected from this pool and mapped with STAR onto the reference genome. SJs were extracted with bam2hints (‘--intronsonly’). A final three points were obtained by adding the stress data: first the cold stress reads, then the high temperature stress reads and finally the wound stress reads, giving a total of 4.6 billion reads. Because the filters could not be applied to a single pooled subset without replicates (e.g. the 100 million read point), for each subset we retained only those SJs that had passed FI or FII during SJ identification with STAR on the full dataset (a total of 158 developmental samples and 30 stress samples).
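The logic of the saturation test can be illustrated with a toy sketch. It replaces the mapping step with a precomputed list that records, for each read, the SJ it spans (or None), which is a simplification of the actual procedure; the function name and input layout are hypothetical.

```python
import random


def saturation_curve(read_junctions, depths, seed=0):
    """Fraction of all SJs recovered at each subsampling depth.

    read_junctions: list with one entry per read, giving the SJ that the
    read spans or None if it spans none (toy stand-in for mapping).
    depths: numbers of reads to draw, each no larger than the pool.
    Returns a list of (depth, fraction_of_SJs_found) tuples.
    """
    rng = random.Random(seed)
    all_sjs = {sj for sj in read_junctions if sj is not None}
    if not all_sjs:
        raise ValueError("the read pool contains no splice junctions")

    curve = []
    for depth in depths:
        # Each depth is an independent random draw without replacement.
        subset = rng.sample(read_junctions, depth)
        found = {sj for sj in subset if sj is not None}
        curve.append((depth, len(found) / len(all_sjs)))
    return curve
```

In such a simulation, SJs supported by many reads are recovered at shallow depths, whereas SJs supported by only a few reads require draws approaching the size of the whole pool, which mirrors the difference observed between TAIR10 and new SJs.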
Accession numbers
The Illumina sequence reads have been deposited into the NCBI Sequence Read Archive (project ID PRJNA314076 for Map dataset and project ID PRJNA324514 for Stress dataset). Sequence reads for the meristem samples are available in the NCBI Sequence Read Archive (project ID PRJNA268115).
Acknowledgements
The authors are grateful to Alexey S. Kondrashov for providing access to high-throughput sequencing facilities (created under project no. 11.G34.31.0008) and to Artur Zalevsky for help with database support. Preliminary results were obtained with support from the Russian Foundation for Basic Research grant no. 12-04-33032. Sequencing and final results were obtained through the Russian Science Foundation grant project no. 14-50-00150. Plant growth and morphological analysis were performed using facilities at the Lomonosov Moscow State University, Department of Genetics. SEM was performed at the Laboratory of Electron Microscopy of the Lomonosov Moscow State University Biological Faculty. We thank PlantScribe (http://www.plantscribe.com/) for editing this manuscript.
Author contributions
AVK collected plant material, generated images, carried out most of the computational analyses and participated in manuscript writing. ASK developed the database. ESG carried out the splicing analysis. MDL participated in the design and coordination of the study, contributed to sequencing and writing. AAP conceived and coordinated the study, constructed the transcriptome libraries, designed the final figures and participated in the sequencing and computational analysis. All authors read and approved the final manuscript.
Conflict of interest statement
The authors declare no conflicts of interest.