MP3RNA-seq: Massively parallel 3′ end RNA sequencing for high-throughput gene expression profiling and genotyping
Edited by: Long Mao, Institute of Crop Sciences, CAAS, China.
Abstract
Transcriptome deep sequencing (RNA-seq) has become a routine method for global gene expression profiling. However, its application to large-scale experiments remains limited by cost and labor constraints. Here we describe a massively parallel 3′ end RNA-seq (MP3RNA-seq) method that introduces unique sample barcodes during reverse transcription to permit sample pooling immediately following this initial step. MP3RNA-seq allows for handling of hundreds of samples in a single experiment, at a cost of about $6 per sample for library construction and sequencing. MP3RNA-seq is effective for not only high-throughput gene expression profiling, but also genotyping. To demonstrate its utility, we applied MP3RNA-seq to 477 double haploid lines of maize. We identified 19,429 genes expressed in at least 50% of the lines and 35,836 high-quality single nucleotide polymorphisms for genotyping analysis. Armed with these data, we performed expression and agronomic trait quantitative trait locus (QTL) mapping and identified 25,797 expression QTLs for 15,335 genes and 21 QTLs for plant height, ear height, and relative ear height. We conclude that MP3RNA-seq is highly reproducible, accurate, and sensitive for high-throughput gene expression profiling and genotyping, and should be generally applicable to most eukaryotic species.
INTRODUCTION
Quantification of transcript levels is fundamental to understanding the molecular function and regulation of any gene. Accordingly, various approaches have been developed to measure the transcript levels of genes, including Northern blot analysis (Alwine et al., 1977), reverse transcription followed by quantitative polymerase chain reaction (PCR; Beckerandré and Hahlbrock, 1989; Vy and Filion, 2014), microarray-based technologies (Schena et al., 1995; Conway and Schoolnik, 2003; Hoheisel, 2006), serial analysis of gene expression (SAGE; Velculescu et al., 1995), massively parallel signature sequencing (MPSS; Brunner, 2000), and transcriptome deep sequencing (RNA-seq; Wang et al., 2009; Mutz et al., 2013). The emergence and rise of these approaches have greatly promoted our understanding of gene expression and its underlying regulatory mechanisms. In particular, the advent of next-generation sequencing technologies has revolutionized our ability to profile transcript levels. With current attainable read lengths of over 100 nucleotides, RNA-seq has become a routine method of choice, as it supports expression analysis of nearly all genes with less bias than other methods, with tremendous dynamic detection range (Wang et al., 2009; Hrdlickova et al., 2017).
A typical RNA-seq experiment involves messenger RNA (mRNA) isolation and purification, fragmentation, reverse transcription, ligation of sequencing adapters and final library amplification. However, as samples are processed separately during library construction, traditional RNA-seq experiments do not scale up easily and are constrained by cost and labor. In addition, a prototypical RNA-seq analysis requires sequencing reads that cover the entire transcript derived from each gene, thereby necessitating sufficient sequencing depth to obtain reliable results with statistical power. As an alternative to RNA-seq methods that query the entire coding sequence, a number of targeted 3ʹ end-based RNA-seq methods have been developed, such as polyadenylated (poly(A)) site sequencing (PAS-seq; Shepard et al., 2011), poly(A)-sequencing (poly(A)-seq; Derti et al., 2012), 3ʹT-fill (Wilkening et al., 2013), poly(A)-position profiling by sequencing (3P-seq; Jan et al., 2011), 3ʹ region extraction and deep sequencing (3ʹ READS; Hoque et al., 2013), and Poly(A)-test RNA sequencing (PAT-seq; Harrison et al., 2015). These methods can reduce the extent of sequencing coverage needed for analysis and thus reduce the sequencing cost. Yet these 3ʹ end-based methods still process one RNA sample at a time and do not efficiently handle many samples at once. Unfortunately, many types of study require large numbers of samples, such as high-resolution spatial-temporal gene expression analysis (Liu et al., 2013; Tan et al., 2013; Chen et al., 2014), population-scale analysis of gene expression variation in plant and human genomes (Fu et al., 2013; Battle et al., 2014; Consortium, 2015; Li et al., 2016; Wang et al., 2017), and medical biomarker screening (Garnett et al., 2012; Mcmillan et al., 2018).
While a 3ʹ end sequencing-based method, TranSeq, was recently developed specifically for high-throughput transcriptome analysis (Tzfadia et al., 2018), the RNA fragmentation and selection of poly(A) mRNA steps were still conducted on individual samples, which increased the time, cost and effort of the overall procedure.
Here, we describe a massively parallel 3ʹ end RNA-seq (MP3RNA-seq) method requiring no prior poly(A) enrichment or ribosomal RNA (rRNA) depletion. In our method, single-stranded complementary DNA (cDNA) samples can be pooled immediately after reverse transcription, during which each sample was processed with a unique barcode upstream of the oligo d(T) primer. In addition, we introduced sequencing adapters, loaded onto the Tn5 transposase, by the tagmentation approach, in order to perform fragmentation and attachment of sequencing adapters in a single step. We then captured and enriched the 3ʹ end of transcripts by PCR amplification. These optimized steps greatly simplify the procedure, save on reagents, and reduce the sequencing data needed for MP3RNA-seq library construction, sequencing and analysis. Moreover, we introduced a two-stage barcode structure for MP3RNA-seq, which greatly enhances its throughput. Using samples from maize, Arabidopsis, mouse and human, we demonstrate that MP3RNA-seq displays high reproducibility, accuracy and sensitivity in quantifying gene expression as compared to the typical RNA-seq method. Overall, MP3RNA-seq is a minimalist approach to high-throughput transcriptome profiling, with an average cost of only about $6 per sample for library construction and sequencing, about one-tenth the cost of traditional RNA-seq. Furthermore, MP3RNA-seq is well suited for genotyping, as its cost is as low as that of genotyping-by-sequencing (GBS), a widely used restriction enzyme-based genotyping method (Elshire et al., 2011), but it offers the distinct advantage of allowing the detection of single nucleotide polymorphisms (SNPs) in genic regions.
To demonstrate the power of MP3RNA-seq, we performed expression quantitative trait locus (eQTL) mapping on 477 double haploid (DH) maize lines and identified 25,797 eQTLs for 15,335 genes, including 117 trans-eQTL hotspots. We also performed classical QTL mapping for plant height (PH), ear height (EH) and relative ear height (EH/PH), resulting in the identification of 21 QTLs, including a region between 20 Mb and 30 Mb on chromosome 2 that overlapped for all traits. Together, our results demonstrate the versatility of MP3RNA-seq for high-throughput gene expression and genotyping analysis.
RESULTS
Overview of MP3RNA-seq
MP3RNA-seq (Figure 1) comprises six steps. (i) Total RNA is extracted from individual samples, quantified, and dispensed across 96-well plates. As total RNA is used as template for reverse transcription, no prior poly(A) enrichment or rRNA depletion is needed. (ii) Reverse transcription is performed in each well, using a primer containing an oligo d(T) sequence (to capture mRNAs), unique molecular identifiers (UMIs), a first barcode, and the 3ʹ Illumina sequencing adapter (Figure S1). The inclusion of UMIs at this step allows the identification of PCR duplicates and thus improves quantification accuracy (Kivioja et al., 2012; Islam et al., 2014; Smith et al., 2017). Each well receives a distinct first barcode to distinguish samples. (iii) The first-strand cDNA samples from the same plate are pooled into a single tube, followed by synthesis of the second strand. (iv) Tagmentation of the double-stranded cDNAs is mediated by the Tn5 transposase, pre-assembled with the 5ʹ Illumina sequencing adapter. (v) Polymerase chain reaction amplification is performed using primers complementary to the 3ʹ sequencing adapter introduced by reverse transcription and the 5ʹ sequencing adapter introduced by Tn5 insertion, such that the 3ʹ ends of transcripts are specifically captured in the PCR product. At this stage, a second barcode is introduced for each pool by PCR to help distinguish different pools. (vi) The resulting libraries are sequenced as paired-end Illumina libraries, to generate counts for the 3ʹ ends of transcripts. Sequencing reads are assigned to the corresponding samples according to the combinatorial indexing from the first and second barcodes. The introduction of the first barcode during reverse transcription allows parallel processing of many samples in a single experiment. Moreover, the choice of a 3′ end-focused sequencing strategy reduces the total amount of sequence data for transcriptome profiling.
Thus, the MP3RNA-seq method is not only high-throughput, but also cost-effective. A rough estimate of the total cost associated with MP3RNA-seq is about $6 per sample, when several hundred samples are processed in parallel; this price includes library construction and sequencing costs (Table S1).
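The combinatorial assignment of reads to samples described in step (vi) can be illustrated with a minimal Python sketch. The read layout (barcode 1 followed directly by the UMI at the start of one read), the barcode and UMI lengths, and all function and variable names are assumptions made for illustration only; the actual primer structure is given in Figure S1 and may differ.

```python
from collections import defaultdict

# Assumed layout: the barcode-bearing read starts with barcode 1, then the UMI.
BC1_LEN = 8   # hypothetical length of the first (well) barcode
UMI_LEN = 10  # hypothetical length of the UMI


def assign_sample(barcode_read_seq, pool_index, bc1_to_well, bc2_to_plate):
    """Return (plate, well, umi) for one read pair, or None if unassigned."""
    bc1 = barcode_read_seq[:BC1_LEN]
    umi = barcode_read_seq[BC1_LEN:BC1_LEN + UMI_LEN]
    well = bc1_to_well.get(bc1)          # first barcode identifies the well
    plate = bc2_to_plate.get(pool_index)  # second barcode identifies the pool
    if well is None or plate is None or len(umi) < UMI_LEN:
        return None
    return plate, well, umi


def demultiplex(read_pairs, bc1_to_well, bc2_to_plate):
    """Group (read_id, umi) tuples by the (plate, well) barcode combination."""
    by_sample = defaultdict(list)
    for read_id, barcode_read_seq, pool_index in read_pairs:
        hit = assign_sample(barcode_read_seq, pool_index, bc1_to_well, bc2_to_plate)
        if hit is not None:
            plate, well, umi = hit
            by_sample[(plate, well)].append((read_id, umi))
    return dict(by_sample)
```

This sketch uses exact barcode matching only; a production pipeline would typically also tolerate a small number of sequencing errors in the barcodes.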

Schematic representation of the massively parallel 3′ end RNA-seq (MP3RNA-seq) workflow
Reverse transcription is performed separately for each sample with a barcoded reverse transcription primer. The samples from the same 96-well plate are then pooled, and second-strand synthesis, tagmentation and polymerase chain reaction (PCR) enrichment are performed in turn. Barcode 1 and barcode 2 are introduced via the reverse transcription reaction and PCR amplification, respectively. Sequencing reads can be assigned to the samples according to the combinatorial indexing of barcodes 1 and 2.
Technical evaluation of MP3RNA-seq
To assess the technical performance of our method, we applied MP3RNA-seq to stem and leaf tissue collected from the maize inbred lines Chang7-2 and PHBA6, as well as stems harvested from Arabidopsis plants. For a point of comparison, we also determined the transcriptome profile of these samples by following a typical RNA-seq protocol. The number of MP3RNA-seq reads generated for the different tissues ranged from 0.75 to 2.35 million, of which 58% on average mapped uniquely to the corresponding maize and Arabidopsis reference genomes (Table S2). Duplicated reads, which harbor the same UMI, were then removed from the uniquely mapped reads. Of the reads that mapped uniquely and were not duplicated, about 82.2% mapped to exonic regions (Table S2), a percentage lower than the 95.3% obtained with the typical RNA-seq method (Table S3). This might be because MP3RNA-seq sequences only the 3ʹ ends of transcripts, which reduces the proportion of reads stemming from exonic regions relative to typical RNA-seq data, in which reads evenly covered the transcript but were depleted at both the 5ʹ and 3ʹ ends (Figure 2A). Together, these results demonstrate the specificity of MP3RNA-seq for capturing the 3ʹ end of poly(A) transcripts.

Demonstration of technical performance of massively parallel 3′ end RNA-seq (MP3RNA-seq)
(A) Comparison of read coverage along the transcripts between MP3RNA-seq and RNA-seq. (B) Scatter plot of unique molecular identifier (UMI) counts from each sample of maize and Arabidopsis for MP3RNA-seq. (C) Technical reproducibility of gene expression levels quantified by MP3RNA-seq. (D) Comparison of gene expression levels quantified by MP3RNA-seq and RNA-seq. (E) Comparison of the number of expressed genes identified by MP3RNA-seq and RNA-seq. The thresholds of transcripts per million > 1 and fragments per kilobase per million > 1 were used to define expressed genes for MP3RNA-seq and RNA-seq, respectively. (F) Correlation between the number of usable reads and the number of expressed genes identified for MP3RNA-seq. Usable reads here represent uniquely mapped and non-duplicated reads.
We generated the MP3RNA-seq data for the maize and Arabidopsis samples in a single experiment. The distribution of UMI counts showed that the reads originating from maize and Arabidopsis samples overwhelmingly (~99.5%) mapped to their corresponding reference genomes (Figure 2B), indicating a low level of crosstalk between the specific barcodes used for each sample. Next, we wished to determine the reproducibility of MP3RNA-seq in quantifying transcript levels. We used only uniquely mapped and non-duplicated reads to estimate normalized transcript levels as transcripts per million (TPM). A comparison of technical replicates illustrated their very high correlations (r = 0.94–0.97; Figures 2C, S2), thus demonstrating the high reproducibility of the MP3RNA-seq method. We also compared relative transcript levels estimated with MP3RNA-seq and typical RNA-seq and found the methods were largely concordant (r = 0.87–0.92) (Figure 2D; Table S2). The observed correlation between the two methods is comparable to that seen with other reported 3ʹ end-focused RNA-seq methods (Wilkening et al., 2013; Harrison et al., 2015). These results attested to the quantification accuracy of the MP3RNA-seq method. Like other transcript-level assay methods, the reproducibility and quantification accuracy of MP3RNA-seq was better for genes with high expression levels, as compared to more weakly expressed genes.
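The quantification step described above (removing duplicated reads that share a UMI, then normalizing to TPM) might be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: duplicates are assumed here to be reads sharing the same gene, mapping position and UMI, and TPM is assumed to be computed from UMI counts without transcript-length normalization, on the reasoning that a 3ʹ end-focused library yields one tag per transcript. The function names are hypothetical.

```python
from collections import defaultdict


def umi_counts(alignments):
    """Count distinct molecules per gene.

    alignments: iterable of (gene_id, position, umi) for uniquely mapped reads.
    Reads sharing gene, position and UMI are collapsed as PCR duplicates
    (an assumed definition of a duplicate).
    """
    seen = set()
    counts = defaultdict(int)
    for gene_id, position, umi in alignments:
        key = (gene_id, position, umi)
        if key not in seen:
            seen.add(key)
            counts[gene_id] += 1
    return dict(counts)


def tpm(counts):
    """Scale per-gene counts so that they sum to one million.

    No length normalization is applied (assumption for 3' end tag data).
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1e6 / total for gene, n in counts.items()}
```

A gene would then be called expressed in a sample when its value exceeds the TPM threshold of one used in this study.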
The sensitivity of MP3RNA-seq is also critical for its potential applications. We detected an average of 19,258 expressed genes in maize and 18,029 expressed genes in Arabidopsis, using a minimum expression of at least one TPM. The number of genes deemed expressed by MP3RNA-seq was comparable to that identified in RNA-seq with a threshold of fragments per kilobase per million (FPKM) greater than one (Figure 2E). Further analysis showed that the expression of about 20,000 genes could be detected for maize and Arabidopsis with 0.8 million usable MP3RNA-seq reads (Figure 2F). MP3RNA-seq also identified an average of 15,153 expressed genes in mouse liver and heart samples and 15,570 expressed genes in human HeLa cells (Table S2). The expression values between the two technical replicates were highly correlated for both mouse tissues (r = 0.96 and 0.99) and for HeLa cell samples (r = 0.87) analyzed here (Figure S2; Table S2). Overall, these results indicate the effectiveness of MP3RNA-seq in quantifying transcript levels in Arabidopsis, maize, mouse and human samples. We conclude that this method should be generally applicable to most eukaryotic species, as their mature mRNAs carry a poly(A) tail.
Genotyping and gene expression profiling for DH lines by MP3RNA-seq
To test the power of our method on many samples, we performed MP3RNA-seq on a DH population with 477 lines derived from a cross between the maize inbred lines Chang7-2 and PHBA6. We harvested the stem under the shoot apical meristem at the elongation stage for all lines, extracted total RNA for all samples and processed them as described above for the MP3RNA-seq pipeline. Sequencing generated 1.33 billion reads, corresponding to an average of 2.8 million reads per line (Data Set S1). The raw reads were mapped to the maize B73 reference genome (v4) (Jiao et al., 2017). Only the uniquely mapped and non-duplicated reads were used to quantify gene expression. On average, 19,475 expressed genes were detected per line. We found that a total of 19,429 genes were expressed in at least 50% of the DH lines, almost all (19,312; 99.4%) of which were expressed in at least one of the two parents.
We then tested the utility of our MP3RNA-seq reads for genotyping. We detected 35,836 high-quality SNPs between the two parental lines, of which 85% (30,621) were within exons (Table S4). On average, we identified 19,907 SNPs per DH line, with a range of 5,549–34,093. An analysis of the genotype blocks in the DH genomes, as determined by MP3RNA-seq output, captured 8,109 crossover events, with an average of 1.7 per chromosome per DH line (Figure S3; Table S5). Next, a genetic linkage map was constructed, with a total genetic length of 857 cM (Table S5). The average distance between neighboring markers ranged from 0.11 cM to 8.85 cM, with a mean of 0.403 cM.
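As a quick consistency check on the crossover figure reported above, the per-chromosome average can be recovered from the total count, the number of DH lines, and the 10 chromosomes of the maize genome (the chromosome number is supplied here and is not stated in the text):

```latex
\frac{8109\ \text{crossovers}}{477\ \text{lines} \times 10\ \text{chromosomes}}
  = \frac{8109}{4770}
  = 1.7\ \text{crossovers per chromosome per line}
```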
eQTL mapping
Utilizing the gene expression profiles and genotyping information obtained with MP3RNA-seq, we performed eQTL mapping for 19,429 genes expressed in at least 50% of the DH lines. We identified 25,797 eQTLs for 15,335 genes (Figure S4), including 8,144 (31.57%) cis-eQTLs and 17,653 (68.43%) trans-eQTLs. We found that 4,247 (27.69%) genes with eQTLs were controlled by both cis- and trans-eQTLs, while another 3,897 (25.41%) were controlled only by cis-eQTLs and 7,191 (46.90%) only by trans-eQTLs (Figure 3A). Of the genes with eQTLs, about half (52.02%) had only a single eQTL affecting their expression levels (Table S6). The logarithm of odds (LOD) values for cis-eQTLs were globally significantly higher (Fisher's exact test, P < 2.2 × 10⁻¹⁶) than those of trans-eQTLs, indicating a stronger association between cis-elements and variation in gene expression (Figure 3B). In addition, the extent of expression variation for genes with trans-eQTLs decreased as the number of associated trans-eQTLs increased and was significantly lower (Fisher's exact test, P < 2.2 × 10⁻¹⁶) than that of genes with cis-eQTLs (Figure 3C). On average, each cis-eQTL and trans-eQTL accounted for about 28.4% and 6.9%, respectively, of the variation in gene expression across the DH population (Table S7). The expression levels of genes with trans-eQTLs increased as the number of associated trans-eQTLs increased and were significantly higher (Fisher's exact test, P < 2.2 × 10⁻¹⁶) than those of genes with cis-eQTLs (Figure 3D). These results suggest that cis-eQTLs and trans-eQTLs have distinct effects on both gene expression variation and gene expression level.

Characterization of cis- and trans-eQTLs (expression quantitative trait loci) and their effect on gene expression level and expression variation
(A) The distribution of genes regulated by cis- and/or trans-eQTLs. (B) Comparison of the logarithm of odds (LOD) values of cis- and trans-eQTLs. (C) The relationship between the variation of gene expression levels and the types of eQTLs, as well as the number of trans-eQTLs mapped for genes. CV represents the coefficient of variation. (D) The relationship between gene expression level and the types of eQTLs, as well as the number of trans-eQTLs mapped for genes.
We next investigated the genomic distribution of trans-eQTLs that might contain potential master regulators modulating the expression of a suite of downstream genes (Kliebenstein, 2009). We scanned the genome for trans-eQTL hotspots using a sliding window approach, with windows of 1 Mb and steps of 100 kb. In total, 117 significant (Fisher's exact test, P < 0.05) trans-eQTL hotspots were identified (Figure 4A; Data Set S2), which regulate the expression of 7,680 genes, accounting for about 67.1% of genes with identified trans-eQTLs. Twenty-seven trans-eQTL hotspots regulated over 100 genes each. Gene Ontology (GO) and pathway analysis revealed enrichment in specific functional categories or metabolic pathways for the target genes of 24 (21.4%) trans-eQTL hotspots (Data Sets S3, S4). For instance, six genes (Benzoxazinless (Bx) 1–5, Bx8) involved in DIMBOA-glucoside biosynthesis on chromosome 4 (Frey and Gierl, 1997; Figure 4B) form a co-regulated gene cluster (Wang et al., 2017). Here, our results determined that they were all regulated by trans-eQTL hotspot #84 on chromosome 7 (117.6–122.3 Mb). In addition, the higher expression levels of these six genes were all positively associated with the presence of the PHBA6 allele (Figure 4C), suggesting that they might be regulated by the same regulator on chromosome 7. The expression levels of the target genes of all 17,653 trans-eQTLs were roughly equally associated with each parental allele, with 52.6% of genes being positively correlated with the Chang7-2 allele and 47.4% of genes with the PHBA6 allele. 
By contrast, more than half (52.9%) of trans-eQTL hotspots displayed a significant (Fisher's exact test, P < 0.05) directional haplotype bias for the regulation of gene expression; that is, high expression of the target genes of a given trans-eQTL hotspot was preferentially associated with the haplotype of the same allele. In these hotspots, the higher expression levels of, on average, 65.2% of the target genes were positively associated with the haplotype of the same allele (Data Set S2). This directional haplotype bias was even more pronounced when analyzing genes with the same functional categories. Of 69 GO categories that were enriched in the target genes of hotspots, 97.0% (67) showed a significant (Fisher's exact test, P < 0.05) directional haplotype bias for the regulation of gene expression, in which the higher expression levels of, on average, 93.2% of the target genes were positively associated with the haplotype of the same allele (Data Set S3). These results highlight the role of trans-eQTL hotspots in modulating the cooperative expression of genes involved in the same functional pathways.
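The sliding-window scan used to locate trans-eQTL hotspots (1-Mb windows moved in 100-kb steps) can be sketched in Python as below. Only the window and step sizes come from the text; the function names are hypothetical, and the simple count threshold `min_count` is a stand-in for the significance test, whose exact construction is not described here.

```python
def sliding_window_counts(positions, chrom_length, window=1_000_000, step=100_000):
    """Count trans-eQTL positions in each half-open window [start, start + window).

    positions: base-pair coordinates of trans-eQTLs on one chromosome.
    Returns a list of (start, end, count), with end clipped to chrom_length.
    """
    positions = sorted(positions)
    results = []
    lo = hi = 0  # indices of the first position >= start and >= end
    start = 0
    while start < chrom_length:
        end = start + window
        while lo < len(positions) and positions[lo] < start:
            lo += 1
        while hi < len(positions) and positions[hi] < end:
            hi += 1
        results.append((start, min(end, chrom_length), hi - lo))
        start += step
    return results


def candidate_hotspots(window_counts, min_count):
    """Keep windows holding at least min_count trans-eQTLs (illustrative cutoff)."""
    return [(s, e, n) for s, e, n in window_counts if n >= min_count]
```

Because consecutive windows overlap by 900 kb, adjacent windows passing the cutoff would normally be merged into a single hotspot interval before counting hotspots.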

Genes regulated by the trans-eQTL (expression quantitative trait locus) hotspots
(A) The distribution of trans-eQTL hotspots. The y-axis represents the number of trans-eQTLs in each 1-Mb genomic window. The trans-eQTL hotspots are shown in red. (B) The pathway of DIMBOA-glucoside biosynthesis in maize. (C) Correlation between the expression of Benzoxazinless (Bx) genes and the genotype of the trans-eQTL hotspot #84 on chromosome 7. The boxplot shows the expression differences between Chang7-2 and PHBA6 alleles at each Bx gene. Asterisks represent significant differences in Bx gene expression levels at a threshold of P < 0.01 (Student's t-test).
We next tried to identify the putative master regulators in these trans-eQTL hotspots, mainly focusing on transcription factors (TFs). For the target genes of each trans-eQTL hotspot, a subset of genes belonging to the same GO term was selected to predict a potential conserved TF binding site (TFBS). We searched for sequence motifs of 5–12 bp in length located between –1 kb and +0.2 kb relative to the transcription start site of each gene. We predicted 863 motifs that were significantly similar (q < 0.05) to known motifs in the Arabidopsis TF database (JASPAR) (Data Set S5). We next asked which TFs within the trans-eQTL hotspots might bind to these putative TFBSs. The expression of master regulators at trans-eQTL hotspots is often regulated by cis-acting mechanisms (Albert and Kruglyak, 2015). Thus, only TFs that were controlled by cis-eQTLs within the trans-eQTL hotspots were further considered. Four TFs, the homologs of which in Arabidopsis had motifs that are significantly similar to the motifs predicted in the corresponding trans-eQTL hotspot, were considered as putative master regulators in these trans-eQTL hotspots (Data Set S6). For instance, the TF Dof2 (DNA-binding with one finger 2, Zm00001d005970), which is involved in carbon and nitrogen metabolism (Yanagisawa, 2000; Gupta et al., 2014), might be a master regulator for the target genes of trans-eQTL hotspot #19 on chromosome 2 (Data Set S6). Similarly, the MIKC-type MADS domain TF ZAG6 (Zea Mays AGAMOUS-LIKE 6; Zm00001d027425) might be a master regulator for trans-eQTL hotspot #2 on chromosome 1 (Data Set S6).
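The motif search region described above (from –1 kb to +0.2 kb relative to the transcription start site) depends on gene orientation, which the following minimal Python sketch makes explicit. The 1-based inclusive coordinate convention and the function name are assumptions for illustration; upstream is taken to mean smaller coordinates on the plus strand and larger coordinates on the minus strand.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return the 1-based inclusive (start, end) of the motif search region.

    tss: transcription start site coordinate on the chromosome.
    strand: '+' or '-'; the window is mirrored for minus-strand genes.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the chromosome start; clipping at the chromosome end is left to the caller.
    return max(1, start), end
```

For minus-strand genes, the extracted sequence would additionally need to be reverse-complemented before motif discovery.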
Quantitative trait locus mapping for PH, EH, and EH/PH
Understanding genetic control of PH and EH in maize is important due to their relationship with grain yield and lodging resistance. The phenotypic traits of PH, EH, and EH/PH differed largely between Chang7-2 and PHBA6 and followed a normal distribution in the DH population (Figure S5). We performed QTL mapping for PH, EH, and EH/PH in the DH lines using the genetic linkage map constructed with the SNPs detected by MP3RNA-seq. Twenty-one QTLs were detected with a significance threshold of LOD larger than five: six QTLs for PH, eight for EH, and seven for EH/PH (Figure 5; Table S8).

Mapping of quantitative trait loci (QTLs) controlling ear height (EH), plant height (PH), and EH/PH in the double haploid (DH) population
Curves in plot indicate the genetic map (x-axis) and logarithm of odds (LOD) score (y-axis) of detected QTLs. The horizontal broken line indicates the threshold used for defining QTLs.
All of the 21 QTLs were minor-effect QTLs, explaining 2–13.5% of the phenotypic variation, with a mean of 6.55%. Six chromosomal regions (Figure 5), ranging in size from 1.78 Mb to 12.94 Mb, contained QTLs for more than one trait, consistent with the close correlation among PH, EH, and EH/PH (Table S9). For instance, the intervals defined by qPH2, qEH2b and qEH/PH2b overlapped on chromosome 2 from 21.2 Mb to 29.8 Mb (Figure 5; Table S8). qEH2b had the largest effect on EH, explaining 10.9% of the phenotypic variation (Figure 5). The QTL with the largest effect on EH/PH (qEH/PH8) mapped between 120.9 Mb and 127.4 Mb on chromosome 8, explained 13.5% of the phenotypic variation, and overlapped with qEH8 (Figure 5; Table S8). In addition, one remarkable QTL (qEH/PH3), located between 182.9 Mb and 186.2 Mb on chromosome 3, explained about 13% of the phenotypic variation for EH/PH and overlapped with qEH3. For the PH trait, the most significant QTL (qPH7) mapped between 159.8 Mb and 166.6 Mb on chromosome 7 and explained 9.5% of the phenotypic variation. Of the seven QTLs identified for EH/PH, five (qEH/PH2a, 2b, 3, 4, 8) overlapped with QTLs for EH, but only one (qEH/PH2b) overlapped with a QTL for PH (Figure 5; Table S8). This result was in line with the stronger correlation between EH and EH/PH than between PH and EH/PH (Table S9).
DISCUSSION
We present here MP3RNA-seq, a high-throughput transcriptome profiling method that can process hundreds of samples in parallel. By pooling many samples from one experiment in a single tube immediately after first-strand cDNA synthesis and sequencing the 3′ ends of transcripts, MP3RNA-seq significantly reduces labor and cost for library construction and sequencing. The incorporation of UMIs in MP3RNA-seq provides a useful filter to remove PCR duplicates and therefore increases the accuracy of the measured transcript levels (Kivioja et al., 2012; Smith et al., 2017). MP3RNA-seq is inexpensive, with the cost for library construction and sequencing together being approximately one-tenth of the cost of typical RNA-seq. Assuming that RNA samples are available for processing, one individual can proceed through library construction for hundreds of samples within 1–2 d.
Some of the strategies used in MP3RNA-seq have also been adopted for a number of high-throughput single-cell RNA-seq (scRNA-seq) methods, which analyze the transcriptome of thousands of cells in a single experiment. These scRNA-seq methods, such as CEL-seq (Hashimshony et al., 2012), CytoSeq (Fan et al., 2015), droplet-based methods like Drop-seq (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017), and sci-RNA-seq (Cao et al., 2017), have all proven very powerful. However, there are several critical differences between MP3RNA-seq and scRNA-seq methods. For example, scRNA-seq employs a random barcoding strategy, which cannot associate an individual cell with a given barcode. For MP3RNA-seq, RNA extraction and reverse transcription are performed separately for each sample, keeping sample identity precisely known at all times and confidently associating each sample with its unique barcode combination. In addition, MP3RNA-seq is designed for profiling tissue or cell population samples, while scRNA-seq is focused on single-cell samples. Due to the extremely low amount of input material (one cell), current scRNA-seq methods are restricted by low capture efficiencies and high levels of technical noise (Liu and Trapnell, 2016). As reported in human and mouse, these technologies can positively detect only about 4,000–7,000 expressed genes in each cell (Macosko et al., 2015; Cao et al., 2017; Zheng et al., 2017). Here, we showed that MP3RNA-seq is highly sensitive for transcriptome profiling. We identified on average about 20,000 genes in maize and Arabidopsis tissues by MP3RNA-seq, numbers that were similar to those obtained via typical RNA-seq protocols. However, as with other 3ʹ end-focused gene expression profiling methods, MP3RNA-seq is limited in its ability to distinguish alternative splice forms due to its strong 3ʹ end bias.
The TranSeq method was recently described for large-scale transcriptome assays (Tzfadia et al., 2018). Like MP3RNA-seq, TranSeq was developed around early sample pooling and a 3′ end-focused strategy. In the case of TranSeq, poly(A) RNA molecules must first be fragmented by heating the samples before mRNAs are selected using oligo d(T) Dynabeads for the next step of reverse transcription. Each of these tedious processes must be performed on all samples individually, which increases the time, cost and effort of the overall procedure. In addition, the double-stranded adapter is attached by ligation in TranSeq (Tzfadia et al., 2018). By contrast, MP3RNA-seq uses total RNA for reverse transcription directly, without prior selection of poly(A) mRNA, which simplifies the protocol and reduces the overall cost for library construction. We also introduced the use of the tagmentation approach with the Tn5 transposase during MP3RNA-seq. The main advantage of the Tn5 transposase is its ability to fragment DNA and attach the sequencing adapter in a single step (Adey et al., 2010), maximizing RNA utilization for library construction. The Tn5 transposase does present some sequence bias during fragmentation that is slightly higher than that seen with mechanical methods (Adey et al., 2010). Nonetheless, this bias does not constitute an obvious hindrance to appropriate genome-wide coverage (Tang et al., 2009; Wang et al., 2013; Picelli et al., 2014). At present, Tn5 transposase is widely used for RNA-seq, including 3ʹ end-based scRNA-seq methods (Tang et al., 2009; Picelli et al., 2014; Cao et al., 2017; Hrdlickova et al., 2017).
Several high-throughput approaches have been developed to facilitate the discovery of tens of thousands of SNP markers suitable for various purposes (Davey et al., 2011). Low-coverage whole-genome sequencing is suitable to detect SNPs for species with small genomes and has been used to construct genetic maps (Huang et al., 2009, 2010; Xie et al., 2010). Reduced representation sequencing methods, including reduced representation libraries (RRLs; Altshuler et al., 2000), complexity reduction of polymorphic sequences (CRoPS; Van et al., 2007), restriction site associated DNA sequencing (RAD-seq; Baird et al., 2008), multiplexed shotgun genotyping (MSG; Andolfatto et al., 2011) and GBS (Elshire et al., 2011), were designed with a focus on a particular portion of the genome. Traditional RNA-seq has also been applied for genotyping and is advantageous for detecting functional SNPs, since the sequence space is restricted to coding regions (Chepelev et al., 2009; Fu et al., 2013), but is not widely used because it is expensive and labor-intensive. The MP3RNA-seq method described here provides another effective approach for genotyping, the cost of which is similar to that of the restriction enzyme-based methods, such as GBS (Elshire et al., 2011). The genomic distributions of SNPs identified by MP3RNA-seq and RNA-seq are highly similar (r = 0.86; Figure S6). In addition, we observed a very strong overlap (over 90%) between the genes with SNPs within their 3ʹ end regions identified by MP3RNA-seq and traditional RNA-seq methods (Figure S7). Overall, MP3RNA-seq is efficient for the identification of SNPs because it mostly sequences the 3ʹ untranslated regions of genes, which have a high SNP density in most genomes (Rafalski, 2002; 2003; Andreassen et al., 2010).
Based on genotype information obtained with MP3RNA-seq, we performed QTL mapping; identified 21 QTLs for PH, EH, and EH/PH; and determined that 14 overlapped with previously identified QTLs (Table S10; Sibov et al., 2003; Yan et al., 2010; Tang et al., 2013; Park et al., 2014; Ku et al., 2015; Wang et al., 2018), demonstrating the power of MP3RNA-seq for genotyping. In addition, the estimation of transcript levels by MP3RNA-seq should be a useful source of biological information, such as for eQTL analysis as illustrated here. Transcriptome profiling may also prove helpful to narrow down the number of candidate genes during QTL fine-mapping, especially with relatively small confidence intervals.
In conclusion, MP3RNA-seq is highly reproducible, accurate, and sensitive for the quantification of gene expression and is effective for the identification of nucleotide polymorphisms between processed samples. This easy, fast, and economical method should be broadly applicable to high-throughput gene expression profiling and genotyping analysis of eukaryotic species.
MATERIALS AND METHODS
Materials and phenotyping
Seeds of Arabidopsis thaliana Columbia-0 (Col-0) were given a 3-d cold treatment at 4°C, transferred to soil, and grown in a growth chamber at 22°C under a 16 h/8 h (day/night) cycle. After 3 weeks, the stem tissue was harvested, frozen immediately in liquid nitrogen, and stored at −80°C before processing. Heart and liver tissues were collected from mice obtained from the Laboratory Animal Center of the Institute of Genetics and Developmental Biology, Beijing, China. Human cervical (HeLa) cells were cultured in Dulbecco's modified Eagle's medium (DMEM) supplemented with 10% fetal bovine serum (FBS), 100 U/mL penicillin, and 100 U/mL streptomycin at 37°C in a 5% CO2 incubator. Animal experiments were performed according to the guidelines and regulatory standards of the Institutional Animal Care and Use Committee of China Agricultural University.
A maize DH population consisting of 477 lines was derived from the hybrid between inbred lines Chang7-2 and PHBA6. The DH population and the two parents were grown in the field in 2016 at the experimental farm of China Agricultural University in Beijing, China, in a randomized complete block design with two replications. The stem below the shoot apical meristem at the elongation stage was collected from each DH line and the two parents by manual dissection. Leaf tissues of the two parents were also collected. Collected samples were frozen immediately in liquid nitrogen and stored at −80°C before processing. Each sample was obtained from at least three plants. PH and EH were measured as the distance from the soil surface to the top of the main tassel spike and to the node of the uppermost ear, respectively. For each DH line, PH and EH were recorded as the average of five plants. Relative ear height was calculated as the ratio of EH to PH.
MP3RNA-seq and RNA-seq library construction and sequencing
The MP3RNA-seq libraries were constructed with the following steps. Total RNA was extracted from each sample using TRIzol reagent (Invitrogen) according to the manufacturer's instructions and quantified with a NanoDrop 2000 Spectrophotometer. Approximately 300 ng of total RNA per sample was dispensed into 96-well plates for MP3RNA-seq library preparation. First, RQ1 RNase-Free DNase (Promega) was used to digest potential DNA contamination, and the digestion was stopped with RQ1 DNase Stop Solution (Promega). Next, 10 μmol/L reverse transcription primer, comprising a “V” base (V = A, C, or G), anchored oligo d(T), UMIs, the first barcode, and the 3ʹ Illumina sequencing adapter, was added, and the mixture was incubated at 70°C for 5 min. The “V” base was placed after the oligo d(T) to help the reverse transcription primer anchor at the start of the poly(A) tail, immediately after the 3ʹ end of the transcript. The RNA was then reverse transcribed into first-strand cDNA using Superscript III (Invitrogen), Superscript III buffer, deoxynucleoside triphosphates (dNTPs), dithiothreitol, and RNasin Plus RNase Inhibitor (Promega) at final concentrations of 5 u/μL, 1×, 0.5 mmol/L, 5 mmol/L, and 0.5 u/μL, respectively. The reverse transcription reaction was performed in a thermal cycler with the following program: 25°C for 10 min, 50°C for 50 min, and 70°C for 15 min, followed by a hold at 4°C. After reverse transcription, one-quarter of the first-strand cDNA from each of the samples carrying different first barcodes was pooled, treated with RNase A, and then cleaned up with VAHTS DNA Clean Beads (Vazyme). The remaining three-quarters of each sample was stored as a backup. We preferred to pool 48 or 96 samples, as these numbers are easy to handle in 96-well plates. Next, the second strand of cDNA was synthesized using DNA Polymerase I (NEB), RNase H (NEB), blue buffer (NEB buffer 2), and dNTPs at final concentrations of 1 u/μL, 0.1 u/μL, 1×, and 0.5 mmol/L, respectively.
The resultant cDNA was cleaned up with VAHTS DNA Clean Beads (Vazyme) and quantified with a Qubit fluorometer. Then, 50 ng of cDNA was used for tagmentation with 5 μL of TTE Mix V50, a commercial Tn5 reagent (containing the Illumina sequencing adapter) supplied in the TruePrep™ DNA Library Prep Kit V2 for Illumina (Vazyme, TD501). The reaction was performed at 55°C for 10 min. The cDNA used for tagmentation should be quantified accurately, as a precise cDNA/Tn5 transposase ratio is crucial for obtaining the optimal size of the captured 3ʹ end. The 3ʹ ends of the cDNA fragments were then enriched by PCR in a thermal cycler with the following program: 72°C for 3 min; 98°C for 30 s; 12 cycles of 98°C for 15 s, 60°C for 30 s, and 72°C for 3 min; 72°C for 5 min; hold at 4°C. The following primers were used for PCR amplification: common primer (5ʹ-AATGATACGGCGACCACCGAGATCTACACTCGTCGGCAGCGTCAG-3ʹ) and index primer (5ʹ-CAAGCAGAAGACGGCATACGAGATxxxxxxGTGACTGGAGTTCAGACGTGTGC-3ʹ, where “xxxxxx” represents the second barcode matching the six-base index for each library). Next, size selection of the amplified product was performed with VAHTS DNA Clean Beads (Vazyme) according to the protocol described in the TruePrep™ DNA Library Prep Kit V2 for Illumina (Vazyme, TD501). The volumes of VAHTS DNA Clean Beads used for the first and second rounds of size selection were 0.6× and 0.15×, respectively. A captured 3ʹ end length of 200 bp to 600 bp is optimal for MP3RNA-seq. Finally, the MP3RNA-seq libraries were sequenced to generate 150-nucleotide paired-end reads on an Illumina X Ten platform.
The RNA-seq libraries were prepared according to the instructions of the Illumina Standard mRNA-seq library preparation kit (Illumina) and were sequenced to generate 150-nucleotide paired-end reads on an Illumina X Ten platform.
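As a small illustration of how the index primer given above is completed for each library, the sketch below substitutes a six-base second barcode for the “xxxxxx” placeholder. The function name and the example barcode are hypothetical and are not taken from the protocol; the sketch inserts the bases exactly as supplied and makes no assumption about how the index is oriented in the index read.

```python
# Index primer template as written in the protocol; "xxxxxx" marks the second barcode.
INDEX_PRIMER_TEMPLATE = "CAAGCAGAAGACGGCATACGAGATxxxxxxGTGACTGGAGTTCAGACGTGTGC"


def build_index_primer(second_barcode: str) -> str:
    """Return the index primer with the placeholder replaced by a six-base barcode.

    Hypothetical helper for illustration only.
    """
    barcode = second_barcode.upper()
    if len(barcode) != 6 or set(barcode) - set("ACGT"):
        raise ValueError("second barcode must be exactly six bases of A, C, G, or T")
    return INDEX_PRIMER_TEMPLATE.replace("xxxxxx", barcode)


# Illustrative barcode, not one used in the study.
primer = build_index_primer("acgtac")
print(primer)
```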
MP3RNA-seq and RNA-seq data processing and gene expression quantification
The raw MP3RNA-seq sequencing reads were first separated into libraries based on the 6 bp index read (second barcode) and then further separated into samples based on the 6 bp first barcode sequences extracted from Read 2. Read 1 sequences of the maize, Arabidopsis, mouse, and human samples were aligned to the maize B73 reference genome (RefGen_v4, Zea_mays.AGPv4), the Arabidopsis reference genome (TAIR10, TAIR10_GFF3_genes), the mouse reference genome (mm9, mm9_RefSeq_Genes), and the human reference genome (hg19, hg19_RefGene), respectively, using the Hisat2 (v2.0.4) software (Kim et al., 2015) with default parameters. Duplicated reads generated during PCR amplification were removed based on the 6 bp UMI sequences extracted from Read 2. The counts of uniquely mapped, non-PCR-duplicated reads for each gene were calculated with the HTseq software (https://htseq.readthedocs.io/en/master/) and then corrected with the ComBat-seq software (Zhang et al., 2020) to eliminate potential noise introduced by the known confounder of sequencing batch and by unknown confounders (12 principal components with no genetic associations were used). The corrected counts were then used for gene expression quantification. As only the 3ʹ ends of transcripts were sequenced by MP3RNA-seq, we used TPM to measure gene expression levels for MP3RNA-seq data. To reduce the influence of transcriptional noise, a gene was considered expressed if its TPM value was greater than one.
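The sample assignment and UMI-based duplicate removal described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes that Read 2 begins with the 6 bp first barcode followed directly by the 6 bp UMI (a layout inferred from the primer design, not stated in the text), and that a PCR duplicate is a read sharing sample, mapping position, strand, and UMI with an earlier read. The function names, barcodes, and sample names are hypothetical.

```python
from collections import defaultdict

BARCODE_LEN = 6  # first barcode length given in the text
UMI_LEN = 6      # UMI length given in the text


def split_read2(read2_seq):
    """Return (barcode, umi); ASSUMES the barcode precedes the UMI at the start of Read 2."""
    barcode = read2_seq[:BARCODE_LEN]
    umi = read2_seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


def demultiplex_and_dedup(records, barcode_to_sample):
    """Assign reads to samples and drop PCR duplicates.

    records: iterable of (read2_seq, chrom, pos, strand), where chrom/pos/strand
    describe the unique alignment of the paired Read 1.
    Returns {sample: [(chrom, pos, strand, umi), ...]} with one entry per kept read.
    """
    seen = defaultdict(set)
    kept = defaultdict(list)
    for read2_seq, chrom, pos, strand in records:
        barcode, umi = split_read2(read2_seq)
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            continue  # barcode does not match any known sample
        key = (chrom, pos, strand, umi)
        if key in seen[sample]:
            continue  # same position and UMI already seen: treat as a PCR duplicate
        seen[sample].add(key)
        kept[sample].append(key)
    return dict(kept)


# Toy records with made-up barcodes and UMIs.
records = [
    ("AACCGG" + "TTTAAA" + "TTTTTTTT", "1", 1000, "+"),
    ("AACCGG" + "TTTAAA" + "TTTTTTTT", "1", 1000, "+"),  # duplicate of the first read
    ("AACCGG" + "GGGCCC" + "TTTTTTTT", "1", 1000, "+"),  # same position, different UMI
    ("GGTTAA" + "TTTAAA" + "TTTTTTTT", "1", 1000, "+"),  # different sample
    ("NNNNNN" + "TTTAAA" + "TTTTTTTT", "1", 1000, "+"),  # unknown barcode
]
samples = {"AACCGG": "DH_001", "GGTTAA": "DH_002"}
result = demultiplex_and_dedup(records, samples)
```

Keying duplicates on position plus UMI, rather than on the UMI alone, is a common choice because a 6 bp UMI offers only 4,096 distinct sequences.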
The raw sequencing reads of RNA-seq data of maize and Arabidopsis samples were aligned to the maize B73 reference genome (RefGen_v4) and the Arabidopsis reference genome (TAIR10), respectively, using the Hisat2 (v2.0.4) software (Kim et al., 2015) with default parameters. Here we used the FPKM to measure gene expression level for RNA-seq data. Only the uniquely aligned reads were used to calculate the FPKM values for each gene with Cufflinks (v2.2.0) (Trapnell et al., 2012). To reduce the influence of transcriptional noise, a given gene was determined to be expressed if its FPKM value was greater than one.
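The difference between the two expression measures used in this section can be made concrete with a short sketch. It assumes that TPM for the 3ʹ end data is computed without gene-length normalization (read counts scaled to one million), which is one reading of the rationale given above, and it uses the textbook FPKM formula rather than the model-based estimate that Cufflinks actually reports. All function names and numbers are illustrative.

```python
def tpm_3prime(counts):
    """Counts scaled to one million; ASSUMES no length normalization for 3' end reads."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}


def fpkm(fragments, lengths_bp):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    total = sum(fragments.values())
    return {
        gene: n * 1e9 / (lengths_bp[gene] * total)
        for gene, n in fragments.items()
    }


def expressed(values, threshold=1.0):
    """Genes whose expression value is strictly greater than the threshold."""
    return {gene for gene, value in values.items() if value > threshold}


# Toy data: two genes with equal fragment counts but different lengths.
tpm_values = tpm_3prime({"geneA": 750, "geneB": 250})
fpkm_values = fpkm({"geneA": 500, "geneB": 500}, {"geneA": 1000, "geneB": 2000})
print(tpm_values, fpkm_values, expressed(tpm_values))
```

With full-length RNA-seq, a longer transcript yields more fragments at the same abundance, so FPKM divides by length; with 3ʹ end reads each transcript contributes roughly one tag regardless of length, which is why length normalization is omitted in this sketch.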
Single nucleotide polymorphism calling and genotyping
Only the uniquely mapped and non-PCR duplicated reads were used for SNP calling with SAMtools (v0.1.16) and BCFtools (v0.1.16) (Li et al., 2009). First, RNA-seq and MP3RNA-seq data were combined for each of the two parental lines, Chang7-2 and PHBA6, to call SNPs. A total of 35,836 original SNPs were identified between Chang7-2 and PHBA6 with a read depth threshold of ≥5 for both inbred lines. In each DH line, these original SNPs were retained at a read depth threshold of ≥2. Next, SNPs with a distorted segregation ratio greater than 2:1 (chi-square test, P < 1.0E-7) or a heterozygous ratio greater than 15% were discarded. DH lines with a heterozygous ratio greater than 15% were also discarded. Finally, 28,875 SNPs in 436 DH lines were retained for genetic map construction. Missing and heterozygous SNPs were imputed with the R/qtl (v1.42-8) software (Broman et al., 2003) using the argmax method. The genetic map was constructed with the est.map function of R/qtl (v1.42-8; Broman et al., 2003).
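The SNP filter described above can be sketched as follows. This is one possible reading of the criteria, not the authors' script: segregation is tested against the 1:1 ratio expected in a DH population with a one-degree-of-freedom chi-square test, and a SNP is treated as distorted only when the allele ratio exceeds 2:1 and the test gives P < 1.0E-7. The function names and the genotype counts are hypothetical.

```python
import math


def chi_square_p_1df(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(n_parent1, n_parent2, n_het,
             max_ratio=2.0, p_cutoff=1.0e-7, max_het=0.15):
    """Return True if a SNP passes the heterozygosity and segregation filters.

    n_parent1 / n_parent2: DH lines homozygous for each parental allele.
    n_het: DH lines called heterozygous.
    ASSUMPTION: distortion requires both ratio > max_ratio and P < p_cutoff.
    """
    called = n_parent1 + n_parent2 + n_het
    homozygous = n_parent1 + n_parent2
    if called == 0 or homozygous == 0:
        return False  # nothing to evaluate
    if n_het / called > max_het:
        return False  # too many heterozygous calls for a DH population
    # Chi-square statistic against the expected 1:1 segregation.
    chi2 = (n_parent1 - n_parent2) ** 2 / homozygous
    major, minor = max(n_parent1, n_parent2), min(n_parent1, n_parent2)
    ratio = math.inf if minor == 0 else major / minor
    distorted = ratio > max_ratio and chi_square_p_1df(chi2) < p_cutoff
    return not distorted


# Toy genotype counts across DH lines.
balanced = keep_snp(220, 210, 6)
skewed = keep_snp(350, 80, 6)
heterozygous = keep_snp(200, 150, 100)
```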
Quantitative trait locus mapping
Quantitative trait locus mapping for gene expression level variation (eQTL mapping) was carried out using the composite interval mapping (CIM) method in R/qtl (v1.42-8; Broman et al., 2003). A total of 19,429 genes that were expressed (TPM greater than one) in at least 50% of the DH lines were used for eQTL mapping. The cutoff for declaring an eQTL was determined by a permutation test. For each permutation, 1,000 genes were randomly selected and used for eQTL mapping. The permutation was repeated 100 times, yielding the eQTL LOD cutoff (LOD = 3.06, P = 0.05). The confidence interval of each eQTL was defined by the 1.5 LOD drop method. An eQTL was defined as a cis-eQTL if its interval contained the target gene; otherwise, it was defined as a trans-eQTL.
The CIM method in R/qtl (v1.42-8; Broman et al., 2003) was also used for QTL mapping of PH, EH, and EH/PH. The Best Linear Unbiased Prediction (BLUP) was calculated for each phenotype across trials with the lme4 package (Bates et al., 2014) and used for subsequent analysis. Quantitative trait loci were considered significant at a LOD threshold of five, and the confidence intervals were estimated using the 1.5 LOD drop method.
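The 1.5 LOD drop interval and the cis/trans classification used in this section can be illustrated with a simplified sketch. It is not R/qtl's implementation: the interval is taken as the contiguous run of positions around the peak whose LOD stays within 1.5 of the maximum, and positions are treated as physical coordinates so that they can be compared directly with gene coordinates. Function names and the LOD profile are hypothetical.

```python
def lod_drop_interval(positions, lods, drop=1.5):
    """Return (left, peak, right) positions for a LOD-drop support interval.

    positions must be sorted along one chromosome and aligned with lods.
    Simplified: contiguous region around the peak with LOD >= peak LOD - drop.
    """
    peak_i = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak_i] - drop
    left = peak_i
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak_i
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[peak_i], positions[right]


def classify_eqtl(eqtl_chrom, interval_start, interval_end,
                  gene_chrom, gene_start, gene_end):
    """Label an eQTL 'cis' if its interval contains the target gene, else 'trans'."""
    contains = (
        eqtl_chrom == gene_chrom
        and interval_start <= gene_start
        and gene_end <= interval_end
    )
    return "cis" if contains else "trans"


# Toy LOD profile along one chromosome (positions in Mb).
positions = [10, 20, 30, 40, 50, 60]
lods = [1.0, 3.2, 4.8, 5.5, 4.2, 2.0]
left, peak, right = lod_drop_interval(positions, lods)

same_chrom = classify_eqtl("3", left, right, "3", 35.0, 35.1)
other_chrom = classify_eqtl("3", left, right, "5", 35.0, 35.1)
```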
trans-eQTL hotspot identification and functional enrichment analysis
We used a sliding-window approach with 1 Mb windows and 100 kb steps to identify trans-eQTL hotspots across the whole genome. The threshold for trans-eQTL hotspots was determined by a permutation test in which trans-eQTLs were randomly distributed along the genome. For each permutation, all trans-eQTLs were randomly assigned to windows across the whole genome, and the maximum number of eQTLs in a 1 Mb window was recorded. The permutation was repeated 1,000 times. Based on the distribution of these maxima, we obtained a final cutoff of 21 eQTLs per 1 Mb at a significance threshold of P = 0.05. Each 1 Mb window in which the number of trans-eQTLs exceeded this cutoff was recorded. We further took gene density into account: windows in which the number of trans-eQTLs was less than 1.2 times the number of genes in the window were discarded. The remaining windows that were adjacent to or overlapped with each other were merged. Finally, we identified 117 significant trans-eQTL hotspots. GO enrichment analysis of the target genes of each trans-eQTL hotspot was performed using AgriGO (v2.0) (Du et al., 2010). Enrichment analysis of the target genes of each trans-eQTL hotspot was also performed against the MaizeCyc database (Jaiswal, 2011) using the hypergeometric test.
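The hotspot procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the genome is treated as a single linear coordinate space, permuted trans-eQTL positions are drawn uniformly at random, and the cutoff is taken as the upper 5% quantile of the per-permutation maximum window count. All function names and toy values are hypothetical.

```python
import math
import random

WINDOW = 1_000_000  # 1 Mb window, as in the text
STEP = 100_000      # 100 kb step, as in the text


def window_counts(positions, chrom_length, window=WINDOW, step=STEP):
    """Count positions falling in each half-open window [start, start + window)."""
    last_start = max(chrom_length - window, 0)
    return [
        (start, sum(start <= p < start + window for p in positions))
        for start in range(0, last_start + 1, step)
    ]


def permutation_cutoff(n_eqtl, genome_length, n_perm=1000, alpha=0.05, seed=0):
    """Upper (1 - alpha) quantile of the maximum window count over permutations."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = [rng.randrange(genome_length) for _ in range(n_eqtl)]
        maxima.append(max(c for _, c in window_counts(shuffled, genome_length)))
    maxima.sort()
    index = min(math.ceil((1 - alpha) * n_perm), n_perm) - 1
    return maxima[index]


def is_hot(n_trans_eqtl, n_genes, cutoff, density_factor=1.2):
    """Window passes if it exceeds the cutoff and is not explained by gene density."""
    return n_trans_eqtl > cutoff and n_trans_eqtl >= density_factor * n_genes


def merge_windows(starts, window=WINDOW):
    """Merge windows that are adjacent to or overlap each other."""
    merged = []
    for start in sorted(starts):
        end = start + window
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy example on a 3 Mb chromosome and a small permutation run.
toy_positions = [50_000, 150_000, 950_000, 1_050_000, 2_500_000]
counts = window_counts(toy_positions, 3_000_000)
toy_cutoff = permutation_cutoff(n_eqtl=200, genome_length=10_000_000, n_perm=50)
hotspots = merge_windows([0, 100_000, 500_000, 3_000_000])
```

Recording the maximum count per permutation, rather than all window counts, controls the genome-wide false-positive rate across the many overlapping windows tested.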
Identification of cis-motifs
The MEME program (v5.0.2) (Bailey et al., 2009) was used to identify overrepresented cis-motifs among the target genes belonging to the same GO term for each trans-eQTL hotspot, with the following parameters: -dna -revcomp -nmotifs 10 -minw 5 -maxw 12. Following a previous report (Yu et al., 2016), the promoter regions from 1 kb upstream to 200 bp downstream of the transcription start sites were used for cis-motif identification. The cis-motifs occurring in more than 50% of the promoters were recorded. We then selected the top 10 cis-motifs for each gene set and compared them with the Arabidopsis TF database (JASPAR, http://jaspar.genereg.net/search?q=%26collection=CORE%26tax_group=plants) using the TOMTOM software (Bailey et al., 2009) with a threshold of P = 0.05 to determine whether the cis-motifs were conserved in Arabidopsis.
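The promoter window and the MEME settings given above can be sketched as follows. The coordinate helper assumes 1-based inclusive coordinates and a strand-aware definition of upstream, neither of which is spelled out in the text; the command builder only assembles the argument list with the stated options and does not run MEME. The function names and the FASTA file name are hypothetical.

```python
def promoter_interval(tss, strand, upstream=1000, downstream=200):
    """Return (start, end) of the promoter window around a transcription start site.

    ASSUMES 1-based inclusive coordinates; upstream is relative to the gene's strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 1), end  # clip at the chromosome start


def meme_command(fasta_path):
    """Build a MEME argument list using the parameters stated in the text."""
    return [
        "meme", fasta_path,
        "-dna", "-revcomp",
        "-nmotifs", "10",
        "-minw", "5",
        "-maxw", "12",
    ]


def frequent_motifs(motif_to_promoters, n_promoters, min_fraction=0.5):
    """Motifs found in more than min_fraction of the promoters."""
    return {
        motif for motif, hits in motif_to_promoters.items()
        if len(set(hits)) / n_promoters > min_fraction
    }


# Toy usage with made-up coordinates and a hypothetical input file.
plus_window = promoter_interval(5000, "+")
minus_window = promoter_interval(5000, "-")
command = meme_command("promoters.fa")
common = frequent_motifs({"m1": ["g1", "g2", "g3"], "m2": ["g1"]}, n_promoters=4)
```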
Accession numbers
The data sets generated in this study can be found in the National Center for Biotechnology Information Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession number SRP149505.
ACKNOWLEDGEMENTS
This work was supported by grants from the National Natural Science Foundation of China (31421005), the National Key Research and Development Program (2016YFD0100404, 2016YFD0101804), the National Postdoctoral Program for Innovative Talents (BX201600149), and the China Postdoctoral Science Foundation (2017M611049).
CONFLICT OF INTEREST
The authors declare no conflict of interest.
AUTHOR CONTRIBUTIONS
J.L. and J.C. designed the experiments. J.C., F.Y., X.G., W.S., and H.Z. performed the experiments. J.C. and X.Z. analyzed the data. J.C. and J.L. wrote the manuscript. All authors read and approved its content.