SPECIAL ISSUE ARTICLE


Full-length transcriptome sequencing reveals extreme incomplete annotation of the goat genome

Huanhuan Zhang
Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China

Yilin Liang
Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China

Shaomei Chen
Institute of Animal Husbandry, Guangxi Vocational University of Agriculture, Nanning, China

Zeyi Xuan
Institute of Animal Husbandry, Guangxi Vocational University of Agriculture, Nanning, China

Yu Jiang
Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China

Ran Li (Corresponding Author)
[email protected]
orcid.org/0000-0002-8584-4100
Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China

Yanhong Cao (Corresponding Author)
[email protected]
Institute of Animal Husbandry, Guangxi Vocational University of Agriculture, Nanning, China

Correspondence

Ran Li, Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi 712100, China.
Email: [email protected]

Yanhong Cao, Institute of Animal Husbandry, Guangxi Vocational University of Agriculture, Nanning, Guangxi 530000, China.
Email: [email protected]


First published: 27 February 2023

https://doi.org/10.1111/age.13311

Citations: 1


Abstract

Despite recent advances in generating high-quality reference genome assemblies, the genome sequences for most livestock species, including goats, are still poorly annotated. Single-molecule long-read sequencing has greatly facilitated gene annotation by obtaining full-length transcripts. In this study, we generated full-length transcriptome data for samples from abomasum (n = 2) and testicle (n = 1), using PacBio Iso-Seq technology. We further combined these data with published data from abomasum (5ZY, SRR8618141) to evaluate and improve the gene annotation of the goat genome. Across the four Iso-Seq datasets, 14.5–16.3% of the genes identified per sample were novel. At the transcript level, 40.6% of transcripts were novel, including 29.7% novel transcripts from known genes and 10.9% from novel genes. We further verified the expression of novel genes in four additional RNA-seq datasets and found that the expression level of novel genes was significantly lower than that of known genes, indicating that lowly expressed genes tend to be missed in the current genome annotation. This study shows the superiority of full-length transcriptome data in gene annotation, and more such data are required to improve the gene annotation of the goat genome and those of other species.

The goat, with a long history of domestication, is economically important to humans as a source of meat, milk and fur (Boyazoglu et al., 2005). A high-quality goat reference genome was assembled in 2017 using single-molecule sequencing, which provides a good starting point for genome-level molecular breeding research and applications in this species (Bickhart et al., 2017). The latest Ensembl release (108.1) annotates 21 361 genes with 41 794 transcripts in the goat genome, compared with 19 813 coding genes with 252 477 transcripts in the human genome (Ensembl release 108.38). Given that the goat and human genomes are both roughly 3 Gb in size, the much lower number of annotated transcripts in goat (about 2 per gene, versus about 13 per coding gene in human) suggests extreme incompleteness in the gene annotation. To investigate the genetic basis and molecular regulatory mechanisms of economically important goat traits, complete and accurate gene annotation is required in addition to high-quality genome assembly sequences.
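The annotation gap described here comes down to simple arithmetic on the Ensembl counts quoted in the text. A minimal sketch in Python follows; the dictionary layout and function name are illustrative and not part of the original analysis, and the human figures refer to coding genes only, as stated in the text.

```python
# Gene and transcript counts quoted in the text (Ensembl release 108).
annotations = {
    "goat": {"genes": 21_361, "transcripts": 41_794},
    "human": {"genes": 19_813, "transcripts": 252_477},  # coding genes only
}


def transcripts_per_gene(counts):
    """Return the mean number of annotated transcripts per annotated gene."""
    return counts["transcripts"] / counts["genes"]


ratios = {species: transcripts_per_gene(c) for species, c in annotations.items()}

for species, ratio in ratios.items():
    print(f"{species}: {ratio:.2f} annotated transcripts per gene")
```

The ratio is only a rough indicator of annotation depth, since the two gene sets are not defined in exactly the same way.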

The transcriptome is direct evidence for gene annotation, but traditional RNA-seq technology based on short-read lengths (within 150 bp) cannot directly obtain full-length transcripts and alternative splicing information. The emergence of full-length transcriptome sequencing, represented by both the PacBio and Nanopore platforms, overcomes the shortcomings of RNA-seq and allows direct access to the full-length information of transcripts (Nudelman et al., 2018). The full-length transcriptome provides an important way to improve the quality of genome annotation, which has been studied in a variety of plant and animal species (Ali et al., 2021; Feng et al., 2019; Zhang et al., 2022). Recent studies have shown that a full-length transcriptome enables the identification of novel genes and lncRNAs, the identification of full-length splice isoforms and the detection of novel alternative splicing (AS) and alternative polyadenylation events (Grabherr et al., 2011; He et al., 2022).

In order to explore the extent to which a full-length transcriptome can improve the annotation of the goat genome, we analyzed PacBio Iso-Seq data from three abomasum and one testis sample, of which three were generated in this study. By comparing with the reference genome annotation, a large number of novel genes and novel AS events were identified, necessitating the application of a full-length transcriptome for goat genome annotation.

We performed PacBio Iso-Seq for two abomasum samples (7ZY, 8ZY) and one testicle sample (GW) from Chinese Cashmere goats. One additional abomasum dataset (5ZY) was downloaded from the NCBI SRA database (SRR8618141) (Zheng et al., 2020). Total RNA from each sample was extracted using Trizol reagent. PacBio SMRTbell libraries were prepared and sequenced on two separate SMRT cells using the PacBio Sequel platform (Annoroad Gene Technology Co. Ltd). A total of 51 010 175 subreads were generated, with an average length of 1835 nt.

High-quality circular consensus sequences (CCSs) were obtained using the Isoseq3 application in the pacbio smrt analysis v6.0.0 software (https://github.com/PacificBiosciences/IsoSeq).

The CCSs of the three abomasum samples showed a length peak near 2200 bp, while those of the testicle sample showed a peak near 1600 bp. The testicle sample generated much shorter CCSs than the abomasum samples, probably owing to technical bias; however, this sample still had a similar number of CCSs above 2000 bp to the abomasum samples (Table 1, Figure 1a). Consequently, the number and average length of assembled transcripts from the testicle sample were also similar to those of the three abomasum samples. We classified the CCSs as full-length (FL) or non-full-length (nFL) transcript sequences using the isoseq3 software, based on whether they contained both the 5′ and 3′ cDNA primers and a poly(A) tail. To improve the PacBio sequencing accuracy, we retained only full-length reads. We then clustered the sequences at the isoform level using isoseq3, and the clustered transcripts were assembled into complete consensus sequences using the polish module of the software.
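The FL/nFL classification can be illustrated with a toy sketch. It assumes the usual Iso-Seq convention that a full-length read carries the 5′ primer, a poly(A) tail and the 3′ primer; the primer sequences, the minimum tail length and the function name below are hypothetical placeholders, and the real isoseq3 workflow additionally handles read orientation, primer mismatches and concatemers.

```python
# Hypothetical placeholder primers for illustration only; the actual primer
# sequences depend on the library kit and are not given in the text.
PRIMER_5P = "GGTTCCAA"
PRIMER_3P = "TTGGCCAA"
MIN_POLY_A = 8  # assumed minimum poly(A) tail length


def is_full_length(read, min_poly_a=MIN_POLY_A):
    """Return True if a sense-oriented read has both primers and a poly(A) tail."""
    if not (read.startswith(PRIMER_5P) and read.endswith(PRIMER_3P)):
        return False
    # Strip both primers and check that the remaining insert ends in poly(A).
    insert = read[len(PRIMER_5P):len(read) - len(PRIMER_3P)]
    return insert.endswith("A" * min_poly_a)


fl_read = PRIMER_5P + "CGTTGCAGT" + "A" * 10 + PRIMER_3P
no_5p_read = "CGTTGCAGT" + "A" * 10 + PRIMER_3P
no_tail_read = PRIMER_5P + "CGTTGCAGT" + PRIMER_3P

for name, read in [("fl", fl_read), ("no_5p", no_5p_read), ("no_tail", no_tail_read)]:
    print(name, "FL" if is_full_length(read) else "nFL")
```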

TABLE 1. Summary of detected genes and transcripts in each tissue.

Sample ID	Tissue	Number of CCSs	Number of detected genes	Number of novel genes	Number of detected transcripts	Average transcript length (bp)	Average number of exons per transcript
5ZY	Abomasum	26 073	9565	1446	20 782	2488.1	9.1
7ZY	Abomasum	24 482	10 378	1560	21 962	2217.6	8.1
8ZY	Abomasum	24 877	10 526	1524	21 198	2247.3	8.7
GW	Testicle	37 716	10 404	1697	22 458	2300.6	8.5
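The per-sample proportions discussed in the text can be recomputed directly from Table 1. The sketch below copies the gene and transcript counts from the table; the variable and function names are illustrative.

```python
# Rows copied from Table 1:
# (sample, tissue, detected genes, novel genes, detected transcripts)
table1 = [
    ("5ZY", "Abomasum", 9565, 1446, 20782),
    ("7ZY", "Abomasum", 10378, 1560, 21962),
    ("8ZY", "Abomasum", 10526, 1524, 21198),
    ("GW", "Testicle", 10404, 1697, 22458),
]


def summarize(rows):
    """Compute the novel-gene percentage and transcripts per gene for each sample."""
    summary = {}
    for sample, _tissue, genes, novel, transcripts in rows:
        summary[sample] = {
            "novel_gene_pct": 100 * novel / genes,
            "transcripts_per_gene": transcripts / genes,
        }
    return summary


summary = summarize(table1)
mean_tpg = sum(s["transcripts_per_gene"] for s in summary.values()) / len(summary)

for sample, stats in summary.items():
    print(f"{sample}: {stats['novel_gene_pct']:.1f}% novel genes, "
          f"{stats['transcripts_per_gene']:.2f} transcripts per gene")
print(f"mean transcripts per gene: {mean_tpg:.1f}")
```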

FIGURE 1. Overview of annotated transcripts and genes. (a) Circular consensus sequence lengths of the four samples; (b) classification of transcripts; (c) prediction of alternative splicing for the three types of isoforms; and (d) expression of known and novel genes in four tissues of goat bone, fat, muscle and skin.

We identified 1 171 911 complete full-length sequences, of which 113 148 were high-quality isoforms (HQ isoforms, accuracy ≥99% and supported by ≥2 FL reads); the rest were considered low-quality isoforms (LQ isoforms, accuracy <99% or supported by only one FL read) and were discarded. The HQ isoforms were aligned to the latest goat reference genome (ARS1, GCA_001704415.1) using minimap2 (Li, 2016). The collapse_isoforms_by_sam.py tool from the cdna_cupcake software (https://github.com/Magdoll/cDNA_Cupcake) was then used to filter the alignments and collapse redundant isoforms, with a minimum identity of 90% and a minimum coverage of 85%.
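The two filtering steps above apply simple thresholds. The following is a minimal sketch of those thresholds only, not of the isoseq3 or cDNA_Cupcake implementations; the `Isoform` record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Isoform:
    name: str
    accuracy: float   # predicted consensus accuracy (0-1)
    fl_reads: int     # number of supporting full-length reads
    identity: float   # alignment identity to the reference (0-1)
    coverage: float   # fraction of the isoform covered by the alignment (0-1)


def is_high_quality(iso, min_accuracy=0.99, min_fl_reads=2):
    """HQ criterion from the text: accuracy >= 99% and at least 2 FL reads."""
    return iso.accuracy >= min_accuracy and iso.fl_reads >= min_fl_reads


def passes_alignment_filter(iso, min_identity=0.90, min_coverage=0.85):
    """Alignment criterion from the text: identity >= 90%, coverage >= 85%."""
    return iso.identity >= min_identity and iso.coverage >= min_coverage


def filter_isoforms(isoforms):
    """Keep isoforms that satisfy both the HQ and the alignment criteria."""
    return [i for i in isoforms if is_high_quality(i) and passes_alignment_filter(i)]


# Made-up example records for illustration.
examples = [
    Isoform("iso1", 0.995, 5, 0.99, 0.98),  # kept
    Isoform("iso2", 0.995, 1, 0.99, 0.98),  # only one FL read
    Isoform("iso3", 0.980, 4, 0.99, 0.98),  # accuracy below 99%
    Isoform("iso4", 0.999, 3, 0.99, 0.60),  # coverage below 85%
]
kept = [i.name for i in filter_isoforms(examples)]
print(kept)
```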

Novel genes and novel transcripts were detected from the non-redundant sequences using sqanti3_qc.py from the sqanti3 software (Tardaguila et al., 2018), and false-positive transcripts were then filtered using sqanti3_RulesFilter.py. A gene locus was defined as a novel gene if it did not overlap with a known gene. An isoform was defined as a novel transcript if it showed splice-site changes or new introns (exons) compared with known transcripts. In this way, we classified the transcripts into three categories: known transcripts from known genes; novel transcripts from known genes; and novel transcripts from novel genes. Approximately 10 000 genes were identified in each sample, of which 14.5–16.3% were novel genes, with the testicle sample having more novel genes than the abomasum samples. Each gene had 2.1 transcripts on average, and transcripts harbored 8.1–9.1 exons on average. Among the three abomasum samples, slightly fewer genes and transcripts were detected in 5ZY than in 7ZY and 8ZY, but its transcripts were longer on average (Table 1). Approximately 40.6% of the transcripts were novel, including 29.7% novel transcripts from known genes and 10.9% from novel genes (Figure 1b). This indicated that a large proportion of the goat genome remains poorly annotated.
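The novel-gene criterion (a locus that overlaps no known gene) can be sketched as an interval-overlap test. This is a simplified illustration rather than the SQANTI3 classification itself: strand is ignored, the linear scan would be replaced by an interval index at genome scale, and all coordinates below are made up.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def is_novel_locus(locus, known_genes):
    """Return True if the locus overlaps no known gene on the same chromosome.

    locus and known_genes entries are (chromosome, start, end) tuples.
    """
    chrom, start, end = locus
    return not any(
        chrom == k_chrom and overlaps(start, end, k_start, k_end)
        for k_chrom, k_start, k_end in known_genes
    )


# Made-up coordinates for illustration.
known_genes = [("chr1", 1_000, 5_000), ("chr1", 20_000, 30_000), ("chr2", 1_000, 9_000)]

print(is_novel_locus(("chr1", 6_000, 8_000), known_genes))   # intergenic locus
print(is_novel_locus(("chr1", 4_500, 7_000), known_genes))   # overlaps first gene
```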

AS events were identified across all transcripts using the suppa2 software (https://github.com/comprna/SUPPA), including skipped exon (SE), alternative 5′ splice site (A5), alternative 3′ splice site (A3), mutually exclusive exons (MX), retained intron (RI), alternative first exons (AF) and alternative last exons (AL). AF was the most prevalent, accounting for 39.4% of AS events, followed by SE (19.7%) and A5 (13.1%) (Figure 1c).
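As an illustration of one of these event types, a skipped exon (SE) can be recognized by comparing the exon structures of two isoforms: an internal exon in one isoform is skipped when the other isoform has an intron joining its two flanking exons directly. The sketch below is a simplified pairwise check on made-up coordinates, not the SUPPA2 algorithm, which derives events from a full annotation file.

```python
def introns(exons):
    """Return introns as (upstream exon end, downstream exon start) pairs."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def skipped_exons(inclusion, skipping):
    """Return internal exons of `inclusion` that `skipping` splices over.

    Both isoforms are lists of (start, end) exons sorted by coordinate.
    """
    skip_introns = set(introns(skipping))
    events = []
    for i in range(1, len(inclusion) - 1):
        upstream, exon, downstream = inclusion[i - 1], inclusion[i], inclusion[i + 1]
        # The exon is skipped if the other isoform joins its neighbours directly.
        if (upstream[1], downstream[0]) in skip_introns:
            events.append(exon)
    return events


# Made-up exon coordinates for two isoforms of the same gene.
isoform_a = [(100, 200), (300, 400), (500, 600)]
isoform_b = [(100, 200), (500, 600)]

print(skipped_exons(isoform_a, isoform_b))
```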

To validate the Iso-Seq sequencing data and the reliability of the identified novel genes, we quantified their expression levels in additional RNA-seq data from four goat tissues (bone, muscle, fat and skin) obtained from the NCBI SRA database (PRJNA485657) (Pan et al., 2021). To obtain high-quality reads, quality and length filtering of the raw sequencing reads was performed using the fastp software (Chen et al., 2018). High-quality reads were then mapped to the reference genome generated from full-length transcripts using star software (Dobin et al., 2013). The expression matrix of the annotated genes was calculated in fragments per kilobase of exon model per million mapped fragments (FPKM) using featurecounts (Mortazavi et al., 2008). The majority (84.8%) of novel genes were expressed in at least one sample with FPKM >1, suggesting that most of the identified novel genes are genuinely expressed at an appreciable level. Meanwhile, in each of the four tissues, the overall expression of novel genes was significantly lower than that of known genes (Figure 1d), which could partially explain why they are currently not annotated.
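The FPKM normalization and the FPKM >1 criterion can be written out explicitly. The sketch below uses the standard FPKM definition and assumes that raw fragment counts (such as those reported by featurecounts) are normalized in a separate step; the function names and all numbers are made up for illustration.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    return fragment_count * 1e9 / (exon_length_bp * total_mapped_fragments)


def expressed_in_any(fpkm_by_sample, threshold=1.0):
    """True if the gene exceeds the FPKM threshold in at least one sample."""
    return any(value > threshold for value in fpkm_by_sample.values())


# Made-up example: 500 fragments on a 2 kb gene in a 25-million-fragment library.
example_fpkm = fpkm(500, 2_000, 25_000_000)
print(example_fpkm)

# Made-up per-tissue FPKM values for two hypothetical novel genes.
gene_a = {"bone": 0.2, "muscle": 3.4, "fat": 0.0, "skin": 0.9}
gene_b = {"bone": 0.1, "muscle": 0.4, "fat": 0.0, "skin": 0.9}
print(expressed_in_any(gene_a), expressed_in_any(gene_b))
```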

Using four PacBio Iso-Seq datasets, this study identified abundant novel genes and novel alternative splicing events that had not yet been annotated in the goat genome. The results showed that the current annotation of the goat genome is still very incomplete and far from representing the richness of the transcriptome. However, transcriptome profiling using long-read sequencing is still limited in livestock species (Beiki et al., 2019; Rosen et al., 2020; Warr et al., 2020; Yuan et al., 2021). Even in this study, only four Iso-Seq datasets were available, which is insufficient for comprehensive gene annotation. In the future, more full-length transcriptome sequencing in various goat tissues will be needed to obtain accurate and complete genome annotation information, which will be a fundamental resource for functional genomics studies.

ACKNOWLEDGEMENTS

We thank the High-Performance Computing platform of Northwest A&F University for providing computing resources.

FUNDING INFORMATION

This study was supported by research grants from the National Natural Science Foundation of China to RL (31802027) and Major Projects of Science and Technology in Guangxi to YHC (GuikeAA18118041).

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

Open Research

DATA AVAILABILITY STATEMENT

The data were uploaded to the NCBI SRA database under the accession numbers SRR11410765, SRR22542638 and SRR22542639.

REFERENCES

Ali, A., Thorgaard, G.H. & Salem, M. (2021) PacBio Iso-Seq improves the rainbow trout genome annotation and identifies alternative splicing associated with economically important phenotypes. Frontiers in Genetics, 12, 683408. Available from: https://doi.org/10.3389/fgene.2021.683408
Beiki, H., Liu, H., Huang, J., Manchanda, N., Nonneman, D., Smith, T.P.L. et al. (2019) Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data. BMC Genomics, 20(1), 344. Available from: https://doi.org/10.1186/s12864-019-5709-y
Bickhart, D.M., Rosen, B.D., Koren, S., Sayre, B.L., Hastie, A.R., Chan, S. et al. (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics, 49(4), 643–650. Available from: https://doi.org/10.1038/ng.3802
Boyazoglu, J., Hatziminaoglou, I. & Morand-Fehr, P. (2005) The role of the goat in society: past, present and perspectives for the future. Small Ruminant Research, 60, 13–23. Available from: https://doi.org/10.1016/j.smallrumres.2005.06.003
Chen, S., Zhou, Y., Chen, Y. & Gu, J. (2018) Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. Available from: https://doi.org/10.1093/bioinformatics/bty560
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. Available from: https://doi.org/10.1093/bioinformatics/bts635
Feng, S., Xu, M., Liu, F., Cui, C. & Zhou, B. (2019) Reconstruction of the full-length transcriptome atlas using PacBio Iso-Seq provides insight into the alternative splicing in Gossypium australe. BMC Plant Biology, 19(1), 365. Available from: https://doi.org/10.1186/s12870-019-1968-7
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29(7), 644–652. Available from: https://doi.org/10.1038/nbt.1883
He, W., Zhang, X., Lv, P., Wang, W., Wang, J., He, Y. et al. (2022) Full-length transcriptome reconstruction reveals genetic differences in hybrids of Oryza sativa and Oryza punctata with different ploidy and genome compositions. BMC Plant Biology, 22(1), 131. Available from: https://doi.org/10.1186/s12870-022-03502-2
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103–2110. Available from: https://doi.org/10.1093/bioinformatics/btw152
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), 621–628. Available from: https://doi.org/10.1038/nmeth.1226
Nudelman, G., Frasca, A., Kent, B., Sadler, K.C., Sealfon, S.C., Walsh, M.J. et al. (2018) High resolution annotation of zebrafish transcriptome using long-read sequencing. Genome Research, 28(9), 1415–1425. Available from: https://doi.org/10.1101/gr.223586.117
Pan, X., Li, Z., Li, B., Zhao, C., Wang, Y., Chen, Y. et al. (2021) Dynamics of rumen gene expression, microbiome colonization, and their interplay in goats. BMC Genomics, 22(1), 288. Available from: https://doi.org/10.1186/s12864-021-07595-1
Rosen, B.D., Bickhart, D.M., Schnabel, R.D., Koren, S., Elsik, C.G., Tseng, E. et al. (2020) De novo assembly of the cattle reference genome with single-molecule sequencing. Gigascience, 9(3), giaa021. Available from: https://doi.org/10.1093/gigascience/giaa021
Tardaguila, M., de la Fuente, L., Marti, C., Pereira, C., Pardo-Palacios, F.J., Del Risco, H. et al. (2018) SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Research, 28(3), 396–411. Available from: https://doi.org/10.1101/gr.222976.117
Warr, A., Affara, N., Aken, B., Beiki, H., Bickhart, D.M., Billis, K. et al. (2020) An improved pig reference genome sequence to enable pig genetics and genomics research. Gigascience, 9(6), giaa051. Available from: https://doi.org/10.1093/gigascience/giaa051
Yuan, Z., Ge, L., Sun, J., Zhang, W., Wang, S., Cao, X. et al. (2021) Integrative analysis of Iso-Seq and RNA-seq data reveals transcriptome complexity and differentially expressed transcripts in sheep tail fat. PeerJ, 9, e12454. Available from: https://doi.org/10.7717/peerj.12454
Zhang, R., Kuo, R., Coulter, M., Calixto, C.P.G., Entizne, J.C., Guo, W. et al. (2022) A high-resolution single-molecule sequencing-based Arabidopsis transcriptome using novel methods of Iso-seq analysis. Genome Biology, 23(1), 149. Available from: https://doi.org/10.1186/s13059-022-02711-0
Zheng, Z., Wang, X., Li, M., Li, Y., Yang, Z., Wang, X. et al. (2020) The origin of domestication genes in goats. Science Advances, 6(21), eaaz5216. Available from: https://doi.org/10.1126/sciadv.aaz5216


Volume 54, Issue 4

August 2023

Pages 421-424

This article also appears in:

Functional Genomics and Annotation of Domestic Animals
