New genes driven by segmental duplications share a testis-specific expression pattern in the chromosome-level genome assembly of tree sparrow
The origination of new genes is a fundamental question in genome evolution, and gene duplication is one of the most important mechanisms for the formation of new genes (Ohno 1970; Long et al. 2003; Kaessmann 2010; Ding et al. 2012). Gene duplications can add new gene copies to the genome, which provide the raw materials for the evolution of novel gene functions and evolutionary adaptation (Crow & Wagner 2006; Magadum et al. 2013). In many cases, the duplicated gene copies are part of large duplicated chromosomal segments, while the large (>1 kbp) and highly identical (>90%) segment copies in particular chromosomal regions are referred to as segmental duplications (SDs) (Bailey et al. 2001). Owing to their high sequence identity, SDs can promote non-allelic homologous recombination; as a result, they are known as hotspots of chromosomal rearrangement and copy number variation (Bailey et al. 2004; Sharp et al. 2005; Bailey & Eichler 2006; Perry et al. 2006; Liu et al. 2011).
Although they are critical in genome evolution and plasticity, SDs may be particularly problematic to characterize at the genomic level because of their inconspicuousness, large size, and high sequence similarity, which make them frequently the last regions of genomes to be sequenced and assembled (Bailey et al. 2001; Vollger et al. 2022). Advances in long-read sequencing technologies and assembly algorithms may help to overcome the issue, and the recent generation of a complete telomere-to-telomere human genome (T2T-CHM13) has successfully demonstrated sequence resolution of complex SDs (Vollger et al. 2022). To enrich our understanding of SD organization in birds, we generated a chromosome-level genome assembly of the tree sparrow (Passer montanus), one of the most common passerine species in China, through the combination of long-read HiFi sequencing technology and Hi-C sequencing. We sequenced ∼45× HiFi reads from a male tree sparrow and assembled these reads into a 1.28 Gb genome assembly, consisting of 744 contigs with a contig N50 length of 54.42 Mb. About 1.16 Gb sequences (91.49% of the total assembly) of the assembled genome were anchored into 36 pseudo-chromosomes with the help of ∼83× Hi-C sequence data (Fig. 1a; Tables S1,S2, Supporting Information). Compared with the previously published Illumina-based assembly of tree sparrow (Qu et al. 2020), our assembly showed great improvement in continuity and completeness (Table S2, Supporting Information).
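For readers less familiar with assembly statistics, the contig N50 reported above can be computed from a list of contig lengths as in the following minimal Python sketch. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from our assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical contig lengths, arbitrary units).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)  # 10 + 9 + 8 = 27 reaches half of 54, so N50 is 8
```

A larger N50 means that half of the assembly lies in fewer, longer contigs, which is why it is commonly used as a summary of continuity.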

Using the high-quality assembly, we identified the SD content and analyzed its evolutionary history. In total, we identified 61.74 Mbp of nonredundant SDs (>1 kbp in length and >90% identity), which contained 692 annotated protein-coding genes and 42 894 transposable elements (TEs) (Figs S12,S13 and Tables S9,S10, Supporting Information). Focusing on SD regions that carry new gene copies, we detected expansion events of 54 protein-coding gene families through inter- and intrachromosomal duplications in the tree sparrow genome (Table S11, Supporting Information). Among these, eight gene families were significantly expanded in the tree sparrow compared with other avian species: the Cys2His2 zinc finger (C2H2ZNF) protein, olfactory receptor (OR), proviral integration site for Moloney murine leukemia virus (PIM), p21-activated kinase (PAK), maestro heat-like repeat-containing protein family member (MROH), hydrocephalus-inducing protein homolog (HYDIN), heat shock factor (HSF), and inositol 1,4,5-trisphosphate receptor-interacting protein-like (ITPRIPL) families (C2H2ZNF: 583, OR: 561, PIM: 450, PAK: 335, MROH: 227, HYDIN: 72, HSF: 63, ITPRIPL: 60) (Fig. S6, Supporting Information). In addition, we noticed that members of the eight families were consistently clustered and adjacent to one another on the chromosomes. This pattern suggests that the expansion events of the individual families were not independent during evolution, whereas the differing expansion scales of the families indicate that the duplications were not fully synchronous. Many members of these eight significantly expanded gene families overlapped with the identified SD blocks, and about 80% of the SD genes were members of the eight families (Figs S12,S13 and Table S11, Supporting Information), which suggests that SD was one of the most important expansion mechanisms of these eight gene families.
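The SD thresholds stated above (>1 kbp in length and >90% identity) and the notion of "nonredundant" SD content can be illustrated with a short Python sketch. It assumes that candidate duplications are already available as `(chromosome, start, end, identity)` records with half-open coordinates; this record layout, the function names, and the toy values are hypothetical, and the sketch is not the detection pipeline used in this study.

```python
from collections import defaultdict

MIN_LENGTH_BP = 1000   # > 1 kbp, as stated in the text
MIN_IDENTITY = 0.90    # > 90% identity, as stated in the text


def filter_sd_alignments(alignments):
    """Keep records longer than 1 kbp with sequence identity above 90%."""
    return [
        (chrom, start, end, identity)
        for chrom, start, end, identity in alignments
        if (end - start) > MIN_LENGTH_BP and identity > MIN_IDENTITY
    ]


def nonredundant_bp(intervals):
    """Total bases covered by (chrom, start, end) intervals, counting overlaps once."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:          # overlapping or touching: extend the block
                cur_end = max(cur_end, end)
            else:                         # gap: close the current block
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Toy records (hypothetical values, not data from this study).
toy = [
    ("chr1", 0, 5000, 0.95),
    ("chr1", 3000, 8000, 0.92),
    ("chr1", 10000, 10500, 0.99),  # too short, dropped
    ("chr2", 0, 2000, 0.85),       # identity too low, dropped
]
kept = filter_sd_alignments(toy)
covered = nonredundant_bp([(c, s, e) for c, s, e, _ in kept])  # 8000
```

Merging overlapping intervals before summing is what prevents bases shared by several pairwise alignments from being counted more than once.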
Duplicate genes are known as major sources of genetic material and evolutionary novelty, which play a crucial role in the adaptation to different environments (Moore & Purugganan 2003; Crow & Wagner 2006; Conant & Wolfe 2008; Magadum et al. 2013; Wang et al. 2023). The additional new copies added through SDs may provide opportunities for tree sparrows to adapt to new environments.
By analyzing the genomic regions of the gene families that are related to the frequent and rapid SD events, we noticed that these eight gene families had chromosomal distribution patterns similar to those of long terminal repeat retrotransposons (LTR-RTs) (Fig. S14, Supporting Information). On the one hand, this result indicates that LTR-RTs may have insertion site preferences within these families. Interestingly, PIM, one of the eight families, is known as a preferential proviral integration site for the Moloney murine leukemia virus (Cuypers et al. 1984). On the other hand, the adjacent distributions may also indicate that transposable elements (TEs) were involved in the SD processes. Enrichment of TEs in SD regions has been widely reported in mammals (Bailey et al. 2001, 2003; Cheung et al. 2003; She et al. 2008) and insects (Fiston-Lavier et al. 2007; Zhao et al. 2013, 2017), although the enriched TE types differ among species. Nevertheless, it remains uncertain whether the LTR-RTs mediated the SDs in tree sparrows or whether other mechanisms drove the duplication events, with the expansion of LTR-RTs being merely a by-product of the SDs.
Using transcriptome data from different tissues (testis, spleen, lung, heart, liver, kidney, muscle, and brain) of adult tree sparrows, we compared the expression profiles of the eight significantly expanded gene families across tissues. A large proportion of the genes in these expanded families were transcriptionally inactive, especially in OR (∼94%) and C2H2ZNF (∼89%) (Fig. S15, Supporting Information). Surprisingly, the highly transcribed genes of the eight families, whether located in SD regions or not, generally exhibited testis-specific expression (Fig. 1b,c). New gene duplicates are more prone than old genes to testis-biased or even testis-specific expression, a pattern that has been verified in multiple species (Vinckenbosch et al. 2006; Cui et al. 2015; Kondo et al. 2017; Assis 2019; Zhang & Zhou 2019) and that led to the “out of testis” hypothesis. This hypothesis posits that promiscuous transcription in the testis and powerful selection pressures in the male germline, such as sperm competition, encourage the emergence and fixation of new genes, and that these new genes may later be expressed and acquire new functions in other tissues (Kaessmann 2010). The similar testis-specific expression pattern of the eight gene families, which have diverse structures and functions, is consistent with the “out of testis” hypothesis. Meanwhile, many amplified genes located on microchromosomes were also reported to be testis-expressed in the recently published complete chicken genome (Huang et al. 2023). Therefore, we infer that the high testis expression of these gene families is related to both the common fate of new duplicate genes and the characteristics of avian microchromosomes.
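How testis specificity was scored is not detailed in this letter. As an illustration only, one widely used measure is the tau index, which ranges from 0 (equal expression in all tissues) to 1 (expression in a single tissue). The sketch below assumes non-negative expression values per tissue (for example, TPM); the function names and the example values are hypothetical and are not results of this study.

```python
TISSUES = ["testis", "spleen", "lung", "heart", "liver", "kidney", "muscle", "brain"]


def tau_index(values):
    """Tau tissue-specificity index: sum(1 - x_i / x_max) / (n - 1).

    Returns None for a gene with no expression in any tissue,
    because the index is undefined when x_max is 0.
    """
    values = list(values)
    if len(values) < 2:
        raise ValueError("tau needs at least two tissues")
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    peak = max(values)
    if peak == 0:
        return None
    return sum(1 - v / peak for v in values) / (len(values) - 1)


def top_tissue(expression_by_tissue):
    """Name of the tissue with the highest expression value."""
    return max(expression_by_tissue, key=expression_by_tissue.get)


# Hypothetical gene expressed only in the testis.
gene = dict(zip(TISSUES, [10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
tau = tau_index(gene.values())   # 1.0: expression confined to one tissue
tissue = top_tissue(gene)        # "testis"
```

Under this scheme, a gene would be called testis-specific when its tau is close to 1 and the testis is its top tissue; the exact cutoff is a choice left to the analyst. In practice, expression values are often log-transformed before computing tau.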
In conclusion, the high-quality chromosome-level assembly of the tree sparrow improved our knowledge of SDs in avian species. The SD events added a large number of new copies of eight gene families to the tree sparrow genome, providing abundant raw genetic material. In addition, the testis-specific expression patterns of these new genes provided support for the “out of testis” hypothesis. We hope that our study can inspire further exploration of SDs and their evolutionary consequences in other avian species.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. 31572216) and the Foundation for Excellent Doctoral Student of Gansu Province (No. 22JR5RA413). We received support for the computational work from the Supercomputing Centre of Lanzhou University. We thank Dr. Gang Song and Yanzhu Ji (Institute of Zoology, Chinese Academy of Sciences, Beijing, China) for their helpful comments on the manuscript.
CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.
DATA AVAILABILITY STATEMENT
All raw sequence data have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (BioProject: PRJNA867105).