Volume 21, Issue 1 pp. 238-250
RESOURCE ARTICLE

A chromosome-scale assembly of the black gram (Vigna mungo) genome

Wirulda Pootakham (Corresponding Author)
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Wanapinun Nawae
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Chaiwat Naktang
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Chutima Sonthirod
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Thippawan Yoocha
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Wasitthee Kongkachana
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Duangjai Sangsrakru
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Nukoon Jomchai
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Sonicha U-thoomporn
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Prakit Somta
Department of Agronomy, Faculty of Agriculture at Kamphaeng Saen, Kasetsart University, Nakhon Pathom, Thailand

Kularb Laosatit
Department of Agronomy, Faculty of Agriculture at Kamphaeng Saen, Kasetsart University, Nakhon Pathom, Thailand

Sithichoke Tangphatsornruang (Corresponding Author)
National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand

Correspondence

Wirulda Pootakham and Sithichoke Tangphatsornruang, National Omics Center, National Science and Technology Development Agency (NSTDA), Pathum Thani, Thailand.

Emails: [email protected](WP); [email protected](ST)

First published: 13 August 2020

Nawae, Naktang and Sonthirod are equal contributors.

Abstract

Black gram (Vigna mungo) is an important short-duration grain legume crop. Black gram seeds provide an inexpensive source of dietary protein. Here, we applied the 10X Genomics linked-read technology to obtain a de novo whole genome assembly of V. mungo cultivated variety Chai Nat 80 (CN80). The preliminary assembly contained 12,228 contigs and had an N50 length of 5.2 Mb. Subsequent scaffolding using the long-range Chicago and HiC techniques yielded the first high-quality, chromosome-level assembly of 499 Mb comprising 11 pseudomolecules. Comparative genomics analyses based on sequence information from single-copy orthologous genes revealed that black gram and mungbean (Vigna radiata) diverged about 2.7 million years ago. The transversion rate (4DTv) analysis in V. mungo revealed no evidence of a recent genome-wide duplication event such as that observed in the tetraploid créole bean (Vigna reflexo-pilosa). The proportion of repetitive elements in the black gram genome is slightly lower than the numbers reported for related Vigna species. The majority of long terminal repeat retrotransposons appeared to integrate into the genome within the last five million years. We also examined alternative splicing events in V. mungo using full-length transcript sequences. While intron retention was the most prevalent mode of alternative splicing in several plant species, alternative 3' acceptor site selection represented the majority of events in black gram. Our high-quality genome assembly along with the genomic variation information from the germplasm provides valuable resources for accelerating the development of elite varieties through marker-assisted breeding and for future comparative genomics and phylogenetic studies in legume species.

1 INTRODUCTION

Black gram (Vigna mungo [L.] Hepper) is an important short-duration grain legume crop with high protein content in seeds. Black gram is a self-pollinating diploid species (2n = 2x = 22) with an estimated genome size of 574 Mb (Arumuganathan & Earle, 1991). Black gram seeds are an inexpensive source of dietary protein, starch, vitamins and mineral elements, containing a high level of folate and iron (Kakati, Deka, Kotoki, & Saikia, 2010). V. mungo var. mungo (L.) Hepper appears to have been domesticated in India from its wild progenitor, V. mungo var. silvestris (Chandel, Lester, & Starling, 1984). This pulse crop is widely cultivated in South and Southeast Asian countries including India, Bangladesh, Pakistan, Sri Lanka, Myanmar, the Philippines and Thailand (Kaewwongwal et al., 2015). India is the world's largest producer of black gram, followed by Myanmar and Pakistan (Kaewwongwal et al., 2015; Raizada & Souframanien, 2019). To date, genetic improvement in black gram has been achieved primarily through conventional breeding, as marker-assisted approaches are still in their infancy. While yield has improved, the average yield per hectare is still low due to losses from biotic (e.g., powdery mildew and yellow mosaic disease) and abiotic (e.g., drought and salinity) stresses.

Over the past few years, several studies have utilized Illumina short-read sequencing technology to obtain transcriptome assemblies for the purpose of developing simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers in black gram (Jasrotia et al., 2017; Raizada & Souframanien, 2019; Souframanien & Reddy, 2015). Nevertheless, there has not been a report on a genome assembly for this legume species. Here, we employed the 10X Genomics linked-read technology (Paajanen et al., 2019) to perform de novo genome assembly of V. mungo. We also applied the long-range Chicago (in vitro proximity ligation) and HiC (in vivo fixation of chromosomes) techniques (Putnam et al., 2016) to obtain the first chromosome-scale whole genome assembly for this species. A combination of the 10X Genomics linked-read technology and the long-range HiC scaffolding technique provides an effective approach to produce a high-quality reference assembly. Along with the genomic variation information from V. mungo germplasm, this genome assembly provides invaluable resources for accelerating the development of improved elite black gram varieties through molecular breeding and future phylogenetics and comparative genomics studies in Vigna species.

2 MATERIALS AND METHODS

2.1 Plant materials and DNA/RNA extraction

Ninety black gram accessions maintained at Kasetsart University (Thailand) were used in this study. For DNA extraction, fresh leaf tissues were collected, flash-frozen in liquid nitrogen and stored at –80°C until use. To obtain high molecular weight DNA for 10X Genomics linked-read sequencing, frozen tissues (cultivar Chai Nat 80 [CN80]) were homogenized, and DNA was extracted using QIAGEN Genomic-tip 100/G following the manufacturer's protocol (Qiagen). The DNA integrity was assessed using the Pippin Pulse Electrophoresis System (Sage Science). Total RNA was isolated from the following tissues (CN80): leaf, root, stem, flower, 1-week-old pod and 3-week-old pod using the CTAB buffer (2% CTAB, 1.4 M NaCl, 2% PVP, 20 mM EDTA pH 8.0, 100 mM Tris-HCl pH 8.0, 0.4% SDS). The aqueous phase was extracted three times using 25:24:1 phenol:chloroform:isoamyl alcohol, and RNA was precipitated overnight in ¼ volume of 8 M LiCl. The pellets were washed with 70% ethanol, air-dried and resuspended in RNase-free water. Poly(A) mRNAs were enriched from total RNA samples using the Dynabeads mRNA Purification Kit (ThermoFisher Scientific).

2.2 DNA and RNA library preparation and sequencing

A total of 1.25 ng of high molecular weight DNA was used to prepare the linked-read library using the Chromium Genome Library Kit & Gel Bead Kit v2, the Chromium Genome Chip Kit v2 and the Chromium i7 Multiplex Kit according to the manufacturer's instructions (10X Genomics). The resulting 10X library was sequenced on a single lane of Illumina HiSeq X Ten (2 × 150 bp paired-end reads). For whole genome shotgun sequencing of 89 V. mungo accessions (Table S1), approximately 300 ng of each DNA sample was used for library construction following the protocol in the MGIEasy FS Library Prep Kit (MGI Tech). Paired-end (150 bp) sequencing was performed on the MGISEQ-2000RS according to the manufacturer's instructions.

RNA integrity was assessed with a Fragment Analyser System (Agilent) prior to the construction of RNA sequencing libraries. Two Iso-seq libraries were prepared according to a previously published protocol (Pootakham et al., 2017) using the SMARTer PCR cDNA Synthesis Kit (Clontech) and size-selected using the BluePippin Size Selection System (Sage Science) into 1–2 kb, 2–3 kb and 3–6 kb bins. One library was prepared from RNA extracted from leaf tissue, and the other was prepared from pooled RNA samples (root, stem, flower, 1-week-old pod and 3-week-old pod). Sequencing was performed on the PacBio RSII sequencing system using P6-C4 polymerase and chemistry and 360 min movie times according to the manufacturer's protocol. To obtain short-read RNA sequences, six RNA libraries (one for each tissue type) were prepared according to the protocol reported in Pootakham et al. (2018). Briefly, 200 ng of poly(A) mRNA was used to construct a library using the Ion Total RNA Sequencing Kit (ThermoFisher Scientific). The libraries were sequenced on the Ion S5 XL using the Ion 540 chip (ThermoFisher Scientific).

2.3 Chicago library preparation and sequencing

Chicago library preparation and sequencing were carried out by Dovetail Genomics. A Chicago library was prepared as described previously (Putnam et al., 2016). Briefly, ~500 ng of high molecular weight genomic DNA (mean fragment length = 61 kbp) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 103 million 2 × 150 bp paired-end reads, which provided 21.72X physical coverage of the genome (1–100 kb pairs).

2.4 Dovetail HiC library preparation and sequencing

Dovetail HiC library preparation and sequencing were carried out by Dovetail Genomics. A Dovetail HiC library was prepared in a similar manner as described previously (Lieberman-Aiden et al., 2009). Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 86 million 2 × 150 bp paired-end reads, which provided 2,521.27X physical coverage of the genome (10–10,000 kb pairs).

2.5 De novo genome assembly

Linked-read data were assembled with the Supernova assembler version 2.1.1 using the default settings (https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/running; 10X Genomics). The Supernova scaffolds along with Chicago library reads, shotgun reads, and Dovetail HiC library reads were used as the input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). An iterative analysis was conducted. First, shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and to make joins above a threshold. After aligning and scaffolding Chicago data, Dovetail HiC library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs.

2.6 Assembly quality assessment

The assembly quality assessment was carried out by aligning short-read DNA and RNA-seq sequences, Iso-seq transcript sequences and publicly available genomic and transcriptomic sequences (Kundu, Patel, Patel, & Pal, 2015; Kundu, Singh, Dey, Ganguli, & Pal, 2019) using BLASTN at an e-value cutoff of 1.0e-10. The completeness of the final genome assembly was also evaluated using Benchmarking Universal Single-Copy Orthologues (BUSCO) (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). The BUSCO pipeline version 3 was used to test for the presence and completeness of orthologues using the Embryophyta OrthoDB release 9 (Kriventseva et al., 2015).

2.7 Annotation of repetitive elements and repeat masking

To generate a de novo repeat library, RepeatModeler version 2.0.1 (http://www.repeatmasker.org/RepeatModeler/) was used to predict transposable elements in the unannotated genome assembly. Two de novo repeat-finding programs, RECON version 1.08 and RepeatScout version 1.0.5, were employed to identify the boundaries of repetitive elements and to build consensus models of interspersed repeats. To mask the assembled genome sequences, we employed both the custom black gram-specific repeat library generated by RepeatModeler and the repetitive sequences in the RepBase plant repeat database (release 20150807; https://www.girinst.org/) using RepeatMasker version 4.0.9_p2 (default parameters) (Tempel, 2012). The insertion times for the full-length LTR retrotransposons were estimated using the LTR_retriever program (Ou & Jiang, 2018). The insertion time of each LTR retrotransposon (T) was calculated according to the following formula: T = K/(2μ), where K is the divergence between the two LTRs calculated with the Jukes-Cantor model (Jukes & Cantor, 1969) for non-coding sequences, and μ is the substitution rate in repeat regions, 1.64 × 10⁻⁸ (Zhuang et al., 2019). The Jukes-Cantor model assumes equal base frequencies and equal mutation rates. We also assumed that the mutation rates were similar among the species analysed.
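The insertion-time formula can be illustrated with a minimal Python sketch. It is not the LTR_retriever implementation: the function names are hypothetical, the two LTR sequences are assumed to be already aligned, and μ is assumed to be expressed per site per year (the text gives only the value).

```python
import math

# Substitution rate quoted in the text (Zhuang et al., 2019).
# Assumption: the unit is substitutions per site per year.
MU = 1.64e-8


def jukes_cantor_distance(ltr5: str, ltr3: str) -> float:
    """Jukes-Cantor corrected divergence K between two aligned LTR copies."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where both sequences carry an unambiguous base.
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)  # observed mismatch fraction
    if p >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time(k: float, mu: float = MU) -> float:
    """Insertion time T = K / (2 * mu)."""
    return k / (2.0 * mu)
```

Under these assumptions, two LTRs that differ at 1 site in 100 give K of about 0.0101 and an insertion time of roughly 0.3 million years.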

2.8 Gene annotation

To predict protein-coding sequences in the unmasked genome, evidence from transcriptome-based prediction, ab initio gene prediction and homology-based prediction was combined using EvidenceModeler (EVM) version 1.1.1 r2015-07-03 (Haas et al., 2008). Transcriptome-based prediction methods combined information from RNA-seq and Iso-seq data obtained from leaf, root, stem, flower and pod. Full-length transcripts were mapped to the genome assembly using Genomic Mapping and Alignment Program (GMAP; version r20160630) (Wu & Watanabe, 2005), and short-read RNA-seq data were mapped to the assembly using PASA2 version 2.0.1 (Haas et al., 2008). Protein sequences from Vigna radiata (mungbean), Vigna angularis (adzuki bean), Vigna unguiculata (cowpea), Phaseolus vulgaris (common bean), Glycine max (soybean) and Arabidopsis thaliana obtained from public databases were aligned to the unmasked genome using AAT version 1.52 (Huang, Adams, Zhou, & Kerlavage, 1997). Two ab initio gene predictors were run on the unmasked assembly. Protein-coding gene predictions were obtained with Augustus version 3.2.1 (Stanke, Steinkamp, Waack, & Morgenstern, 2004) trained with V. radiata, V. angularis, V. unguiculata, P. vulgaris, G. max and A. thaliana PASA transcriptome alignment assembly and with BRAKER (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016; Hoff, Lomsadze, Borodovsky, & Stanke, 2019) using an Iso-seq alignment file as an input. All gene predictions were integrated by EVM to generate consensus gene models using the following weight for each evidence type: PASA2, 1; GMAP, 1; AAT, 0.3; and Augustus, 0.3. The positions of annotated genes were cross-checked with those of known repeats, and any gene that had more than 20% overlapping sequence with repetitive elements was excluded from the list of annotated genes. Predicted genes were functionally annotated using OmicsBox version 1.3.11 (https://www.biobam.com/download-omicsbox/). Protein sequences were aligned with the following protein databases: UniProtKB/Swiss-Prot (swissprot v5) and GenBank nonredundant database (nr v5) using local BLASTP with an e-value cutoff of 1.0e-5. Gene ontology (GO) terms were retrieved and assigned to V. mungo query sequences. Enzyme codes corresponding to V. mungo gene ontology terms were retrieved and mapped to KEGG pathway annotations.
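The repeat-overlap filter described above (genes with more than 20% of their sequence overlapping repetitive elements are excluded) could be sketched as follows. This is an illustrative reading, not the authors' script: the function names are hypothetical, coordinates are assumed to be half-open (start, end) intervals on the same chromosome, and overlap is assumed to be measured over the whole gene span.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_fraction(gene, repeats):
    """Fraction of the gene span covered by repeat annotations."""
    g_start, g_end = gene
    # Merging first avoids counting nested or overlapping repeats twice.
    covered = sum(max(0, min(g_end, r_end) - max(g_start, r_start))
                  for r_start, r_end in merge_intervals(repeats))
    return covered / (g_end - g_start)


def keep_gene(gene, repeats, max_fraction=0.20):
    """Keep a gene unless more than max_fraction of it overlaps repeats."""
    return repeat_fraction(gene, repeats) <= max_fraction
```

For a 1,000 bp gene, two overlapping repeats that together cover 200 bp leave the gene in the annotation, whereas 201 bp of repeat coverage removes it.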

2.9 Comparative genomics and phylogenetic analysis

We used OrthoFinder (Emms & Kelly, 2019) to identify orthologous groups in V. mungo and nine other legumes (G. max, V. unguiculata, V. angularis, V. radiata, Vigna reflexo-pilosa, P. vulgaris, Arachis duranensis, Cicer arietinum and Medicago truncatula), two cucurbit species (Cucumis sativus and Cucumis melo), two rosid species (Prunus persica and A. thaliana) and one monocot (Oryza sativa). Protein sequences from single-copy orthologous groups were used to construct a phylogenetic tree using the RAxML-ng program (Kozlov, Darriba, Flouri, Morel, & Stamatakis, 2019). We first aligned sequences in each single-copy orthologous group with MUSCLE (Edgar, 2004) and removed alignment gaps with trimAl (Capella-Gutiérrez, Silla-Martínez, & Gabaldón, 2009) using the automated1 heuristic method. All alignment blocks were concatenated using the catsequences program (https://github.com/ChrisCreevey/catsequences), and the substitution model for each alignment block was estimated using the ModelTest-NG program (Darriba et al., 2019). The outputs were subsequently used to compute a maximum-likelihood phylogenetic tree. Divergence times of species in the phylogenetic tree were estimated with the MCMCtree program (PAML4 package) (Yang, 2007) using the relaxed-clock model with the known divergence time between Phaseolus and Vigna, estimated at 6.4–10.4 million years ago (MYA) (Lavin, Herendeen, & Wojciechowski, 2005) and the known divergence time between cucumber and melon, estimated at 8.4–11.8 MYA (Sebastian, Schaefer, Telford, & Renner, 2010). Significantly expanded or contracted gene families across the phylogenetic tree (p-value < .01) were identified using the CAFE software version 4.2 (Han, Thomas, Lugo-Martinez, & Hahn, 2013) with the gene birth-death (λ) parameters estimated using the maximum-likelihood method.

2.10 Phaseoloid evolutionary analysis

We adapted the method described in Ren, Huang, and Cannon (2019) to reconstruct the ancestral genome of V. mungo, V. radiata, V. unguiculata, V. angularis, P. vulgaris and G. max. In brief, we used OrthoFinder to identify orthologous groups in these six species. Syntenic blocks were then constructed from the orthologous groups using the DAGchainer program (Haas, Delcher, Wortman, & Salzberg, 2004) in the Synima pipeline (Farrer, 2017). The outputs from DAGchainer were used to specify “markers” representing features that were shared by the selected genomes using the scripts provided in Ren et al. (2019). In the following step, we used the MLGO web service (http://www.geneorder.org/server.php) (Hu, Lin, & Tang, 2014) to infer the ancestral genome from the order of the markers in each individual genome and the information from the phylogenetic tree constructed using single-copy orthologues (Kozlov et al., 2019).

2.11 V. mungo phylogenetic relationship and population structure analysis

For the phylogenetic analysis, we used a set of 6,657 SNP markers at four-fold-degenerate sites that met the following criteria: (a) a minor allele frequency >0.05; (b) a read depth between 20X and 200X; and (c) fewer than 10% missing data. The R package ape was used to construct a neighbour-joining tree with 1,000 bootstrap replicates (Paradis, Claude, & Strimmer, 2004; R Core Team, 2016). We applied the same set of SNPs to examine the population structure using the STRUCTURE program (version 2.3.4) (Falush, Stephens, & Pritchard, 2003) with 10,000 iterations and the number of clusters (K) set from 2 to 4. The Evanno method was used to detect the number of K groups that best fitted the data set (Earl & vonHoldt, 2012; Evanno, Regnaut, & Goudet, 2005).
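The three SNP filtering criteria can be expressed as a small predicate. The sketch below is a hypothetical illustration rather than the pipeline used in the study: genotypes are assumed to be encoded as alternate-allele counts (0, 1 or 2) with None for missing calls, and `depth` is assumed to be a single per-site depth value, since the text does not say how depth was aggregated across accessions.

```python
def passes_snp_filters(genotypes, depth,
                       min_maf=0.05, depth_range=(20, 200), max_missing=0.10):
    """Apply the MAF, depth and missing-data criteria to one SNP site.

    genotypes: alternate-allele counts per accession (0, 1, 2) or None if missing.
    depth: per-site read depth (assumed aggregation, see lead-in).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called or missing_rate >= max_missing:   # (c) fewer than 10% missing
        return False
    if not depth_range[0] <= depth <= depth_range[1]:  # (b) depth 20X to 200X
        return False
    alt_freq = sum(called) / (2 * len(called))      # diploid allele frequency
    maf = min(alt_freq, 1 - alt_freq)
    return maf > min_maf                            # (a) MAF > 0.05
```

With 89 accessions, a site where 9 accessions are homozygous for the alternate allele has a minor allele frequency of about 0.10 and passes, while a site with a single such accession (about 0.01) does not.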

3 RESULTS

3.1 Genome assembly and annotation

An elite cultivar, Chai Nat 80, was selected for whole-genome shotgun sequencing using the 10X Genomics linked-read strategy. We generated a total of 133 Gb of Illumina paired-end 150 bp sequencing data from 892,194,918 raw reads, representing 232X coverage based on the estimated genome size of 574 Mb (Arumuganathan & Earle, 1991; Pal, 2006). A de novo assembly of linked-read sequences using Supernova yielded a draft genome of 498.9 Mb. The preliminary assembly contained 12,228 contigs and had an N50 length of 5.2 Mb (Table 1). The analysis of k-mer distribution of the genome sequencing reads provided an estimated genome size of 531.3 Mb (Figure S1), close to the previously reported figure (Pal, 2006). The preliminary assembly of the V. mungo genome was further scaffolded with the HiRise software (Dovetail Genomics, Santa Cruz, USA; Figure S2) using the long-range Chicago (in vitro proximity ligation; 103 million read pairs; 21.72X physical coverage of the genome) and HiC (in vivo fixation of chromosomes; 86 million read pairs; 2,521.27X physical coverage of the genome) library data. The final assembly contained 11 pseudomolecules greater than 10 Mb in length (hereafter referred to as chromosomes, numbered according to size; Figure 1), corresponding to the haploid chromosome number in V. mungo (1n = 11, 2n = 22). The 11 chromosomes covered 463,352,435 bases or 92.8% of the 499 Mb assembly.

TABLE 1. Assembly statistics of the V. mungo genome

| Statistic | 10X Genomics | 10X Genomics + Chicago | 10X Genomics + Chicago + HiC |
| --- | --- | --- | --- |
| N50 scaffold size (bases) | 5,235,471 | 98,299 | 43,171,434 |
| L50 scaffold number | 30 | 131 | 5 |
| N75 scaffold size (bases) | 2,307,364 | 331 | 35,890,399 |
| L75 scaffold number | 68 | 408,224 | 9 |
| N90 scaffold size (bases) | 22,699 | 55,295 | 24,711,178 |
| L90 scaffold number | 351 | 732 | 11 |
| Assembly size (bases) | 498,929,800 | 499,131,000 | 498,872,271 |
| Number of scaffolds | 12,228 | 11,461 | 9,224 |
| Number of scaffolds ≥100 kb | 147 | 633 | 13 |
| Number of scaffolds ≥1 Mb | 95 | 128 | 11 |
| Number of scaffolds ≥10 Mb | 5 | 0 | 11 |
| Longest scaffold (bases) | 16,127,237 | 6,075,845 | 65,136,230 |
| % N | 0.06 | 0.06 | 0.06 |
| GC content (%) | 33.55 | 33.55 | 33.55 |
| BUSCO evaluation (% completeness) | - | - | 94.4 |
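
The N50/L50-style statistics reported for the assembly follow a standard definition: sort scaffolds from longest to shortest and find the scaffold at which the running total first reaches the chosen fraction of the assembly size. A minimal sketch, with a hypothetical function name and made-up toy lengths rather than the V. mungo scaffolds:

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Nx is the length of the scaffold at which the cumulative length, counted
    from the longest scaffold down, first reaches `fraction` of the total;
    Lx is the number of scaffolds needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths, total = 30):
toy = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy, 0.5)   # cumulative 10, 18 -> reaches 15 at the 2nd scaffold
n90, l90 = nx_lx(toy, 0.9)   # cumulative 10, 18, 24, 28 -> reaches 27 at the 4th
```

The same function gives N75/L75 and N90/L90 by changing `fraction`.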
FIGURE 1. Genomic landscape of V. mungo chromosomes. (a) Physical map of 11 assembled chromosomes (Mb scale) numbered according to size. (b) Repeat density represented by proportion of genomic regions covered by repetitive sequences in 500 kb windows. (c) Gene density represented by number of genes in 500 kb windows. (d) GC content represented by percentage of G + C bases in 500 kb windows. (e) SNP density represented by number of SNP markers in 500 kb windows. (f) Syntenic blocks are depicted by connected lines

To evaluate the quality of the final genome assembly, we aligned genomic DNA reads to the genome sequences and found that 93.8% of the reads from the MGI shotgun libraries could be mapped back to the assembly. We also aligned our RNA-seq reads and full-length Iso-seq transcript sequences as well as publicly available RNA-seq reads to the assembly. The percentages of our RNA-seq and Iso-seq reads that could be mapped to the genome were 94.6% and 99.0%, respectively. The percentages of reads mapped to the genome assembly were 95.4%, 97.4% and 99.8% for RNA-seq reads from the NCBI GenBank accessions SRR3141655 (Kundu et al., 2015), SRR2058996 (Kundu et al., 2019) and SRR554452, respectively. To further assess the completeness of our V. mungo genome assembly, we employed the BUSCO software to check the gene content using a plant-specific database of 1,440 genes (Simão et al., 2015). Our gene predictions recovered 94.4% of the highly conserved orthologues in the Embryophyta lineage, with 91.9% identified as “complete” and 2.5% identified as “partial” (Table 1).

To annotate the genome, we used a combination of ab initio prediction, homology-based search and transcript evidence from both Iso-seq and RNA-seq data for gene prediction. The genome annotation contained 32,729 predicted gene models, of which 29,411 (89.86%) were protein-coding genes (Tables S2, S3, S4). The most prevalent gene ontology (GO) term associated with cellular component was integral component of the membrane (7,265), followed by nucleus (2,788; Figure S3). The largest category of genes annotated to molecular function was ATP binding (2,718), followed by metal ion binding (1,178) and DNA binding (1,149; Figure S3). Only 3,318 genes (10.13%) remained functionally unannotated. We observed an uneven distribution of genes, with an increase in density towards the ends of the chromosomes (Figure 1). Compared with other sequenced legume genomes, the number of predicted genes in the V. mungo genome was lower than that of V. angularis (Yang et al., 2015), G. max (Schmutz et al., 2010) and Cajanus cajan (Varshney et al., 2012), higher than that of P. vulgaris (Schmutz et al., 2014), V. radiata (Kang et al., 2014) and C. arietinum (Varshney et al., 2013), and similar to that of V. unguiculata (Lonardi et al., 2019). We found transcript support for 21,926 protein-coding genes (74.6%), comparable to the percentage reported in barrel medic (76.7%) (Young et al., 2011) but considerably higher than the percentage reported in adzuki bean (53%) (Yang et al., 2015).

The average gene length was 3,123 nt with 5.22 exons per gene, and the average exon length was 226 nt (Table S2). In addition to coding sequences, we identified 5,202 microRNAs, 979 tRNAs, 271 rRNAs and 322 small nuclear RNAs (Table S5). The GC content of the V. mungo genome is 33.6%, similar to other sequenced legume genomes (Sato et al., 2008; Schmutz et al., 2010, 2014; Varshney et al., 2012, 2013; Yang et al., 2015; Young et al., 2011). The GC content in coding regions (43.0%) was higher than that observed in introns and untranslated regions (32.9%; Table S2).

3.2 Comparative genomics and phylogenetic analyses

To investigate the evolutionary relationships between black gram and other plant species, we analysed the gene sets from nine legumes: soybean (G. max), cowpea (V. unguiculata), adzuki bean (V. angularis), mungbean (V. radiata), créole bean (V. reflexo-pilosa), common bean (P. vulgaris), a diploid progenitor of cultivated peanut (A. duranensis), chickpea (C. arietinum) and barrel medic (M. truncatula); two cucurbit species: cucumber (C. sativus) and melon (C. melo); two rosid species: peach (P. persica) and Arabidopsis; and one monocot: rice (O. sativa). Rice was included in the analysis as an outgroup species. We chose to include cucumber and melon because of their known divergence time (Sebastian et al., 2010), whereas peach and Arabidopsis were selected as representatives of rosid species because of the availability of their complete genome sequences. Of 609,118 input proteins from 15 species, 546,885 (89.78%) were clustered into 20,784 orthologous groups. Sequence information from single-copy orthologous genes was used to construct a maximum-likelihood phylogenetic tree, revealing that V. mungo and V. radiata diverged approximately 2.7 MYA (Figure 2a, Figure S4). The ancestor of V. mungo and V. radiata formed a sister clade to the ancestor of V. reflexo-pilosa and V. angularis, and the two clades diverged about 4.2 MYA. This placement in the phylogenetic tree was consistent with previous reports (Doi, Kaga, Tomooka, & Vaughan, 2002; Tun Tun & Yamaguchi, 2007).

FIGURE 2. Comparative genomics of V. mungo and other sequenced plant species. (a) Inferred phylogenetic tree of V. mungo and 14 plant species based on protein sequences of single-copy orthologous genes. Bar charts display the number of proteins that were widespread (found in at least 14 of the 15 species), legume-specific and species-specific (having no detectable orthologues in other species). Numbers at each node represent the estimated divergence time of each node in million years ago (MYA). Numbers in green (+) and red (-) represent the numbers of gene families that have expanded or contracted, respectively, relative to their ancestor. (b) Distribution of 4DTv distances between orthologous genes in the genomes of V. mungo (black gram), V. radiata (mungbean), V. angularis (adzuki bean), V. unguiculata (cowpea), V. reflexo-pilosa (créole bean) and P. vulgaris (common bean). (c) Distribution of 4DTv distances between paralogous genes in the genomes of V. mungo, V. radiata, V. angularis, V. unguiculata, V. reflexo-pilosa and P. vulgaris

We analysed gene family expansion and contraction in 10 legumes and five other plant species. Of the 20,784 gene families identified among 15 species, 32 (0.15%) and 352 (1.69%) were significantly expanded or contracted in V. mungo, respectively, after the speciation from V. radiata (Figure 2a, Figure S4). We further investigated the functions of genes in the expanded families and observed a number of genes encoding protein kinases and transcription factors (Table S6). In contrast, a large number of contracted gene families were involved in the ubiquitination pathway or encoded pentatricopeptide repeat-containing proteins, which appear to mediate gene expression through the regulation of RNA stability and translation (Manna, 2015) (Table S6). Of 25,859 protein-coding genes in black gram that had orthologues present in the other 14 species analysed, 7,091 (21.62%) genes were legume-specific and only 404 (1.23%) genes were specific to black gram (Figure 2a). The proportions of species-specific (1.23%–1.56%) and legume-specific (20.87%–22.61%) genes were similar among the diploid Vigna species.

We used the 4DTv approach, which measures the number of transversions at fourfold degenerate synonymous sites (i.e., codons in which any base at the third position is translated into the same amino acid), to analyse orthologous gene pairs in order to estimate the relative timing of evolutionary divergence between V. mungo and closely related phaseoloid species. Figure 2b showed a peak 4DTv distance of 0.053 (V. mungo–P. vulgaris), which was higher than the peak 4DTv distances of 0.037 (V. mungo–V. unguiculata), 0.021 (V. mungo–V. angularis), 0.0167 (V. mungo–V. radiata) and 0.0157 (V. mungo–V. reflexo-pilosa), implying that V. mungo and P. vulgaris diverged prior to the speciation events separating V. mungo, V. radiata, V. angularis and V. reflexo-pilosa. Comparison of 5,527 pairs of paralogous genes residing in duplicated collinear blocks within the V. mungo genome revealed a single peak at 0.25, suggesting that black gram has experienced only one ancient whole genome duplication event, in contrast to the V. reflexo-pilosa genome, which had a sharp peak at 0.019 indicative of a recent genome-wide duplication event (Figure 2c). Examination of synteny with other phaseoloid species revealed extensive conservation between V. mungo and four warm-season legumes (V. radiata, V. angularis, V. unguiculata and P. vulgaris; Figure S5).
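
To make the 4DTv measure concrete, the sketch below counts transversions at fourfold degenerate third-codon positions for one pair of codon-aligned coding sequences. It is a simplified illustration and not the pipeline used in the study: the function names are hypothetical, the standard genetic code is assumed, and the value returned is the raw proportion, without the multiple-substitution correction that some pipelines additionally apply.

```python
# First two bases of the eight fourfold degenerate codon families in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def is_transversion(a: str, b: str) -> bool:
    """True when one base is a purine and the other a pyrimidine."""
    return (a in PURINES) != (b in PURINES)


def four_dtv(cds1: str, cds2: str) -> float:
    """Raw 4DTv: transversions per fourfold degenerate site (uncorrected)."""
    if len(cds1) != len(cds2) or len(cds1) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    sites = transversions = 0
    for i in range(0, len(cds1), 3):
        c1, c2 = cds1[i:i + 3].upper(), cds2[i:i + 3].upper()
        # A site counts only if both codons share the same fourfold prefix
        # and both third bases are unambiguous.
        if (c1[:2] == c2[:2] and c1[:2] in FOURFOLD_PREFIXES
                and c1[2] in "ACGT" and c2[2] in "ACGT"):
            sites += 1
            if is_transversion(c1[2], c2[2]):
                transversions += 1
    if sites == 0:
        raise ValueError("no fourfold degenerate sites shared by the pair")
    return transversions / sites
```

In practice the value would be computed for every orthologous or paralogous gene pair and the distribution of values plotted, so that peaks can be compared between species pairs.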

3.3 Repetitive sequence analysis

To analyse the repetitive sequences in V. mungo, we combined the de novo repeat identification tool RepeatModeler with homology-based search tools and found that 205.4 Mb (41.1%) of the genome assembly consisted of repetitive DNA (Table S7). The proportion of repetitive sequences in V. mungo was lower than that reported for mungbean (50.1%) (Kang et al., 2014), V. angularis (44.5%) (Yang et al., 2015), V. unguiculata (49.5%) (Lonardi et al., 2019) and P. vulgaris (45%) (Schmutz et al., 2014). A genome-wide distribution plot showed that DNA elements, long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) were enriched near the centromeric regions (Figure 1). Classification of the observed transposable elements into known classes revealed that the majority of them were retrotransposons (24% of the assembly), whereas DNA transposons represented only 4.9% of the genome, consistent with the observations in other legume genomes (Table S7) (Varshney et al., 2012; Yang et al., 2015). LTRs were the predominant class of transposable elements in the V. mungo genome, occupying 57.1% of the repetitive DNA identified. The most abundant LTR superfamilies, Gypsy and Copia, occupied 33.6% and 23.1% of the repetitive elements in the genome, respectively (Figure 3a, Table S7). The proportions of LTR retroelements, LINEs, SINEs and DNA transposons in the V. mungo genome are comparable to those in the V. radiata (Kang et al., 2014), V. unguiculata (Lonardi et al., 2019) and V. angularis (Yang et al., 2015) genomes (Figure 3a).
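
The paragraph above mixes two denominators: some percentages are relative to the whole assembly and others are relative to the repetitive DNA only. The short Python sketch below converts the repeat-relative shares quoted in the text into approximate fractions of the assembly. The input percentages come from the paragraph. The derived values are rough arithmetic for orientation and are not figures reported by the authors.

```python
# Fraction of the assembly annotated as repetitive DNA (from the text: 41.1%).
REPEAT_FRACTION_OF_ASSEMBLY = 0.411

# Shares expressed relative to the repetitive DNA (from the text).
SHARE_OF_REPEATS = {
    "LTR retrotransposons": 0.571,
    "Gypsy": 0.336,
    "Copia": 0.231,
}


def to_assembly_fraction(share_of_repeats, repeat_fraction=REPEAT_FRACTION_OF_ASSEMBLY):
    """Convert a share of the repetitive DNA into a share of the whole assembly."""
    return share_of_repeats * repeat_fraction


assembly_fractions = {
    name: to_assembly_fraction(share) for name, share in SHARE_OF_REPEATS.items()
}

for name, fraction in assembly_fractions.items():
    print(f"{name}: {fraction * 100:.1f}% of the assembly")
```

Under these inputs the LTR share works out to roughly 23.5% of the assembly, which is broadly in line with the 24% quoted for retrotransposons as a whole.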

Figure 3. Repetitive elements in V. mungo and closely related legumes. (a) Repetitive elements in V. mungo and other related legume species. (b) Distribution of LTR retrotransposon insertion times in V. mungo, V. radiata, V. angularis, V. unguiculata and P. vulgaris [Colour figure can be viewed at wileyonlinelibrary.com]

The insertion times for LTR retrotransposons were estimated based on predicted full-length LTRs in the genome assembly. Even though some ancient repeat sequences inserted into the black gram genome over 10 MYA, nearly three quarters of the LTR retroelements (73%) integrated into the genome within the last five million years. Notably, the insertion times of ~23% of the LTR elements were more recent than 2 MYA (Figure 3b), suggesting that they started accumulating after the divergence of V. mungo and V. radiata from V. angularis. We also observed that LTR retrotransposons appeared to accumulate earlier in the V. mungo, V. radiata and P. vulgaris genomes (2–5 MYA) than in the V. unguiculata genome (1–2 MYA; Figure 3b).

3.4 Genome evolution in the phaseoloid clade

To investigate chromosome evolution of black gram and other legumes within the phaseoloid clade, we analysed syntenic blocks across their genomes. Nine ancestral chromosomes were inferred from the syntenic and phylogenetic relationships among those genomes, based on 21,969 orthologous groups. Seven V. mungo chromosomes (chromosomes 1, 3, 4, 5, 7, 8 and 9) exhibited a one-to-one relationship with the V. radiata and V. unguiculata genomes, while six V. mungo chromosomes exhibited a one-to-one relationship with the V. angularis and P. vulgaris genomes (Figure 4). Two V. mungo chromosomes (chromosomes 3 and 4) exhibited a one-to-two syntenic relationship with G. max owing to the whole genome duplication event in the Glycine genome (Figure 4). Relative to the nine-chromosome ancestral configuration, the V. mungo, V. unguiculata and P. vulgaris genomes appeared to best preserve the ancestral karyotype, followed by the V. angularis and V. radiata genomes, with three out of nine chromosomes remaining in the ancestral state. Half of the chromosomes in Phaseolinae species and most G. max chromosomes were derived from a series of fusion and fission events.

Figure 4. Evolution of phaseoloid genomes. Reconstructed ancestral genomes and genomes of selected Vigna species are presented in a phylogenetic tree display. Chromosome numbers are indicated on top of each chromosome. Colour arrows denote chromosomes with either one-to-one or one-to-two syntenic relationships between V. mungo and V. radiata, V. angularis, V. unguiculata and P. vulgaris [Colour figure can be viewed at wileyonlinelibrary.com]

3.5 V. mungo population structure

To investigate genetic diversity and variation in the germplasm, 89 black gram accessions were selected and shotgun-sequenced using the MGI sequencing platform. A total of 3,164,866,082 high-quality, cleaned reads (474 Gb) were mapped to the V. mungo genome assembly with an average mapping rate of 97.6% (Table S1). The population structure was explored with STRUCTURE (Pritchard, Stephens, & Donnelly, 2000) using SNP markers at fourfold degenerate sites. Based on the Evanno method (Earl & vonHoldt, 2012; Evanno et al., 2005), we found that K = 3 was the best fit to the data (Figure 5, Figure S6). Our results showed that the 89 black gram accessions were an admixture of three subpopulations, and each subpopulation comprised accessions from different geographical origins/countries (Figure 5, Figure S6). These results were consistent with a previous genetic diversity study using simple sequence repeat markers (Kaewwongwal et al., 2015). The majority of V. mungo accessions used in this study were from India, Nepal and Pakistan (Table S1). Geographic vicinity, cultural ties and migration of people (Gartaula & Niehof, 2013) among these countries may account for the genetic admixture observed.
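
The Evanno method cited above selects K from the rate of change of the log-likelihood across successive K values. The Python sketch below illustrates the ΔK statistic on hypothetical replicate log-likelihoods. It combines replicates through their per-K means, which is an assumption about how runs are summarised, and none of the numbers come from this study.

```python
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """Compute Evanno's delta K for every K that has both neighbours.

    log_likelihoods maps K to a list of L(K) values from replicate runs
    (at least two runs per K are needed for the standard deviation).
    """
    means = {k: mean(values) for k, values in log_likelihoods.items()}
    result = {}
    for k in sorted(log_likelihoods):
        if k - 1 not in means or k + 1 not in means:
            continue  # the second-order difference needs K-1 and K+1
        sd = stdev(log_likelihoods[k])
        if sd == 0:
            continue  # delta K is undefined when replicates are identical
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_difference / sd
    return result


# Hypothetical log-likelihoods from two replicate runs per K.
runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -904.0],
    3: [-700.0, -702.0],
    4: [-690.0, -698.0],
    5: [-685.0, -695.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
```

The statistic cannot be evaluated at the smallest or largest K tested, so the range of K explored needs to extend at least one step beyond the value eventually selected.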

Figure 5. Black gram population structure. Population structure of 89 V. mungo accessions. A neighbour-joining phylogenetic tree constructed using SNP markers at fourfold degenerate sites. A vertical bar represents each accession, and the length of each colour indicates the proportion contributed by ancestral populations [Colour figure can be viewed at wileyonlinelibrary.com]

3.6 Alternative splicing in V. mungo

Alternative splicing contributes to the diversity of the transcriptome and proteome. The availability of full-length transcript isoforms from the long-read PacBio sequencing technology allowed us to investigate the following alternative splicing events in black gram: alternative 5' donor site selection, alternative 3' acceptor site selection, exon skipping and intron retention. A total of 2,590 alternative splicing events were detected in V. mungo (Figure S7a). While the alternative 3' acceptor site selection (33%), alternative 5' donor site selection (30%) and intron retention (29%) events were observed at similar frequencies, exon skipping was the least prevalent mode of alternative splicing, representing only 8% of the total events (Figure S7a). Occasionally, different types of alternative splicing were observed in combination in a single gene. Figure S7b illustrates an example of a transcript that was subjected to multiple forms of alternative splicing. Alternative splicing serves to diversify an organism's transcriptome, and recent data suggest that it is one of the mechanisms that plants use to adapt to a changing environment (Reddy, 2007; Shang, Cao, & Ma, 2017; Wang & Brendel, 2006). Future studies using RNA samples from different developmental stages and various growth conditions will be required to thoroughly probe alternative splicing events in order to obtain a complete repertoire of transcript isoforms in V. mungo.
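
The four event types listed above can be distinguished by comparing the exon coordinates of two isoforms of the same gene. The Python sketch below shows one simple way to do this. It is illustrative only: this section does not describe the classification procedure used in the study, the function names and coordinates are hypothetical, and exons are assumed to be half-open (start, end) intervals in genomic coordinates.

```python
def introns(exons):
    """Return introns as the gaps between consecutive exons (sorted by start)."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def inside_exon(intron, exons):
    """True if the intron lies entirely within one exon of the other isoform."""
    start, end = intron
    return any(xs <= start and end <= xe for xs, xe in exons)


def spans_exon(intron, exons):
    """True if the intron skips over a whole exon of the other isoform."""
    start, end = intron
    return any(start < xs and xe < end for xs, xe in exons)


def classify_events(iso_a, iso_b, strand="+"):
    """Return the set of alternative splicing event types between two isoforms."""
    # On the plus strand the left intron boundary is the 5' donor site and the
    # right boundary is the 3' acceptor site; the roles swap on the minus strand.
    if strand == "+":
        right_differs, left_differs = "alt_3prime_acceptor", "alt_5prime_donor"
    else:
        right_differs, left_differs = "alt_5prime_donor", "alt_3prime_acceptor"
    events = set()
    for x, y in ((iso_a, iso_b), (iso_b, iso_a)):
        only_x = set(introns(x)) - set(introns(y))
        only_y = set(introns(y)) - set(introns(x))
        for intron in only_x:
            if inside_exon(intron, y):
                events.add("intron_retention")
            elif spans_exon(intron, y):
                events.add("exon_skipping")
            else:
                for other in only_y:
                    # Ignore introns already explained by skipping or retention.
                    if spans_exon(other, x) or inside_exon(other, x):
                        continue
                    if intron[0] == other[0] and intron[1] != other[1]:
                        events.add(right_differs)
                    elif intron[1] == other[1] and intron[0] != other[0]:
                        events.add(left_differs)
    return events


# Hypothetical isoform pairs on the plus strand.
skipping = classify_events([(0, 100), (200, 300), (400, 500)], [(0, 100), (400, 500)])
retention = classify_events([(0, 100), (200, 300)], [(0, 300)])
acceptor = classify_events([(0, 100), (200, 300)], [(0, 100), (220, 300)])
```

Because the function returns a set, a pair of isoforms that differs in several ways yields several event types at once, which corresponds to the combinatorial cases mentioned in the text.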

4 DISCUSSION

In this study, we employed the 10X Genomics technology to obtain a de novo whole genome assembly of V. mungo. The 10X Genomics linked-read strategy utilizes emulsion technology to partition long DNA fragments into micelles, within which small DNA fragments are amplified and tagged by a shared barcode. After sequencing, the barcodes are used to identify sequences that are in close proximity in the genome, and long DNA fragments can be reconstructed based on these linked reads (Ott et al., 2018). As the 10X Genomics linked-read technology utilizes Illumina sequencing, it is more cost-effective to generate the preliminary assembly using this approach than with long-read PacBio sequencing (Zhang, Sun, et al., 2019; Zhang, Zhou, et al., 2019). The preliminary assembly obtained from the 10X Genomics linked-read strategy was 498.9 Mb with a scaffold N50 length of 5.2 Mb. To achieve higher contiguity, we further assembled the black gram genome using the long-range positional information from the Chicago and HiC techniques. The HiC approach identifies chromosomal interactions using chromosome conformation capture. HiC data provide long-range linkage information up to tens of megabases and can be used to generate chromosome-scale scaffolds (Burton et al., 2013; Marie-Nelly et al., 2014). The final assembly reported here is the first high-quality, chromosome-scale genome assembly of V. mungo, containing 11 chromosomes corresponding to the haploid chromosome number. The size of the preliminary assembly covered 86.9% of the estimated genome size based on flow cytometry analyses (Arumuganathan & Earle, 1991; Pal, 2006). Comparative genomics analyses based on sequence information from single-copy orthologous genes revealed that V. mungo and V. radiata diverged about 2.7 MYA. Unlike in V. reflexo-pilosa, there was no evidence supporting a recent whole genome duplication event in V. mungo.
The proportion of repetitive elements in the black gram genome is slightly lower than the numbers reported for related Vigna species; however, it should be noted that the quality of the genome assemblies and/or the repeat identification methods used might affect the percentages of repetitive sequences reported in each species. The majority of LTR retrotransposons appeared to integrate into the genome within the last five million years.
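
For readers unfamiliar with the scaffold N50 statistic quoted above, the Python sketch below shows how it is usually computed: scaffolds are sorted from longest to shortest, and N50 is the length of the scaffold at which the running total first reaches half of the assembly size. The function name and the toy lengths are hypothetical and are unrelated to the V. mungo assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the total to half the assembly.
        if running * 2 >= total:
            return length


# Toy example: total length 32, so the halfway point of 16 is reached
# after the two scaffolds of length 8.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

A larger N50 indicates that more of the assembly sits in long scaffolds, which is why the statistic is used here to compare contiguity before and after Chicago and HiC scaffolding.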

We also obtained Iso-seq data from multiple tissues (leaf, root, stem, flower, 1-week-old pod and 3-week-old pod) and identified transcript variants exhibiting alternative splicing events. In contrast to the observations made in black gram, intron retention has been reported as the most prevalent alternative splicing mechanism in several plant species such as Arabidopsis (Marquez, Brown, Simpson, Barta, & Kalyna, 2012), G. max (Shen et al., 2014), V. radiata (Satyawan, Kim, & Lee, 2017), cotton (Feng, Xu, Liu, Cui, & Zhou, 2019), maize (Thatcher et al., 2016; Wang et al., 2016) and rice (Zhang, Sun, et al., 2019; Zhang, Zhou, et al., 2019). Our high-quality genome assembly along with the genomic variation information from the germplasm provides an invaluable resource for investigating marker-trait association at a whole genome level, gene expression analyses and comparative genomics and phylogenetic studies in legume species.

ACKNOWLEDGEMENTS

This study was supported by the National Omics Center under the National Science and Technology Development Agency, Thailand, grant number: 1000221.

AUTHOR CONTRIBUTIONS

W.P. and S.T. designed the research study. W.P., T.Y., D.S., N.J., S.U., P.S. and K.L. performed laboratory work (sample collection, DNA and RNA extraction, library construction and sequencing). C.N., C.S., W.N. and W.K. performed bioinformatics analyses. W.P. wrote and revised the manuscript, and all authors reviewed it.

DATA AVAILABILITY STATEMENT

V. mungo genome assembly, Iso-seq and RNA-seq data have been submitted to the DDBJ/EMBL/GenBank databases under BioProject number PRJNA623719: genome assembly–JABCND000000000; Iso-seq data–SRR11787985 and SRR11787359; RNA-seq data–SRR11775845, SRR11775821, SRR11775823, SRR11775824, SRR11775822, SRR11775544.
