De novo assemblies of Luffa acutangula and Luffa cylindrica genomes reveal an expansion associated with substantial accumulation of transposable elements
Chutima Sonthirod, Chaiwat Naktang and Wanapinun Nawae contributed equally.
Abstract
Luffa spp. (sponge gourd or ridge gourd) is an economically important vegetable crop widely cultivated in China, India and Southeast Asia. Here, we employed PacBio long-read single-molecule real-time (SMRT) sequencing to perform de novo genome assemblies of two commonly cultivated Luffa species, L. acutangula and L. cylindrica. We obtained preliminary draft genomes of 734.6 Mb and 689.8 Mb with scaffold N50 of 786,130 and 578,616 bases for L. acutangula and L. cylindrica, respectively. We also applied long-range Chicago and HiC techniques to obtain the first chromosome-scale whole-genome assembly of L. acutangula. The final assembly contained 13 pseudomolecules, corresponding to the haploid chromosome number in Luffa spp. (1n = 13, 2n = 26). The sizes of the assembled Luffa genomes are approximately twice as large as the genome assemblies of related Cucurbitaceae. A large proportion of L. acutangula (62.17%; 456.69 Mb) and L. cylindrica (56.78%; 391.65 Mb) genome assemblies contained repetitive elements. Phylogenetic analyses revealed that the substantial accumulation of transposable elements likely contributed to the expansion of the Luffa genomes. We also investigated alternative splicing events in Luffa using full-length transcript sequences obtained from PacBio Isoform Sequencing (Iso-seq). While the predominant form of alternative splicing in most plant species examined was intron retention, alternative 3’ acceptor site selection appeared to be a major event observed in Luffa. High-quality genome assemblies for L. acutangula and L. cylindrica reported here provide valuable resources for Luffa breeding and future genetics and comparative genomics studies in Cucurbitaceae.
1 INTRODUCTION
Luffa spp. (commonly known as sponge gourd, ridge gourd, loofah or dishcloth gourd (Joshi, Tiwari, Kc, Ghale, & Gyawali, 2013)) belong to the family Cucurbitaceae. They are cross-pollinated diploid species with 26 chromosomes (2n = 26) (Wu et al., 2016). The genus Luffa comprises nine species (Filipowicz & Schaefer, 2014; Prakash, Pandey, Jalli, & Bisht, 2013), two of which, Luffa acutangula (L.) Roxb. (ridge gourd) and Luffa cylindrica (L.) Roem. (sponge gourd), are domesticated (Dassanayake & Forsberg, 1988). Luffa spp. are prevalent in the subtropical regions of Asia, and the genus is believed to have an Asian origin (Heiser & Schilling, 1988; Heiser, Schilling, & Dutt, 1988). Both L. acutangula and L. cylindrica are widely cultivated in India, China, Thailand, Central America and Africa (Oboh & Aluyor, 2009; Rabei, Rizk, & Khedr, 2013; Wu et al., 2014). Immature Luffa fruits serve as nutrient-rich vegetables that are abundant in bioactive compounds beneficial to human health such as glycosides, alkaloids, flavonoids and sterols (Partap, Kumar, Sharma, & Jha, 2012). Mature fruits contain a tough and fibrous network of cellulose that can be used as bathing or cleaning sponges as well as biodegradable filters (Oboh & Aluyor, 2009; Zhang, Hu, Zhang, Guan, & Zhang, 2007). Luffa fruits have also been used in traditional medicine to treat anaemia, leucoderma and tumours (Manikandaselvi, Vadivel, & Brindha, 2016).
Luffa breeding programmes still rely largely on conventional approaches because marker-assisted selection remains at an early stage, owing to the limited genetic and genomic resources available. Using an F2 population derived from an interspecific cross between L. acutangula and L. cylindrica, Cui et al. (2015) reported the first genetic linkage map, constructed with 258 sequence-related amplified polymorphism (SRAP) markers. More recently, another linkage map was constructed using simple sequence repeat (SSR) markers developed from transcriptome sequencing (Wu et al., 2014); however, the map resolution was relatively low, with an average distance of 8.11 cM between adjacent markers (Wu et al., 2016). In addition, two studies on fruit browning reported the genome-wide transcriptome sequencing of L. cylindrica (Chen et al., 2015; Zhu et al., 2017). The first attempt to perform a genome survey sequencing of Luffa was carried out by An et al. (2017). The first available assembly for L. cylindrica, obtained from a small-insert (220-bp) library, was highly fragmented with a scaffold N50 length of merely 807 bp (An et al., 2017). Recently, Zhang et al. (2020) reported a de novo assembly of the L. cylindrica genome, utilizing the Pacific Biosciences (PacBio) sequencing platform, which offers kilobase-sized reads without GC-bias or systematic errors. A combination of long-read PacBio assembly and long-range HiC scaffolding provides an effective approach to produce a high-quality reference assembly. Here, we combined PacBio long-read single-molecule real-time (SMRT) sequencing and Chicago/HiC techniques to obtain de novo genome assemblies of two cultivated Luffa species, L. acutangula and L. cylindrica. Comparative genomics and phylogenetic analyses revealed that the substantial accumulation of repetitive elements, especially long terminal repeat (LTR) retrotransposons, contributed to the large genome sizes of Luffa spp.
These high-quality genome assemblies along with the genomic variation information from L. acutangula and L. cylindrica germplasm will provide a foundation for basic and applied research, expediting the progress towards the development of elite varieties through marker-trait association analyses.
2 MATERIALS AND METHODS
2.1 Plant materials and DNA/RNA isolation
Sixty-one ridge gourd (L. acutangula) and twenty-three sponge gourd (L. cylindrica) accessions maintained at Chia Tai Company Limited (Thailand) were used in this study. Fresh healthy leaf tissues were collected, immediately frozen in liquid nitrogen and stored at −80°C until DNA extraction. To obtain high-molecular-weight DNA for PacBio single-molecule real-time (SMRT) sequencing, frozen tissues were pulverized in liquid nitrogen, and the CTAB buffer (2% CTAB, 1.4 M NaCl, 2% PVP, 20 mM EDTA pH 8.0, 100 mM Tris-HCl pH 8.0, 0.4% SDS) was added. DNA was extracted from the aqueous phase twice using 25:24:1 phenol:chloroform:isoamyl alcohol and precipitated in 2.5 volumes of absolute ethanol. DNA pellets were washed with 70% ethanol twice, air-dried and resuspended in 10 mM Tris-HCl pH 8.0. DNA samples were subsequently purified with the Ampure® PB beads (Pacific Biosciences, Menlo Park, CA, USA), and the DNA integrity was assessed using the Pippin Pulse Electrophoresis System (Sage Science, Beverly, MA, USA). Total RNA was extracted from above-ground and root tissues using CTAB buffer and 25:24:1 phenol:chloroform:isoamyl alcohol as mentioned above. RNA was precipitated overnight in ¼ volumes of 8M LiCl, washed with 70% ethanol, air-dried and resuspended in RNase-free water. Poly(A) mRNAs were enriched from total RNA samples using the Dynabeads mRNA Purification Kit (Thermo Fisher Scientific, Waltham, MA, USA).
2.2 Genomic and transcriptomics (Iso-seq) library preparation and sequencing
SMRTbell libraries with an insert size of 12,000 nt were constructed for the PacBio RSII sequencing system. Sequencing was performed with P6-C4 polymerase and chemistry using 360-min movie times according to manufacturer's protocols. For short-read whole-genome shotgun sequencing of the 84 accessions described previously, DNA was isolated using the protocol reported in Pootakham et al. (2017). Illumina paired-end libraries (2 × 150 bp) were prepared and sequenced on the HiSeqX by NovogeneAIT Genomics Singapore Pte Ltd (Singapore). Iso-seq libraries were prepared according to a previously published protocol (Pootakham et al., 2017) using the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View, USA) and size-selected using the BluePippin Size Selection System (Sage Science, Beverly, USA) into 1–2 kb, 2–3 kb and 3–6 kb bins. Sequencing was performed using polymerase chemistry and the movie time mentioned above.
2.3 Chicago library preparation and sequencing
Chicago library preparation and sequencing were carried out by Dovetail Genomics (Scotts Valley, CA, USA). A Chicago library was prepared as described previously (Putnam et al., 2016). Briefly, ~500 ng of HMW gDNA (mean fragment length = 61) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5’ overhangs were filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, cross-links were reversed and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350-bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 107 million 2 × 150 bp paired-end reads, which provided 6.30× physical coverage of the genome (1–100 kb pairs).
2.4 Dovetail HiC library preparation and sequencing
Dovetail HiC library preparation and sequencing were carried out by Dovetail Genomics (Scotts Valley, CA, USA). A Dovetail HiC library was prepared in a similar manner as described previously (Lieberman-Aiden et al., 2009). Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DpnII, the 5’ overhangs were filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, cross-links were reversed and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350-bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 101 million 2 × 150 bp paired-end reads, which provided 393.88× physical coverage of the genome (10–10,000 kb pairs).
2.5 De novo genome assembly
A total of 2,823,498 and 2,404,358 raw reads (totalling 39.73 and 39.65 Gb) from L. acutangula and L. cylindrica, respectively, were subjected to read correction, trimming, overlap detection and de novo assembly by Canu v1.8 (Koren et al., 2017) using the following parameters: genomeSize=790m correctedErrorRate=0.040. An estimated genome size of 790 Mb (An et al., 2017) was assumed for both L. acutangula and L. cylindrica; apart from the parameters mentioned above, default parameters were used in the assembly process. Polishing was carried out using the GenomicConsensus package in the SMRT Analysis software suite version 2.3. The GenomicConsensus package contains a main driver programme (variantCaller), which provides two consensus/variant calling algorithms: Arrow and Quiver (https://github.com/PacificBiosciences/GenomicConsensus). The PacBio preliminary assembly was used as an input for the subsequent scaffolding with HiRise. Calculation of the k-mer depth distribution for clean Illumina sequence reads and estimation of L. acutangula and L. cylindrica genome sizes were performed using Jellyfish software version 2.2.10 with the -C option (canonical k-mer counting) (Marcais & Kingsford, 2011).
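The text does not spell out how genome size was derived from the k-mer depth distribution. A common approach, shown here as a minimal sketch rather than the authors' procedure, divides the total number of counted k-mers by the depth of the main peak. The function name, the `min_depth` error cut-off and the toy histogram are all illustrative assumptions; the input is meant to resemble the (depth, count) pairs of a k-mer histogram.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from (depth, count) k-mer histogram pairs.

    min_depth is a hypothetical cut-off that discards low-depth k-mers,
    which are usually dominated by sequencing errors.
    """
    usable = [(depth, count) for depth, count in histogram if depth >= min_depth]
    # The most frequent depth approximates the coverage of single-copy sequence.
    peak_depth = max(usable, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences divided by the peak depth gives the size estimate.
    total_kmers = sum(depth * count for depth, count in usable)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): an error tail, a main peak at depth 20,
# and a small repeat peak at depth 40.
toy_histogram = [(1, 900), (2, 300), (5, 10), (18, 40), (19, 70),
                 (20, 100), (21, 70), (22, 40), (40, 10)]
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.1f} k-mer positions")
```

On real data the cut-off and the peak choice need inspection of the histogram, especially for heterozygous genomes where a half-depth peak can appear.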
2.6 Scaffolding the assembly of L. acutangula with HiRise
The input de novo assembly, shotgun reads, Chicago library reads and Dovetail HiC library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). An iterative analysis was conducted. First, shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold. After aligning and scaffolding Chicago data, Dovetail HiC library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs.
2.7 Assembly quality assessment
Prior to short-read alignment, we used TrimGalore-0.6.0 (https://github.com/FelixKrueger/TrimGalore) for trimming adapter sequences and removing low-quality bases from Illumina reads. Iso-seq reads were processed using the Iso-Seq3 pipeline (https://github.com/PacificBiosciences/IsoSeq) with default parameter setting. The quality of the assemblies was evaluated by aligning short-read Illumina sequences, Iso-seq transcript sequences and available genome/transcriptome sequences from previous studies (An et al., 2017; Chen et al., 2015; Zhu et al., 2017) using BLASTN at an e-value cut-off of 1e−10. The completeness of the final genome assemblies was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). The BUSCO pipeline version 3 was used to test for the presence and completeness of orthologs using the Embryophyta OrthoDB release 9 (Kriventseva et al., 2015).
2.8 Annotation of repetitive elements and repeat masking
To generate a de novo repeat library, RepeatModeler version 2.0.1 (http://www.repeatmasker.org/RepeatModeler/) was used to predict transposable elements in the unannotated genome assemblies. Two de novo repeat-finding programmes, RECON version 1.08 and RepeatScout version 1.0.5, were employed to identify the boundaries of repetitive elements and to build consensus models of interspersed repeats. To ensure that repeat sequences in the library did not contain large families of protein-coding sequences that are not transposable elements, we aligned them to GenBank's nr protein database using BLASTX (e-value cut-off of 1e−6). The custom Luffa-specific repeat library generated by RepeatModeler along with the repetitive sequences in the RepBase plant repeat database (release 20150807; https://www.girinst.org/) was used to mask the assembled genome sequences using RepeatMasker version 4.0.9_p2 (default parameters) (Tempel, 2012). To estimate the insertion time for the LTR retrotransposons, we first employed the LTR_FINDER (Ou & Jiang, 2019) and LTRHarvest (Ellinghaus, Kurtz, & Willhoeft, 2008) programs to predict full-length LTRs using default parameter setting. We subsequently used the LTR_retriever program (Ou & Jiang, 2018) to filter out false positives from the initial predictions of LTR_FINDER and LTRHarvest. The insertion times of LTRs (T) were calculated according to the following formula: T = K/(2μ), where K is the divergence between the two LTRs of an element calculated with the Jukes-Cantor model for non-coding sequences and μ is the neutral mutation rate (4.5e−9; estimated based on the known divergence time between cucumber and melon (Sebastian, Schaefer, Telford, & Renner, 2010)).
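The insertion-time formula can be illustrated with a small worked example. The sketch below assumes the two LTRs of one element are already aligned to equal length; the function names and the toy sequences are hypothetical, and only the formula T = K/(2μ) and the value of μ come from the text. The rate is taken to be per site per year, which is an assumption about its units.

```python
import math

MU = 4.5e-9  # neutral mutation rate quoted in the text (assumed per site per year)


def jukes_cantor_distance(ltr5, ltr3):
    """Jukes-Cantor (JC69) distance between two aligned sequences of equal length."""
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]  # skip gaps and ambiguous bases
    p = sum(a != b for a, b in pairs) / len(pairs)  # observed proportion of differences
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time(k, mu=MU):
    """Insertion time in years from LTR divergence k: T = K / (2 * mu)."""
    return k / (2 * mu)


# Toy 100-bp LTR pair (not real data) differing at three positions.
ltr5 = "ACGT" * 25
ltr3 = "CCGT" + ltr5[4:40] + "AAGT" + ltr5[44:96] + "ACGA"
k = jukes_cantor_distance(ltr5, ltr3)
print(f"K = {k:.4f}, T = {insertion_time(k) / 1e6:.2f} million years")
```

The factor of 2 reflects that both LTRs of an element were identical at insertion and have accumulated substitutions independently since then.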
2.9 Gene annotation
Evidence from transcriptome-based prediction, ab initio gene prediction and homology-based prediction was combined to predict protein-coding sequences in the unmasked Luffa genomes using EvidenceModeler (EVM) version 1.1.1 r2015-07-03 (Haas et al., 2008). Transcriptome-based prediction methods combined information from PacBio Iso-seq data obtained from leaf, root, apical meristem and flower and available short-read transcriptome data (Chen et al., 2015; Wu et al., 2014; Zhu et al., 2017). Full-length transcripts were mapped to the genome assemblies using Genomic Mapping and Alignment Program (GMAP; version r20160630) (Wu & Watanabe, 2005), and short-read RNA-seq data were mapped to the assemblies during the initial step of annotation using the PASA2 pipeline version 2.0.1 (Haas et al., 2008). Protein sequences from L. cylindrica, Cucumis sativus, Cucumis melo, Citrullus lanatus, Arabidopsis thaliana, Cucurbita maxima and Momordica charantia obtained from public databases were aligned to the unmasked genome using AAT version 1.52 (Huang, Adams, Zhou, & Kerlavage, 1997). Two ab initio gene predictors were run on the unmasked assemblies. Protein-coding gene predictions were obtained with Augustus version 3.2.1 (Stanke, Steinkamp, Waack, & Morgenstern, 2004) trained with C. sativus, C. melo, C. lanatus, A. thaliana, C. maxima and M. charantia PASA transcriptome alignment assembly and BRAKER (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016; Hoff, Lomsadze, Borodovsky, & Stanke, 2019) using Luffa Iso-seq and RNA-seq (Chen et al., 2015) alignment files as inputs. All gene predictions were integrated by EVM to generate consensus gene models using the following weight for each evidence type: PASA2 – 1, GMAP – 0.5, AAT – 0.3, Augustus – 0.3 and BRAKER – 0.3. The positions of annotated genes were cross-checked with those of known repeats, and any gene that had more than 20% of its sequence overlapping with repetitive elements was excluded from the list of annotated genes.
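The final repeat-overlap filter can be sketched as an interval computation. This is an illustrative reconstruction, not the authors' script: genes and repeats are assumed to be half-open (start, end) intervals on the same sequence, overlapping repeats are counted once, and the function and variable names are hypothetical. Only the 20% threshold comes from the text.

```python
def overlap_fraction(gene, repeats):
    """Fraction of a gene span [start, end) covered by repeat intervals."""
    start, end = gene
    covered = 0
    cursor = start  # right-most position already counted as covered
    for r_start, r_end in sorted(repeats):
        lo, hi = max(r_start, cursor), min(r_end, end)
        if hi > lo:  # count only the part not already covered
            covered += hi - lo
            cursor = hi
    return covered / (end - start)


def filter_genes(genes, repeats, max_fraction=0.20):
    """Keep gene IDs whose repeat overlap does not exceed max_fraction."""
    return [gene_id for gene_id, span in genes.items()
            if overlap_fraction(span, repeats) <= max_fraction]


# Toy coordinates (not real data).
genes = {"geneA": (100, 1100), "geneB": (3000, 4000)}
repeats = [(50, 200), (150, 300), (1000, 2000), (3900, 4100)]
kept = filter_genes(genes, repeats)
print(kept)
```

Here geneA has 300 of its 1,000 bases inside repeats and is dropped, while geneB has 100 of 1,000 and is kept. For whole genomes an interval tree or a tool such as bedtools would scale better than this linear scan.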
2.10 Comparative genomics and phylogenetic analysis
We used OrthoFinder (Emms & Kelly, 2019) to identify orthologous groups in L. acutangula, L. cylindrica and seven other cucurbits (melon, cucumber, bottle gourd, watermelon, squash, pumpkin, bitter melon), three rosid species (peach, Arabidopsis, grape), one asterid species (tomato) and one monocot (rice). Protein sequences from single-copy orthologous groups were used to construct a phylogenetic tree using the RAxML-ng program (Kozlov, Darriba, Flouri, Morel, & Stamatakis, 2019). We first aligned protein sequences in each single-copy orthologous group with MUSCLE (Edgar, 2004) and removed alignment gaps with trimAl (Capella-Gutiérrez, Silla-Martínez, & Gabaldón, 2009) using the automated1 heuristic method. All alignment blocks were concatenated using the catsequences program (https://github.com/ChrisCreevey/catsequences), and the substitution model for each alignment block was estimated using the ModelTest-NG program (Darriba et al., 2019). The outputs were subsequently used to compute a maximum-likelihood phylogenetic tree. Divergence times of species in the phylogenetic tree were estimated with the MCMCtree program (PAML4 package) (Yang, 2007) using the relaxed-clock model with the known divergence time between cucumber and melon, which was estimated at 8.4–11.8 million years ago (MYA) (Sebastian et al., 2010).
2.11 Cucurbitaceae evolutionary analysis
We adapted the method described in Ren, Huang, and Cannon (2019) to reconstruct the ancestral genome of L. acutangula, pumpkin, bottle gourd, watermelon, melon and cucumber. In brief, we used OrthoFinder to identify orthologous groups in these six species. Syntenic blocks were then constructed from the orthologous groups using the DAGchainer program (Haas, Delcher, Wortman, & Salzberg, 2004) in the Synima pipeline (Farrer, 2017). The outputs from DAGchainer were used to specify “markers” representing features that were shared by the selected genomes using the scripts provided in Ren et al. (2019). In the following step, we used MLGO web service (http://www.geneorder.org/server.php) (Hu, Lin, & Tang, 2014) to infer the ancestral genome from the order of the markers in each individual genome and the information from the phylogenetic tree constructed using single-copy orthologs (Kozlov et al., 2019).
2.12 Luffa phylogenetic relationship and population structure analysis
Short-read whole-genome shotgun sequences of the 84 Luffa accessions were used for the phylogenetic analysis. Illumina reads were mapped to their respective genome assemblies using Minimap2 version 2.11-r797-v03 (Li, 2018), and single nucleotide polymorphism (SNP) markers were called using GATK HaplotypeCaller 3.8 (McKenna et al., 2010). We used a set of 11,704 SNP markers at fourfold-degenerate sites that met the following criteria: (a) a minor allele frequency >0.05; (b) depth coverage between 20X and 200X; (c) fewer than 10% missing data. The ape package in R was used to construct a neighbour-joining tree with 1,000 bootstrap replicates (Paradis, Claude, & Strimmer, 2004; Team, 2016). We applied the same set of SNPs to examine the population structure using the STRUCTURE program (version 2.3.4) (Falush, Stephens, & Pritchard, 2003) with 10,000 iterations and the number of clusters (K) set from 2 to 4.
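The three SNP criteria can be expressed as a per-site predicate. The sketch below is only one possible reading of them: genotypes are assumed to be diploid alternate-allele counts (0, 1, 2, or None for missing), and the depth criterion is interpreted as the mean depth across samples at the site, which the text does not specify. All names are hypothetical.

```python
def passes_filters(genotypes, depths, maf_min=0.05,
                   depth_min=20, depth_max=200, max_missing=0.10):
    """Return True if a biallelic SNP site meets the three criteria.

    genotypes: alternate-allele counts per sample (0, 1, 2) or None if missing.
    depths: read depth per sample at this site.
    """
    n_missing = genotypes.count(None)
    if n_missing / len(genotypes) >= max_missing:  # (c) fewer than 10% missing
        return False
    called = [g for g in genotypes if g is not None]
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)              # (a) minor allele frequency > 0.05
    mean_depth = sum(depths) / len(depths)         # (b) assumed: mean depth per site
    return maf > maf_min and depth_min <= mean_depth <= depth_max


# Toy site with 20 samples (not real data): one missing call, four alternate alleles.
site_genotypes = [0] * 16 + [1, 1, 2] + [None]
site_depths = [30] * 20
print(passes_filters(site_genotypes, site_depths))
```

In practice the same thresholds would normally be applied to the VCF with a dedicated tool; the point here is only to make the three conditions explicit.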
3 RESULTS
3.1 Genome sequencing and assembly
We selected two Thai elite inbred lines for sequencing: one ridge gourd (L. acutangula) cultivar AG-4 and one sponge gourd (L. cylindrica) cultivar SO-3. A whole-genome shotgun strategy was used to sequence and assemble both Luffa genomes from PacBio long-read data. A total of 2,823,498 (39.73 Gb) and 2,404,358 raw reads (39.65 Gb), representing 50.29X and 50.19X coverage based on the estimated genome size of 789.97 Mb (An et al., 2017), were generated for L. acutangula and L. cylindrica, respectively. De novo assemblies of PacBio sequences from L. acutangula and L. cylindrica yielded draft genomes of 734.6 Mb and 689.8 Mb in 2,280 and 3,570 scaffolds with scaffold N50 of 786,130 (L50 = 220 scaffolds) and 578,616 bases (L50 = 316 scaffolds), respectively (Table 1). Analyses of k-mer distribution of the genome sequencing reads provided estimated genome sizes of 760 Mb and 773 Mb for L. acutangula and L. cylindrica, respectively, close to the figure previously reported for L. cylindrica (An et al., 2017; Zhang et al., 2020) (Figure S1). The heterozygosity of L. acutangula and L. cylindrica was 0.41 and 0.25, respectively. The preliminary L. acutangula genome was further assembled using the Chicago (in vitro proximity ligation; 107 million read pairs) and HiC (in vivo fixation of chromosomes; 101 million read pairs) library data scaffolded with the HiRise software (Dovetail Genomics, Santa Cruz, CA, USA). The final assembly contained 13 chromosome-scale pseudomolecules (hereafter referred to as chromosomes, numbered according to size; Figure 1, Figure S2) greater than 1 Mb in length, corresponding to the haploid chromosome number in Luffa spp. (1n = 13, 2n = 26). The 13 chromosomes covered 618,333,454 bases or 84.06% of the 735-Mb L. acutangula assembly.
Metric | L. acutangula (PacBio) | L. acutangula (PacBio + Chicago) | L. acutangula (PacBio + Chicago + HiC) | L. cylindrica (PacBio) |
---|---|---|---|---|
N50 scaffold size (bases) | 786,130 | 104,669 | 47,609,564 | 578,616 |
L50 scaffold number | 220 | 1,430 | 8 | 316 |
N75 scaffold size (bases) | 360,404 | 46,914 | 42,543,272 | 222,372 |
L75 scaffold number | 571 | 4,103 | 12 | 789 |
N90 scaffold size (bases) | 145,274 | 23,392 | 35,168 | 62,111 |
L90 scaffold number | 1,037 | 7,385 | 834 | 1,667 |
Total (bases) | 734,615,403 | 734,942,309 | 735,610,612 | 689,872,192 |
Number of scaffolds | 2,280 | 15,410 | 7,871 | 3,570 |
Number of scaffolds ≥ 100 kb | 1,230 | 1,547 | 46 | 1,267 |
Number of scaffolds ≥ 1 Mb | 143 | 50 | 13 | 122 |
Number of scaffolds ≥ 10 Mb | — | — | 13 | — |
Longest scaffold (bases) | 8,225,708 | 3,510,179 | 56,032,585 | 7,054,290 |
% N | — | 0.0004 | 0.00135 | — |
GC content (%) | 36 | 36 | 36 | 36 |
BUSCO evaluation (% completeness) | — | — | 92.7 | 93.0 |
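
The N50/L50, N75/L75 and N90/L90 statistics reported above follow a standard definition: scaffolds are sorted from longest to shortest, and Nx is the length of the scaffold at which the running total first reaches x% of the assembly, with Lx the number of scaffolds needed to get there. A minimal sketch with toy lengths (the function name is hypothetical and the numbers are not from the assemblies):

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths, e.g. fraction=0.5 for N50/L50."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty and fraction must be in (0, 1]")


# Toy scaffold lengths totalling 300 bases.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths, 0.5)
n75, l75 = nx_lx(toy_lengths, 0.75)
print(f"N50 = {n50} (L50 = {l50}), N75 = {n75} (L75 = {l75})")
```

A higher Nx together with a lower Lx indicates a more contiguous assembly.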

To assess the quality of our de novo assemblies, we aligned genomic DNA short reads back to the genomes. Approximately 91% and 93.9% of the reads from Illumina shotgun libraries could be mapped back to L. acutangula and L. cylindrica genomes, respectively. We also aligned Iso-seq reads and publicly available RNA-seq reads (from L. cylindrica (Chen et al., 2015; Zhu et al., 2017)) to the assemblies, and 99.0% of Iso-seq transcripts were mapped to each respective genome while 98.3% of the RNA-seq reads could be mapped to the L. cylindrica genome. To further evaluate the completeness of both genome assemblies, we checked the gene content with the BUSCO software using a plant-specific database of 1,440 genes (Simão et al., 2015). Our gene predictions for L. acutangula and L. cylindrica recovered 92.7% and 93.0% of the highly conserved orthologs in the Embryophyta lineage, respectively (Table 1). Together, these results support the high quality of both Luffa genome assemblies.
A combination of ab initio prediction and transcript evidence obtained from both Iso-seq and RNA-seq data was used for gene prediction. The genome annotations of L. acutangula and L. cylindrica contained 42,211 and 50,340 predicted gene models, of which 32,233 and 43,828 were protein-coding genes, respectively (Tables S1 and S2; Figures S3–S5). Genes were preferentially distributed near the telomeres for most of the chromosomes (Figure 1). The numbers of protein-coding genes in Luffa were higher than the figures reported for other cucurbit genomes (cucumber = 23,248 (Li et al., 2011), watermelon = 23,440 (Guo et al., 2013), bottle gourd = 22,472 (Wu et al., 2017) and wax gourd = 27,467 (Xie et al., 2019)). In total, we found transcript support for 25,402 (78.8%; Iso-seq support) and 37,931 (86.5%; Iso-seq and RNA-seq support (Chen et al., 2015)) protein-coding genes in L. acutangula and L. cylindrica, respectively. The proportions of protein-coding genes supported by transcript evidence in Luffa were comparable to that of the bottle gourd genome (79.3%) (Wu et al., 2017) but slightly higher than that of the wax gourd genome (72.7%) (Xie et al., 2019). The average gene sizes for L. acutangula and L. cylindrica were 2,866 and 2,582 nt with 4.50 and 4.28 exons per gene, respectively (Table S1).
3.2 Comparative genomics and phylogenetic analyses
To investigate the evolutionary relationships between Luffa and other Cucurbitaceae species, we analysed the gene sets from nine cucurbits: cucumber (Cucumis sativus), melon (Cucumis melo), watermelon (Citrullus lanatus), bottle gourd (Lagenaria siceraria), squash (Cucurbita moschata), pumpkin (Cucurbita maxima), bitter gourd (Momordica charantia), L. acutangula and L. cylindrica; three rosids: Arabidopsis (Arabidopsis thaliana), grape (Vitis vinifera) and peach (Prunus persica); one asterid: tomato (Solanum lycopersicum); and one monocot: rice (Oryza sativa). Of 508,876 input proteins from 14 species, 433,861 (85.25%) were clustered into 19,759 orthologous groups. Sequence information from single-copy orthologous genes was used to construct a maximum-likelihood phylogenetic tree, and the divergence time was estimated based on the topology and branch length, revealing that the two Luffa species sequenced in this work diverged about 7.9 million years ago (MYA; Figure 2a). The ancestor of Luffa formed a sister clade to the ancestor of the tribes Cucurbiteae and Benincaseae, and the two clades diverged about 33.62 MYA. This placement in the phylogenetic tree was consistent with previous reports (Chomicki, Schaefer, & Renner, 2019; Renner & Schaefer, 2017).

We used the transversion rate at fourfold-degenerate synonymous sites (4DTv approach) to analyse orthologous gene pairs in order to estimate relative timing of evolutionary divergence between L. acutangula and closely related cucurbit species. This analysis showed that the speciation event between Luffa and other cucurbits occurred right around the time where the duplication event in C. maxima took place (Figure 2b and c). The distribution of 4DTvs among paralogous gene pairs for most species including Luffa had a peak that ranged from 0.3 to 0.6, with the maximum around 0.4 (Figure 2c). Examination of synteny between L. acutangula and L. cylindrica revealed extensive conservation in genome structure (Figure S6). With the exception of L. acutangula chromosomes 6 and 12, the remaining chromosomes exhibited a one-to-one relationship with L. cylindrica chromosomes. There appeared to be a shuffling of regions between L. acutangula chromosomes 6 and 12 and L. cylindrica chromosomes 4 and 13 (Figure S6).
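The 4DTv statistic can be made concrete with a short sketch. It computes the raw (uncorrected) transversion rate at third codon positions for codon pairs that share the same fourfold-degenerate two-base prefix in the standard genetic code. The inputs are assumed to be two in-frame, codon-aligned coding sequences; the function name and toy sequences are hypothetical, and any multiple-substitution correction the authors may have applied is not shown.

```python
# Two-base codon prefixes whose third position is fourfold degenerate
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def four_dtv(cds_a, cds_b):
    """Raw transversion rate at fourfold-degenerate third codon positions."""
    sites = transversions = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        codon_a, codon_b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        # Both codons must share the same fourfold-degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        x, y = codon_a[2], codon_b[2]
        if x not in "ACGT" or y not in "ACGT":  # skip gaps and ambiguous bases
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (x in PURINES) != (y in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")


# Toy aligned coding sequences (not real data), five codons each.
cds_a = "CTAGTCCCGATGGGT"
cds_b = "CTTGTCCCAATAGGA"
print(four_dtv(cds_a, cds_b))
```

In this toy pair, four codons qualify as fourfold-degenerate sites and two of them differ by a transversion; the ATG/ATA pair is ignored because its prefix is not fourfold degenerate.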
3.3 Expansion of repetitive elements leads to larger genome sizes in Luffa
We identified 456.69 and 391.65 Mb of repetitive elements in L. acutangula and L. cylindrica assemblies, representing 62.17% and 56.78% of the genomes, respectively (Figure 3, Table S3). The length and proportion of repetitive sequences in Luffa genomes were higher than those in the watermelon (160 Mb; 45.2%) (Guo et al., 2013), bottle gourd (147 Mb; 46.9%) (Wu et al., 2017) and bitter gourd (159 Mb; 52.5%) (Matsumura et al., 2019) but not as high as in the wax gourd genome (689 Mb; 75.5%) (Xie et al., 2019). LTR retrotransposons represented the majority of repetitive elements, comprising 41% and 35% of the genomes and 66% and 61% of all repetitive elements in L. acutangula and L. cylindrica, respectively (Figure 3a, Table S3). The most abundant LTR superfamilies, Gypsy and Copia, occupied 34% and 31% of the repetitive elements in L. acutangula and 31% and 26% of the repetitive elements in L. cylindrica, respectively (Table S3). A genome-wide distribution plot of each repeat type showed that DNA elements, LTRs, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) were enriched near the centromeric regions (Figure 1). The sizes of the assembled Luffa genomes are approximately twice as large as the genome assemblies of related Cucurbitaceae (Garcia-Mas et al., 2012; Guo et al., 2013; Huang et al., 2009; Ruggieri et al., 2018; Sun et al., 2017; Wu et al., 2017), with the exception of the wax gourd genome (Xie et al., 2019). The total length of LTR elements in Luffa is 8-fold greater than that in cucumber (Huang et al., 2009) and 6-fold greater than that in bottle gourd (Wu et al., 2017) (Table S4). In the absence of evidence supporting a recent whole-genome duplication event in Luffa, the substantial accumulation of transposable elements, especially LTR retrotransposons, likely contributed to the expansion of the Luffa genomes. The insertion times for LTR retrotransposons were estimated based on predicted full-length LTRs in the genome assembly.
Interestingly, LTRs started accumulating after the divergence of Luffa and Benincaseae species (Figure 3b, Figure S7). LTRs appeared to accumulate earlier in the wax gourd (around 6–8 MYA), melon (4–6 MYA) and bitter gourd (2–4 MYA) genomes than in the Luffa genome. A substantial proportion of LTRs in the Luffa genome has proliferated relatively recently (0–1 MYA), similar to the situations observed in the cucumber and squash genomes (Figure S7).

3.4 Cucurbit genome evolution
To investigate chromosome evolution of Luffa and other cucurbits including cucumber, melon, watermelon, bottle gourd and pumpkin, we analysed syntenic blocks across their genomes. Twelve ancestral chromosomes were inferred on the basis of syntenic and phylogenetic relationships among cucurbit genomes based on 20,026 orthologous groups. Three L. acutangula chromosomes (chromosomes 1, 2 and 9) exhibited a one-to-one relationship with the M. charantia genome (chromosomes 1, 11 and 2, respectively), while only two L. acutangula chromosomes exhibited a one-to-one syntenic relationship with C. melo and L. siceraria and a one-to-two relationship with C. maxima (due to the whole-genome duplication event in the Cucurbita genus; Figure 4). Relative to the twelve-chromosome ancestral configuration, the L. acutangula and M. charantia genomes appeared to best preserve the ancestral karyotype, followed by the C. maxima genome with six out of twenty chromosomes (chromosomes 3, 7, 10, 13, 15 and 20) remaining in the ancestral state despite the recent whole-genome duplication. All of the C. sativus and C. lanatus chromosomes and most L. siceraria and C. melo chromosomes were derived from a series of fusion and fission events.

3.5 Luffa population structure
To investigate genetic diversity and variation in Luffa germplasm, 61 L. acutangula and 23 L. cylindrica accessions were selected and shotgun-sequenced using the Illumina sequencing platform. A total of 1,402 and 558 Gb of high-quality, cleaned data with an average of 32.8- and 34.6-fold depth coverage were mapped to the L. acutangula and L. cylindrica genome assemblies with an average mapping rate of 92.2% and 93.9%, respectively (Table S5). The population structure was explored with STRUCTURE (Pritchard, Stephens, & Donnelly, 2000) using SNPs at fourfold-degenerate sites. We tested for a population structure ranging from 2 (K = 2) up to 4 subpopulations (K = 4; Figure 5). The results supported clustering of L. acutangula and L. cylindrica accessions into three distinct subgroups (Figure 5). Accessions originating from the same geographical location appeared to cluster together.

3.6 Identification of alternative splicing variants
The availability of both the whole-genome sequence and the full-length transcript isoforms enabled the investigation of alternative splicing events in Luffa. To the best of our knowledge, this is the first report on alternative splicing in Luffa spp. We employed the TAPIS pipeline (version 1.2.1) (Abdel-Ghany et al., 2016) and the SpliceGrapher program (Rogers, Thomas, Reddy, & Ben-Hur, 2012) to identify transcript variants exhibiting the following alternative splicing events: alternative 5’ donor site selection, alternative 3’ acceptor site selection, exon skipping and intron retention. A total of 1,191 and 1,641 alternative splicing events were detected in L. acutangula and L. cylindrica, respectively (Figure 6a). Alternative 3’ acceptor site selection was the most frequent event in both species (41% in L. acutangula and 37% in L. cylindrica), followed by alternative 5’ donor site selection (29% in L. acutangula and 28% in L. cylindrica). Exon skipping was the least prevalent mode of alternative splicing identified in L. acutangula, whereas intron retention was the least common form found in L. cylindrica (Figure 6a). In contrast to the observations made in Luffa, intron retention has been reported as the most prevalent alternative splicing mechanism in several plant species such as Arabidopsis (Marquez, Brown, Simpson, Barta, & Kalyna, 2012), soya bean (Shen et al., 2014), cotton (Feng, Xu, Liu, Cui, & Zhou, 2019), maize (Thatcher et al., 2016; Wang et al., 2016) and rice (Zhang et al., 2019). Different types of alternative splicing were also observed in a combinatorial manner in a single gene. Figure 6b illustrates an example of a transcript that was subjected to multiple forms of alternative splicing. Alternative splicing serves to diversify an organism's transcriptome, and recent data suggest that it is one of the mechanisms that plants use to adapt to a changing environment (Reddy, 2007; Shang, Cao, & Ma, 2017; Wang & Brendel, 2006).
Further studies using RNA samples from different tissues and various growth conditions, including biotic and abiotic stresses, will help elucidate the complete repertoire of transcript isoforms in Luffa species.
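The four event classes considered in this section can be illustrated with a minimal Python sketch that compares one alternative isoform against a reference isoform. This is a simplified illustration, not the algorithm implemented in TAPIS or SpliceGrapher. It assumes plus-strand transcripts, exons given as (start, end) pairs with the intron running from one exon's end to the next exon's start, and hypothetical coordinates.

```python
def introns(exons):
    """Derive introns as (donor, acceptor) pairs from sorted exon (start, end) pairs."""
    exons = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def classify_events(ref_exons, alt_exons):
    """Label splicing differences of an alternative isoform relative to a reference."""
    ref_introns = introns(ref_exons)
    ref_set = set(ref_introns)
    ref_donors = {donor for donor, _ in ref_introns}
    ref_acceptors = {acceptor for _, acceptor in ref_introns}
    events = []

    for donor, acceptor in introns(alt_exons):
        if (donor, acceptor) in ref_set:
            continue  # intron shared with the reference
        if donor in ref_donors and acceptor in ref_acceptors:
            # Known donor joined to a downstream known acceptor: exon(s) skipped
            events.append("exon_skipping")
        elif donor in ref_donors:
            events.append("alt_3prime_acceptor")
        elif acceptor in ref_acceptors:
            events.append("alt_5prime_donor")

    for donor, acceptor in ref_introns:
        # A reference intron lying strictly inside an alternative exon is retained
        if any(start < donor and acceptor < end for start, end in alt_exons):
            events.append("intron_retention")
    return events


# Hypothetical three-exon reference transcript
reference = [(0, 100), (200, 300), (400, 500)]
print(classify_events(reference, [(0, 100), (220, 300), (400, 500)]))  # ['alt_3prime_acceptor']
print(classify_events(reference, [(0, 90), (200, 300), (400, 500)]))   # ['alt_5prime_donor']
print(classify_events(reference, [(0, 100), (400, 500)]))              # ['exon_skipping']
print(classify_events(reference, [(0, 300), (400, 500)]))              # ['intron_retention']
```

Minus-strand transcripts would need donor and acceptor roles swapped, and a production tool would also have to handle incomplete transcripts and alternative transcription start or end sites, which this sketch ignores.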

3.7 Cucurbitacin biosynthesis pathway in Luffa
With the exception of bitter gourd, bitterness is an economically undesirable trait in cucurbits, including Luffa. We identified ten putative cucurbitacin biosynthetic enzymes in L. acutangula and L. cylindrica, including an oxidosqualene cyclase (Bi), eight cytochrome P450s (CYPs) and an acyltransferase (ACT), similar to the number reported in watermelon (Zhou et al., 2016). Since cucurbitacin biosynthetic pathways appeared to be conserved among cucurbits, we carried out a comparative analysis among Luffa, cucumber, watermelon and melon. We identified colinear regions on L. acutangula (chromosome 4) and L. cylindrica (chromosome 3) where the Bi clusters (containing Bi, three CYPs and ACT) were localized (Figure S8). These clusters were highly conserved in cucumber (chromosome 6), watermelon (chromosome 6) and melon (chromosome 11) (Zhou et al., 2016). The syntenic regions containing La490 and La890 (Figure S8) were also conserved except for the presence of two paralogous genes in the watermelon genome (Cl890A and Cl890B). The colinear regions encompassing La510 and La560 exhibited a lower degree of synteny among Luffa, cucumber, watermelon and melon (Figure S8), with full-length Cs540 and Cs550 orthologs being truncated in melon and missing in Luffa and watermelon.
4 DISCUSSION
Here, we present de novo genome assemblies of two cultivated Luffa species (L. acutangula and L. cylindrica) obtained from long-read PacBio sequencing. We obtained preliminary draft genomes of 734.6 Mb and 689.8 Mb with scaffold N50 of 786,130 and 578,616 bases for L. acutangula and L. cylindrica, respectively. The sizes of the preliminary assemblies were 96.5% and 89% of the estimated genome sizes in L. acutangula and L. cylindrica, respectively, based on k-mer analyses. A previous publication on the L. cylindrica genome assembly also reported that the assembly size (669 Mb) was smaller than the estimated genome size (737 Mb) based on k-mer analysis (91% of the estimated size) (Zhang et al., 2020). The discrepancy observed between the assembly size and the k-mer size estimation could be due to a large number of repetitive sequences (Pflug, Holmes, Burrus, Spencer Johnston, & Maddison, 2020). We further assembled the L. acutangula genome using the long-range Chicago and HiC techniques. The final assembly is the first high-quality, chromosome-scale genome assembly for L. acutangula. The numbers of protein-coding genes in L. acutangula and L. cylindrica obtained from our assemblies were similar to the figure previously reported for the L. cylindrica genome (Zhang et al., 2020), while the proportions of repetitive sequences identified in our genome assemblies (62% for L. acutangula and 57% for L. cylindrica) were lower than the percentage reported for L. cylindrica (73%) in Zhang et al. (2020). The availability of this assembly enabled comparative genomics and phylogenetic studies of the Cucurbitaceae family members. The sequence information from single-copy orthologous genes revealed that L. acutangula and L. cylindrica diverged approximately 7.9 MYA. The assembly also revealed no evidence supporting a recent whole-genome duplication event in Luffa, unlike in C. maxima and C. moschata.
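For readers less familiar with the assembly statistics quoted above, the following minimal Python sketch shows how a scaffold N50 and the ratio of assembly size to estimated genome size are conventionally computed. The function names and the toy scaffold lengths are hypothetical; only the 669 Mb and 737 Mb values come from the text (Zhang et al., 2020).

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_of_estimate(assembly_mb, estimate_mb):
    """Assembly size expressed as a percentage of the estimated genome size."""
    return 100.0 * assembly_mb / estimate_mb


# Toy scaffold lengths (hypothetical): total 20, half 10, reached at the scaffold of length 5
print(n50([2, 3, 4, 5, 6]))                  # 5

# Values reported for the earlier L. cylindrica assembly (Zhang et al., 2020)
print(round(percent_of_estimate(669, 737)))  # 91
```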
We also observed a substantial accumulation of transposable elements, especially the LTR retrotransposons, which likely contributed to the expansion of the Luffa genomes. Obtaining elite, nonbitter varieties for human consumption is one of the major goals for Luffa breeding programmes. The availability of Luffa genome assemblies facilitated the identification of putative cucurbitacin biosynthetic genes. Collinearity analysis among cucurbit species showed that the Bi clusters as well as other regions encompassing CYP genes exhibited synteny among Luffa, cucumber, watermelon and melon. Our high-quality genome assemblies along with the genomic variation information from L. acutangula and L. cylindrica germplasms provide invaluable resources for studying marker-trait association at a whole-genome level and for future comparative genomics and phylogenetic studies in Cucurbitaceae.
ACKNOWLEDGEMENTS
This study was supported by the National Omics Center under the National Science and Technology Development Agency, Thailand, grant number: 1000221.
CONFLICT OF INTEREST
The authors declare no competing financial interests.
AUTHOR CONTRIBUTIONS
The research study was designed by W.P. and S.T. Laboratory work was performed by W.P., T.Y., D.S., N.J., S.U., J.R.S., J.B. and S.M. (sample collection, DNA and RNA extraction, library construction and sequencing). Bioinformatics analyses were performed by C.S., C.N., W.N. and W.K. The manuscript was written and revised by W.P., and all authors reviewed it.
Open Research
DATA AVAILABILITY STATEMENT
L. acutangula and L. cylindrica genome assemblies and Iso-seq data have been submitted to the DDBJ/EMBL/GenBank databases under the following accession numbers: JAATNF000000000 (L. acutangula genome assembly), SRR11445640 (L. acutangula Iso-seq data), JAAVXE000000000 (L. cylindrica genome assembly) and SRR11452010 (L. cylindrica Iso-seq data). The scripts used to perform the assembly and analyses of the Luffa genomes are available at https://github.com/BeeKento/Luffa-acutangula-genome.