Multi-omics provides insights into genome evolution and betacyanin biosynthesis in the halophyte Suaeda salsa
Abstract
As an important halophyte in the Yellow River Delta, the Amaranthaceae C3 species Suaeda salsa (L.) Pall. has attracted much attention for its “red carpet” landscape, and its plants can be divided into red and green phenotypes according to the betacyanin content of the fleshy leaves. However, the S. salsa genome had not previously been sequenced, which limited understanding of this species at the molecular level. We constructed a high-quality chromosome-scale reference genome by combining high-throughput short-read sequencing, PacBio single-molecule real-time sequencing, and Hi-C sequencing, yielding a genome size of 445.10 Mb and a contig N50 of 2.94 Mb. Annotation of the S. salsa genome identified 298.76 Mb of repetitive sequences and 23 965 protein-coding genes; long terminal repeats were the most abundant repeat type, accounting for about 50.74% of the genome. Comparative genomics indicated that S. salsa underwent a whole-genome duplication event about 146.15 million years ago (Ma), and the estimated divergence time between S. salsa and Suaeda aralocaspica was about 16.9 Ma. A total of four betacyanins, namely betanidin, celosianin II, amaranthin, and 6′-O-malonyl-celosianin II, were identified and purified in both phenotypes, and two significantly up-regulated betacyanins (celosianin II and amaranthin) may be the main reason for the red color of the red phenotype. In addition, we performed transcriptomics and metabolomics in both phenotypes to explore the molecular mechanisms of pigment synthesis, and identified a series of structural genes and transcription factors related to betacyanin production in S. salsa.
1 Introduction
As a halophyte, the Amaranthaceae C3 species Suaeda salsa (L.) Pall. (2n = 2x = 18) is an annual herb widely distributed in the saline-alkali areas of Northeast, North, and Northwest China. S. salsa plays important roles in ecological restoration and is also the pioneer plant in the intertidal zones of the Yellow River Delta Nature Reserve (YRDNR) (Zhang et al., 2003). This species has potential as a high-quality vegetable and oil crop because of its high nutrient content (Zhang et al., 2008). Interestingly, S. salsa grown in the YRDNR can be divided into a red phenotype (RP) and a green phenotype (GP) according to the content of betacyanins accumulated in the leaves, and it has attracted much attention for the famous “red carpet” landscape that appears in the YRDNR every year (Liu et al., 2006) (Fig. 1). RP usually grows in the intertidal zone and is mainly affected by salt stress, while GP grows in areas with higher terrain and is largely influenced by drought stress (Wang et al., 2006).

The red pigments accumulated in the leaves of S. salsa, namely betacyanins, are red-violet, water-soluble colorants (Stintzing & Carle, 2007) with strong antioxidant capacity (Kanner et al., 2001). The potential health-promoting properties of betacyanins have also been extensively explored (Gandía-Herrero et al., 2016), including anticancer, hypolipidemic, hepatoprotective, anti-inflammatory, and antidiabetic functions (Clifford et al., 2015; Gengatharan et al., 2015). In addition, betacyanins are used as natural food pigments in the food industry because of their stability and nutritional value (Azeredo, 2009), and the color of betacyanins is more stable than that of anthocyanins (Tanaka et al., 2008). Although environmental factors (such as drought, low temperature, and salinity) can induce the production of both anthocyanins and betacyanins, their biosynthetic pathways are completely different (Jain & Gould, 2015), and no study has so far reported that these two pigments coexist in the same species. Among plants, the capacity to produce betacyanins has been found within only one order, the Caryophyllales, excluding the families Caryophyllaceae and Molluginaceae (Gandía-Herrero & García-Carmona, 2013). Moreover, some fungi such as Amanita muscaria have also been found to produce betacyanins (Gill, 1994). To date, betacyanins remain poorly understood because of their limited distribution in nature (Belhadj et al., 2017).
Several enzymes have been reported to be crucial in betacyanin synthesis. Polyphenol oxidase (PPO) is a copper-type bifunctional enzyme that can catalyze the conversion of tyrosine to 3-hydroxy-L-tyrosine (DOPA) (Strack et al., 2003); DOPA can be converted into betalamic acid by 4,5-DOPA extradiol dioxygenase (DODA) (Sasaki et al., 2005); and a cytochrome P450 enzyme known as CYP76AD1 can catalyze the production of cyclo-DOPA (Hatlestad et al., 2012). In addition, a series of glycosyltransferases (GTs) have been shown to be related to plant secondary metabolism, and complementary DNAs (cDNAs) encoding betanidin 6-O-glucosyltransferase (B6GT) (Heuer et al., 1996), betanidin 5-O-glucosyltransferase (B5GT) (Hans et al., 2004), and cyclo-DOPA 5-O-glucosyltransferase (cDOPA5GT) (Sasaki et al., 2005) have been identified in Caryophyllales species, with expression patterns consistent with the increase of betacyanins. Furthermore, several transcription factors (TFs) have been reported to play vital regulatory roles in betacyanin production. BvMYB1 regulates the transcription of CYP76AD1 and DODA, thereby activating the betacyanin biosynthetic pathway in beet (Hatlestad et al., 2012), while HpWRKY44, a member of the WRKY family, takes part in betacyanin biosynthesis in pitaya fruit by activating an HpCytP450-like enzyme (Cheng et al., 2017).
Although S. salsa has great economic and nutritional value due to its high betacyanin content, little information is available about its genome. The exploration of the molecular mechanisms of betacyanin biosynthesis in S. salsa has been greatly impeded by the lack of a high-quality genome assembly. Here we present a chromosome-level genome assembly of S. salsa based on highly accurate long reads (high-fidelity reads, HiFi reads) generated on the Pacific Biosciences (PacBio, Menlo Park, CA, USA) Sequel platform and on high-throughput chromosome conformation capture (Hi-C) techniques (Belton et al., 2012; Korlach et al., 2017), and use it to analyze gene family expansion and contraction and to estimate the divergence time of S. salsa. We also identified key genes involved in betacyanin biosynthesis and constructed the gene regulatory network of betacyanin in S. salsa by combining genomics, transcriptomics, and metabolomics, so as to provide fundamental information for improving the yield of valuable betacyanins in S. salsa.
2 Material and Methods
2.1 Plant material conditions
For genome sequencing, fresh and tender leaves of the red phenotype were collected from the YRDNR (Dongying City, China), which lies in a temperate zone with a pronounced continental monsoon climate (Bai et al., 2017). We chose the eastern area of the YRDNR as our sampling site, where the vegetation is mainly S. salsa. For transcriptome and metabolome analyses, leaves about 1.5 cm long collected in May were used as one biological replicate, and all leaf samples were stored in a liquid nitrogen tank before DNA and RNA extraction.
2.2 Genome sequencing
We used a modified CTAB method to extract high-quality genomic DNA from both phenotypes. The DNA was then assessed using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), and the remaining DNA samples were stored in a −20 °C freezer for sequencing library construction.
To assess the size, heterozygosity ratio, and repeat sequence ratio of the S. salsa genome, high-throughput sequencing was first performed on the MGISEQ-2000 platform. A total of 1 μg of DNA was used to construct a library with the MGI DNA Library Universal Kit (Vazyme, Nanjing, China) according to the manufacturer's instructions, and the library was then sequenced on the MGISEQ-2000 platform by Wuhan Frasergen Bioinformatics Information Co., Ltd. (Wuhan, China), producing about 64.57 Gb of raw reads. To construct a reference genome for S. salsa, a G-TUBE (Covaris, Woburn, MA, USA) was used to randomly shear DNA into fragments of about 15 kb. We used the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA) to build the SMRTbell HiFi library, which was run on the platform for 30 h by Wuhan Frasergen Bioinformatics Information Co., Ltd. (Sun et al., 2020), and 293.35 Gb of subreads were ultimately produced.
To construct a reference genome at the chromosome level, Hi-C sequencing data were used to anchor contigs. Specifically, fresh tissue samples from the same individual used for the genome assembly were immersed in formaldehyde for crosslinking, and the isolated nuclei were purified, blunt-end-repaired, and tagged with biotin. The biotin-labeled DNA was then captured to construct a library, and the library was sequenced on the MGISEQ-2000 platform, which eventually yielded 49.28 Gb of raw data.
2.3 Data quality control and genome survey
The short reads from high-throughput sequencing were filtered as follows: first, adaptors were removed from the sequencing reads; second, read pairs were excluded if either end had an average quality lower than 20; third, the ends of reads were trimmed if the average quality was lower than 20 in a sliding window of 5 bp; lastly, read pairs shorter than 75 bp were removed to obtain the clean data. After obtaining clean data, we used K-mer-based analysis (K = 17) to estimate genome size, heterozygosity, and repetitive sequence ratio with the GCE tools (Liu et al., 2013).
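The filtering rules listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the pipeline actually used: the function names are hypothetical, quality scores are assumed to be already decoded to integer Phred values, adaptor removal is assumed to have happened upstream, and trimming is assumed to proceed from the 3′ end only.

```python
from statistics import mean

MIN_Q = 20      # average-quality threshold from the text
WINDOW = 5      # sliding-window size in bp
MIN_LEN = 75    # minimum read length after trimming


def trim_3prime(quals):
    """Return how many bases to keep after trimming the 3' end.

    Assumption: bases are dropped one at a time from the 3' end while the
    last WINDOW bases average below MIN_Q.
    """
    end = len(quals)
    while end >= WINDOW and mean(quals[end - WINDOW:end]) < MIN_Q:
        end -= 1
    return end


def filter_pair(quals1, quals2):
    """Apply the pair-level rules; return kept lengths or None if discarded."""
    # Rule 2: discard the pair if either read averages below MIN_Q.
    if mean(quals1) < MIN_Q or mean(quals2) < MIN_Q:
        return None
    # Rule 3: window-based end trimming.
    kept1, kept2 = trim_3prime(quals1), trim_3prime(quals2)
    # Rule 4: discard the pair if either read is now too short.
    if kept1 < MIN_LEN or kept2 < MIN_LEN:
        return None
    return kept1, kept2
```

For example, `filter_pair([30] * 100 + [10] * 5, [30] * 80)` keeps the pair and trims the low-quality tail of the first read, whereas a pair containing a read with uniformly low quality is dropped.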
2.4 Genome assembly
PacBio SMRT sequencing produced 318.4 Gb of subreads in all, which were converted into HiFi reads using ccs (https://github.com/pacificbiosciences/unanimity) with the parameter “-minPasses 3”; about 19.69 Gb of long (>15 kb), high-accuracy (99%) HiFi reads (~44×) were generated. The HiFi data were then assembled using hifiasm (Cheng et al., 2021), and we finally acquired a 445.10 Mb S. salsa genome assembly containing 336 contigs with a contig N50 of 2.94 Mb.
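Two of the summary statistics quoted here, contig N50 and sequencing depth, follow standard definitions that are easy to state in code. The sketch below is illustrative only; the function names are hypothetical, and the contig lengths in the usage note are toy values rather than the S. salsa contigs.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


def depth(total_bases_gb, genome_size_mb):
    """Average sequencing depth (x) from data volume in Gb and genome size in Mb."""
    return total_bases_gb * 1000 / genome_size_mb
```

With the figures given in the text, `depth(19.69, 445.10)` is about 44.2, which matches the stated ~44× HiFi coverage. On a toy set of contig lengths, `n50([8, 8, 4, 3, 3, 2, 2, 2])` returns 8.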
About 110× raw data were generated by Hi-C library construction and sequencing, and 49.28 Gb of clean data were produced after quality control with HTQC (v1.92.310) (Yang et al., 2013). The clean data were aligned to the draft assembly with BWA-MEM (version 0.7.16a-r1181) (Li, 2013), and the alignments were filtered with 3D-DNA (Dudchenko et al., 2017) for chromosome construction. The contigs were anchored to scaffolds by 3D-DNA, and a genome-wide contact matrix was constructed and visually corrected using Juicebox (Durand et al., 2016); contigs were broken and corrected wherever internal assembly errors were found. In all, 366 contigs with errors were corrected by being broken into 339 shorter contigs, and these 339 contigs were sorted and anchored to construct a chromosome-level genome of 422.1 Mb.
2.5 RNA sequencing for genome function annotation
We collected buds, roots, leaves, and seeds of S. salsa to make mixed samples for genome functional annotation. The tissues were mixed together and thoroughly ground, and then used for RNA extraction and library construction. For each sample, 2 µg of RNA extracted with the QIAGEN Total RNA Extraction Kit (QIAGEN, Hilden, Germany) was used for library construction and sequencing. A Ribo-zero kit was used to remove rRNA and enrich mRNA, and purified cDNA was amplified and then sequenced on the MGISEQ-2000 platform.
2.6 Genome quality evaluation
After the assembly was completed, we evaluated the results in three ways. First, the HiFi data were mapped to the assembled reference genome with Minimap2 (v2.5, default parameters; Li, 2018), and the read alignment rate, genome coverage, and sequencing depth were calculated to assess the completeness of the assembly and the uniformity of coverage. Second, the short reads were aligned back to the assembled reference genome with BWA (Li, 2013) to estimate the alignment rate, and GATK (McKenna et al., 2010) was used to detect heterozygous and homozygous single nucleotide polymorphisms (SNPs). Finally, the completeness of the gene space in the assembly was assessed with Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0.2; Simão et al., 2015) based on the single-copy orthologous gene set in OrthoDB, and the completeness, degree of fragmentation, and possible loss rate of the genome assembly were calculated.
2.7 Genome annotation
Two methods, homology-based and de novo annotation, were used to identify repetitive sequences in the S. salsa reference genome. First, we used RepeatMasker (open-4.09) and RepeatProteinMask (open-4.09) to search for TE sequences against the Repbase database (release 21.01) (http://www.repeatmasker.org; Bao et al., 2015) based on homology; second, we used TRF (Benson, 1999), RepeatModeler (open-1.0.11) (Flynn et al., 2020), and LTR-FINDER (v1.0.5) (Xu & Wang, 2007) to construct a de novo repetitive sequence database, and then used RepeatMasker to identify repeat content with the combined library files generated by these two methods.
Homology-based prediction, transcriptome-based prediction, and de novo prediction were used to identify the structure of protein-coding genes. First, the protein-coding sequences of Arabidopsis thaliana, Beta vulgaris, Chenopodium quinoa, Fagopyrum tataricum, Spinacia oleracea, Suaeda aralocaspica, and Suaeda glauca were used as inputs to exonerate for homology-based gene prediction; second, we assembled the RNA-seq data using Trinity (Grabherr et al., 2011), and the assembled sequences were used for gene structure prediction with PASA (Haas, 2003); third, we used Augustus for de novo gene structure prediction. The results of these three independent methods were combined using Maker (Holt & Yandell, 2011), yielding 23 965 protein-coding genes. To identify the functions of these protein-coding genes, their protein sequences were aligned against public protein databases including NR (Kim et al., 2005), TrEMBL (Boeckmann, 2003), InterPro (Mitchell et al., 2015), Swiss-Prot (Boeckmann, 2003), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2012) using BLASTP (v2.6.0+) (Camacho et al., 2009) and DIAMOND (Buchfink et al., 2021) with an e-value threshold of 1e−5.
For the annotation of non-coding RNAs, tRNAscan-SE (v1.3.1) (Lowe & Eddy, 1997) was used to find tRNA sequences; rRNAs were identified by aligning rRNA template sequences to the genome using BLASTN; and Infernal (Nawrocki & Eddy, 2013), which is distributed with Rfam, was used to predict miRNA and snRNA sequences in the S. salsa genome.
2.8 Phylogenetic analysis
To construct the evolutionary tree of S. salsa, we first clustered the protein sequences of S. salsa and 14 other species with OrthoMCL (v14-137) (Li et al., 2003) based on sequence similarity, with the BLASTP (Camacho et al., 2009) e-value threshold set to 1e−5. Second, we removed alignments with identity < 30% or coverage < 50%, and the inflation parameter of MCL clustering in OrthoMCL was set to 1.5. Third, 296 common single-copy genes were obtained after gene clustering; proteins shorter than 100 amino acids were filtered out, and the remaining 287 common single-copy genes were used as inputs to MUSCLE (v3.8.31) (Edgar, 2004) for multiple sequence alignment. Finally, the evolutionary tree was built with RAxML (v8.2.11) using the maximum likelihood method (Stamatakis, 2014), and the divergence times for all species pairs in the phylogenetic tree were estimated using r8s (v1.71) (Sanderson, 2003) and the mcmctree program in the PAML (v4.9e) package (Yang, 2007).
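The alignment filter described in this paragraph (e-value, identity, and coverage cutoffs) amounts to a simple predicate over BLASTP hits. The sketch below is a hedged illustration: the `Hit` record and `passes` function are hypothetical names, and coverage is assumed to mean the percentage of the query covered by the alignment, which the text does not specify.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity of the alignment
    coverage: float   # percent of the query covered (assumed definition)
    evalue: float


def passes(hit, max_evalue=1e-5, min_identity=30.0, min_coverage=50.0):
    """Keep a hit only if it meets all three cutoffs quoted in the text."""
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
    )


hits = [
    Hit("geneA", "geneB", 45.0, 80.0, 1e-30),   # kept
    Hit("geneA", "geneC", 25.0, 90.0, 1e-12),   # identity too low
    Hit("geneA", "geneD", 60.0, 40.0, 1e-20),   # coverage too low
]
kept = [h for h in hits if passes(h)]
```

Only the first toy hit survives; the surviving pairs would then be passed to the clustering step.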
2.9 Gene family, synteny analysis, and whole-genome duplication (WGD) analysis
To identify gene family expansion and contraction in S. salsa, CAFE (Han et al., 2013) was used to search for gene family expansion and contraction events in each lineage with Q-value < 0.05. The expanded and contracted gene families were subjected to Gene Ontology (GO)/KEGG enrichment analysis using Fisher's exact test, and the P-values were corrected for multiple testing by the false discovery rate (FDR) to obtain Q-values. GO/KEGG pathways with Q-value (FDR) < 0.05 were defined as significantly enriched in this study.
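For readers unfamiliar with enrichment testing, the one-sided Fisher's exact test used for over-representation is equivalent to the upper tail of a hypergeometric distribution. The following sketch shows that calculation; the function and argument names are hypothetical, and the exact test variant used by the authors' software is not stated in the text.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """Upper-tail hypergeometric P-value (one-sided Fisher's exact test).

    N: genes in the background set
    K: background genes annotated to the pathway
    n: genes in the list of interest (e.g. an expanded family set)
    k: genes in the list of interest annotated to the pathway
    Returns P(X >= k) when n genes are drawn without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

The resulting P-values, one per pathway, would then be adjusted for multiple testing to give the Q-values that are compared against the 0.05 cutoff.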
To understand the genome evolution of S. salsa, we performed a synteny analysis among the genomes of S. salsa, B. vulgaris, and F. tataricum, and a whole-genome comparison was performed with MCscan (http://chibba.agtec.uga.edu/duplication/mcscan/). We evaluated whole-genome duplication events in the evolutionary history of these species by calculating the 4DTv (transversion rate at four-fold degenerate sites) value of the gene pairs contained in syntenic segments. Specifically, we extracted the paralogous and orthologous gene pairs from syntenic blocks among these three species, calculated the 4DTv distances using the HKY substitution model, and evaluated the WGD events in each genome according to the 4DTv values. A 4DTv peak within a species represents a whole-genome duplication event of that species, while a 4DTv peak between species indicates the relative divergence time between them. The final divergence time (T) of a WGD was estimated by the formula T = Ks/2r (2r = 0.013), in which Ks is the peak value of the corresponding WGD (Zhang et al., 2017).
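The dating formula T = Ks/2r is simple enough to check by hand. The sketch below assumes that 2r = 0.013 is expressed in substitutions per synonymous site per million years, so that T comes out in Ma; this unit reading is an assumption, but it reproduces the dates reported in the Results for the Ks peaks at 1.9, 0.53, and 0.80. The function name is hypothetical.

```python
def ks_to_ma(ks_peak, two_r=0.013):
    """Convert a Ks peak to a time in Ma using T = Ks / 2r.

    Assumption: two_r is in substitutions per synonymous site per million years.
    """
    return ks_peak / two_r


for ks in (1.9, 0.53, 0.80):
    print(f"Ks = {ks:.2f} -> {ks_to_ma(ks):.2f} Ma")
```

This prints 146.15, 40.77, and 61.54 Ma, respectively. Because T scales linearly with Ks, the estimates are only as reliable as the assumed substitution rate.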
2.10 Identification of DODA, CYP76AD, cDOPA5GT, and B5GT
CYP76AD proteins (GenBank accession: KR376350-KR376501, HQ656024-HQ656026) from Hatlestad et al. (2012) and Sheehan et al. (2020) were used as queries to collect homologs; DODA proteins of B. vulgaris (GenBank accession: HQ656027), Portulaca grandiflora (GenBank accession: AJ580598), and Mirabilis jalapa (GenBank accession: AB435372), combined with the DODA protein sequences (GenBank accession: KR376141-KR376346) of Brockington et al. (2015), were used as queries to collect homologs; and cDOPA5GT and B5GT proteins from Vogt (2002) and Sasaki et al. (2005) were used as queries to collect homologs. These homologs were identified based on a hidden Markov model (HMM) using the HMMER program (Finn et al., 2011) and filtered with domain e-value < 1e−6 and coverage ≥ 0.8 (alignment length/HMM length). Phylogenetic analyses were conducted in MEGA 7 (Zheng et al., 2021) to understand evolutionary relationships and to classify the genes encoding DODA and CYP76AD1 proteins, and all coding sequences were aligned using the default settings of ClustalW (Brockington et al., 2015). The phylogenetic tree was generated using the maximum likelihood method with 1000 bootstrap replicates, and iTOL (https://itol.embl.de/) was used to tidy the tree.
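The two HMMER filtering criteria can be written as a single predicate. The sketch below is illustrative: the function name is hypothetical, and the alignment length is assumed to be measured on the HMM coordinates of the domain hit (start and end positions, inclusive), since the text defines coverage only as alignment length divided by HMM length.

```python
def keep_domain_hit(domain_evalue, hmm_from, hmm_to, hmm_length,
                    max_evalue=1e-6, min_coverage=0.8):
    """Return True if a domain hit passes the e-value and coverage cutoffs.

    Assumption: coverage = (hmm_to - hmm_from + 1) / hmm_length.
    """
    alignment_length = hmm_to - hmm_from + 1
    coverage = alignment_length / hmm_length
    return domain_evalue < max_evalue and coverage >= min_coverage
```

For instance, a hit spanning positions 1 to 90 of a 100-state model with a domain e-value of 1e-20 is kept, while the same hit truncated at position 70 is rejected for insufficient coverage.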
2.11 Analysis of differentially expressed isoforms (DEIs) between GP and RP
We collected leaves of GP and RP with three biological replicates each, giving a total of six samples, and RNA was extracted from these samples to construct six libraries. Reads were mapped with Bowtie2, and the normalized expression level (FPKM) of each gene and transcript was quantified with RSEM (Li & Dewey, 2011).
DEIs between RP and GP were selected using DESeq2 (Love et al., 2014). The Benjamini and Hochberg method was used to calculate the FDR, and DEIs were defined as isoforms with log2(FC) ≥ 1 or ≤ −1 and Q-value ≤ 0.05. Cluster Profiler (Young et al., 2010) was used to perform GO (Yu et al., 2012) enrichment analysis of DEIs, and KEGG (Kanehisa et al., 2007) enrichment of DEIs was performed with KOBAS (Mao et al., 2005).
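The selection rule combines a multiple-testing adjustment with fold-change and Q-value cutoffs. The following sketch implements the textbook Benjamini and Hochberg step-up adjustment and the thresholds quoted above; it is not the DESeq2 implementation, which applies additional steps such as independent filtering, and the function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted Q-values in the same order as the input P-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def is_dei(log2_fc, qvalue, min_abs_log2_fc=1.0, max_q=0.05):
    """Thresholds from the text: |log2(FC)| >= 1 and Q-value <= 0.05."""
    return abs(log2_fc) >= min_abs_log2_fc and qvalue <= max_q
```

As a usage example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` gives Q-values of approximately 0.02, 0.04, 0.04, and 0.02, and an isoform with log2(FC) = −1.5 and Q = 0.02 would be called a DEI.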
2.12 Real-time quantitative polymerase chain reaction (RT-qPCR) validation
Here we chose 16 DEIs for RT-qPCR validation, and specific primers were designed and supplied by Invitrogen (Beijing, China; Table S1). RT-qPCR assays with three biological replicates were run on a real-time PCR system (ABI 7500; Applied Biosystems, Waltham, MA, USA). Relative gene expression levels were estimated using the method of Livak & Schmittgen (2001), with actin as the reference gene.
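Livak & Schmittgen (2001) describe the 2^−ΔΔCt method for relative quantification. Assuming that this is the calculation applied here, with actin as the reference gene, it can be sketched as follows; the function name and the Ct values in the usage note are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^-ddCt method.

    dCt  = Ct(target) - Ct(reference), computed per condition
    ddCt = dCt(sample) - dCt(control)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2 ** -(delta_sample - delta_control)
```

With toy Ct values of 22 (target) and 18 (actin) in one phenotype and 25 and 18 in the other, the function returns 8.0, that is, an eight-fold higher relative expression in the first phenotype. The method assumes near-100% amplification efficiency for both primer pairs.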
2.13 Extracting process of betacyanins
25 kg of plant material was extracted with acid water (pH = 5) at room temperature for 3 days (m/v = 1/3). The extract was loaded onto resin (Diaion HP2MGL, Tokyo, Japan) and eluted with acid water (pH = 5) to obtain fraction YDRL-D0. YDRL-D0 was separated by HPLC (column size: 80 × 500 mm; mobile phase: acetonitrile-water; gradient: 0% B for 5 min, 0%–55% B over 45 min, 100% B for 20 min; flow rate: 100 mL/min) to obtain fractions YDRL-D0-F2 and YDRL-D0-F3.
YDRL-D0-F2 was separated by Sephadex LH-20 (MeOH-H2O, 1:1, 0.5% TFA) to obtain fraction YDRL-D0-F2-S2. YDRL-D0-F2-S2 was separated by MPLC (column size: 20 × 500 mm; mobile phase: acetonitrile-water with 0.1% TFA; gradient: 0% B for 5 min, 0%–25% B over 25 min, 90% B for 10 min; flow rate: 30 mL/min) to obtain fraction YDRL-D0-F2-S2-F1, which was further separated by Sephadex LH-20 (MeOH-H2O, 1:1, 0.5% TFA) and Pre-HPLC (acetonitrile-water, 0.1% TFA; gradient: 1%–10% B over 12 min, 100% B for 2 min; flow rate: 25 mL/min) to obtain amaranthin (10 mg, purity: 97.95%).
YDRL-D0-F3 was separated by MPLC (column size: 20 × 500 mm; mobile phase: acetonitrile-water with 0.1% TFA; gradient: 0% B for 5 min, 0%–25% B over 25 min, 90% B for 10 min; flow rate: 30 mL/min) to obtain fractions YDRL-D0-F3-F1 and YDRL-D0-F3-F2. YDRL-D0-F3-F1 was purified by Pre-HPLC (mobile phase: acetonitrile-water with 0.1% TFA) to obtain fraction YDRL-D0-F3-F1-P1 and celosianin II (21 mg, purity: 95.38%). YDRL-D0-F3-F1-P1 was separated by Sephadex LH-20 (MeOH-H2O, 1:1, 0.5% TFA) and Pre-HPLC (acetonitrile-water, 0.1% TFA) to obtain 1 mg of betanidin (purity: 94.27%). YDRL-D0-F3-F2 was purified by Pre-HPLC (mobile phase: acetonitrile-water with 0.1% TFA; gradient: 10%–20% B over 12 min, 100% B for 2 min; flow rate: 25 mL/min) to obtain 6′-O-malonyl-celosianin II (12 mg, purity: 97.04%).
2.14 Widely targeted metabolomics of GP and RP
The lyophilized leaves were ground to powder (30 Hz, 1.5 min) with a grinder (MM400; Retsch, Haan, Germany), and 0.1 g of the sample was dissolved in 1.0 mL of extraction solution (70% methanol in water containing 0.1 mg/L lidocaine). The extraction solution was stored in a refrigerator at 4 °C for 24 h to improve the extraction rate of the samples. Following centrifugation (10 000 g, 10 min, 4 °C), the extracts were absorbed (CNWBOND Carbon-GCB SPE Cartridge, 250 mg, 3 mL; ANPEL, Shanghai, China) and filtered (0.22 μm pore size) before LC-MS analysis.
The data acquisition instruments were as follows: ultra-high performance liquid chromatography (Shim-pack UFLC SHIMADZU CBM30A, Kyoto, Japan) and triple quadrupole tandem mass spectrometry (MS/MS) on an Applied Biosystems 6500 QTRAP and an Applied Biosystems Sciex Triple TOF 6600+ (Waltham, MA, USA). The specific parameters were as follows: (i) the chromatographic column was a Waters ACQUITY UPLC HSS T3 C18, 1.8 µm (Milford, MA, USA), 2.1 mm × 100 mm; (ii) in the mobile phase, the aqueous phase A was ultrapure water with 0.04% acetic acid, and the organic phase B was acetonitrile; (iii) the elution gradient was A/B (V/V) 95/5 at 0 min, 5/95 at 11.0 min, 5/95 at 12.0 min, 95/5 at 12.1 min, and 95/5 at 15.0 min; (iv) the column flow rate was 0.4 mL/min, and the column temperature was 40 °C; and (v) the sample injection volume was 2 μL. The mass spectrometry parameters were as follows: the electrospray ionization temperature was set to 500 °C, the mass spectrometry voltage was set to 5500 V, ion source gas I (GS I) and gas II (GS II) were set to 55 and 60 psi, respectively, and collision-activated dissociation was set to high (Chen et al., 2013).
Metabolite databases such as Metlin, MassBank, MassFrontier, and the Metaware Database were combined with the secondary spectrum data of metabolites for qualitative analysis, and metabolite quantification was carried out in MRM mode (Fraga et al., 2010). We first screened for metabolites with fold change (FC) ≥ 2 or ≤ 0.5, and then selected those with variable importance in projection (VIP) ≥ 1 as the differential metabolites (DMs).
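The two-step screen for differential metabolites reduces to one boolean condition per metabolite. The sketch below reads the fold-change criterion as "at least two-fold up or at least two-fold down", which is the only satisfiable interpretation of the stated cutoffs; the function name and the toy records are hypothetical.

```python
def is_differential(fold_change, vip, up=2.0, down=0.5, min_vip=1.0):
    """FC >= 2 or FC <= 0.5, combined with VIP >= 1."""
    return (fold_change >= up or fold_change <= down) and vip >= min_vip


# Toy records: (metabolite label, fold change RP/GP, VIP score)
records = [
    ("metabolite_1", 3.2, 1.4),   # up-regulated, informative -> kept
    ("metabolite_2", 0.4, 1.1),   # down-regulated, informative -> kept
    ("metabolite_3", 1.3, 1.8),   # fold change too small
    ("metabolite_4", 5.0, 0.6),   # VIP too low
]
differential = [name for name, fc, vip in records if is_differential(fc, vip)]
```

Here only the first two toy metabolites are retained as DMs.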
3 Results and Discussion
3.1 S. salsa genome assembly
To assess the basic information of the genome, 64.57 Gb of raw data underwent quality control and we finally obtained 61.17 Gb of clean data. The estimated genome size of S. salsa was about 431.21 Mb with the heterozygosity rate of 0.90%, and the proportion of repetitive sequences was 60.43% with the GC content of 35.5% based on K-mer distribution assessment (K = 17) (Fig. S1; Table S2). The statistics of MGI short reads for survey, HiFi sequencing and Hi-C sequencing were presented in Tables S3, S4.
To obtain a draft genome assembly of S. salsa, 19.69 Gb of HiFi reads were corrected and assembled with hifiasm to produce a 445.10 Mb genome with a contig N50 of 2.94 Mb. These contigs were broken, sorted, and clustered based on the Hi-C data, and finally nine chromosome-scale sequences comprising 339 contigs were produced, with a size of 422.10 Mb, accounting for 94.87% of the draft assembly size (Fig. 2; Table S5).

To assess the quality of the genome assembly, the PacBio HiFi reads were first re-aligned to the chromosome-scale reference genome with minimap2. The results showed that 98.09% of HiFi reads aligned to the assembled reference genome, covering 99.95% of the whole genome (Table S6). The short-read data were also aligned to the assembled reference genome, and 2561 homozygous SNPs (an error rate of about 0.0006%) were identified with GATK, which implied that the accuracy of the genome assembly was about 99.999%. Finally, the genome was evaluated using BUSCO, and the results showed that complete BUSCOs reached 95.40% (Table S7); the Hi-C interaction heatmap also indicated a high-quality assembly for this species (Fig. S2). In summary, the S. salsa genome we assembled is a high-quality chromosome-level reference genome (Tables 1, S6).
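The base-level error rate quoted in this paragraph can be reproduced with one division, on the reasoning that homozygous SNPs called from reads of the same individual reflect assembly errors. The sketch below assumes the denominator is the 445.10 Mb draft assembly; the text does not state which genome size was used, and the function name is hypothetical.

```python
def snp_error_rate_percent(n_homozygous_snps, genome_size_bp):
    """Percentage of assembled bases contradicted by homozygous SNP calls."""
    return 100 * n_homozygous_snps / genome_size_bp


error = snp_error_rate_percent(2561, 445_100_000)
accuracy = 100 - error
print(f"error rate ~ {error:.4f}%, accuracy ~ {accuracy:.3f}%")
```

This gives an error rate of roughly 0.0006% and an accuracy of roughly 99.999%, consistent with the figures reported. Using the 422.10 Mb anchored size as the denominator instead changes the result only in the fifth decimal place.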
Assembly feature | Value |
---|---|
Size of the assembly (Mb) | 445.10 |
Contig number | 366 |
Contig N50 (Mb) | 2.71 |
Scaffold N50 (Mb) | 47.37 |
Number of superscaffold chromosomes | 9 |
Assembled superscaffold chromosome size (Mb) | 422.26 |
Assembled superscaffold chromosome contigs | 339 |
Complete BUSCOs in genome | 95.40% |
Heterozygosity | 0.90% |
Number of genes | 23 965 |
Complete BUSCOs in annotation | 93.40% |
3.2 Genome annotation
A total of 298.76 Mb of repetitive sequences were identified by repeat annotation in the S. salsa reference genome, accounting for 67.12% of the entire genome (Table S8). Detailed transposable element (TE) annotation showed that 64.76% of the S. salsa genome assembly was composed of TEs; long terminal repeats were the most abundant type, with a total of 225.83 Mb, accounting for 50.74% of the whole genome, while DNA TEs ranked second, occupying 9.79% of the S. salsa genome. Unknown repeats, long interspersed nuclear elements, and short interspersed nuclear elements occupied 4.76%, 4.04%, and 0.08% of the genome, respectively (Table S9). As for non-coding RNAs in the S. salsa genome, about 69 miRNAs, 628 tRNAs, 567 rRNAs, and 176 snRNAs were predicted (Table S10). By combining de novo prediction, homologous protein prediction, and transcriptome-assisted prediction, 23 965 protein-coding genes and 33 870 transcript isoforms were identified in the S. salsa reference genome (Table S11). The average length of these genes was 6347 bp, and each gene contained an average of five exons. A total of 23 094 genes received a functional annotation, meaning that 96.37% of the protein-coding genes were functionally annotated (Table S12).
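The percentages in this paragraph are all simple ratios against the 445.10 Mb assembly or the 23 965 predicted genes, and they can be recomputed as a quick consistency check. The helper name below is hypothetical; the inputs are the values reported in the text.

```python
ASSEMBLY_MB = 445.10      # draft assembly size
TOTAL_GENES = 23965       # predicted protein-coding genes


def percent(part, whole):
    """Part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


repeat_pct = percent(298.76, ASSEMBLY_MB)      # all repetitive sequences
ltr_pct = percent(225.83, ASSEMBLY_MB)         # long terminal repeats
annotated_pct = percent(23094, TOTAL_GENES)    # functionally annotated genes
print(repeat_pct, ltr_pct, annotated_pct)
```

The three values come out as 67.12, 50.74, and 96.37, matching the reported percentages.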
To assess the accuracy of the protein-coding genes, gene features (including gene number, length distribution, GC content, and exon number distribution) were compared with those of related species (S. aralocaspica, S. glauca, B. vulgaris). The results showed that the genes missing from S. salsa were mainly short genes, especially those below 1000 bp (Fig. S3), and the BUSCO assessment also showed that the genome annotation was of good quality (Table 1).
Recent research suggested that low genomic GC content may be associated with plant adaptation to harsh nutrient environments (Wan et al., 2021). Here we found that the GC content of the S. salsa genome was very low (35.5%), with 793 genes exhibiting a GC content of < 32.0%; this genomic GC content was lower than that of non-xerophytic species in Amaranthaceae, such as S. oleracea (37.9%), B. vulgaris (35.9%), and Chenopodium quinoa (36.9%). These low-GC genes were highly enriched in “cutin, suberine and wax biosynthesis,” “naphthalene degradation,” “chloroalkane and chloroalkene degradation,” “novobiocin biosynthesis,” and “methane metabolism” in the KEGG enrichment (Fig. S4A), and they might have contributed to the saline-alkali adaptation of S. salsa. We also found that some low-GC genes encoding DODA (Ssal003T0098400.1 and Ssal003T0098400.2) and 5GT (Ssal_Un018T0017800.1) were enriched in “betalain biosynthesis” (Q-value < 0.05), while gene Ssal_Un018T0017800.1, encoding anthocyanidin 3-O-glucosyltransferase, was also enriched in “anthocyanin biosynthesis” (Q-value < 0.05).
3.3 Comparative genomics
To identify the evolutionary status and divergence time of S. salsa compared with related species, we first aligned the protein-coding sequences of S. salsa with those of 14 other species (Nymphaea colorata, Rosa chinensis, Solanum tuberosum, Oryza sativa, Phalaenopsis equestris, Arabidopsis thaliana, Vitis vinifera, Dianthus caryophyllus, F. tataricum, Atriplex hortensis, Amaranthus hypochondriacus, S. oleracea, B. vulgaris, and S. aralocaspica). The results showed that the 23 965 genes of S. salsa could be clustered into 12 548 gene families with 1.85 genes per family; among these gene families, 311 were unique to S. salsa, while 296 genes were identified as single-copy orthologous genes (Fig. S4B; Table S13).
We retained 287 common single-copy orthologous genes encoding proteins of at least 100 amino acids for multiple sequence alignment, so as to construct phylogenetic trees. The evolutionary tree of these genes was then constructed with RAxML, and calibration times of divergence were obtained from TimeTree (http://www.timetree.org/), a database of divergence times for a large number of known species, for O. sativa–N. colorata (174–203 million years ago [Ma]), O. sativa–P. equestris (104–125 Ma), O. sativa–A. thaliana (115–308 Ma), R. chinensis–A. thaliana (98–117 Ma), A. thaliana–V. vinifera (107–135 Ma), S. salsa–F. tataricum (73–91 Ma), S. salsa–D. caryophyllus (40–65 Ma), S. salsa–A. hypochondriacus (19–58 Ma), S. salsa–B. vulgaris (40–62 Ma), B. vulgaris–S. oleracea (22–61 Ma), and S. oleracea–A. hortensis (18–57 Ma), which helped us to estimate the divergence times more accurately. Among the betacyanin-producing species, S. salsa and S. aralocaspica are in the same genus, and S. salsa is also closely related to B. vulgaris and A. hypochondriacus. The common ancestor of S. salsa and B. vulgaris diverged from A. hypochondriacus at about 45.6 Ma, and S. salsa and S. aralocaspica diverged around 16.9 Ma (Fig. 3A). When comparing the gene families of S. salsa with those of three other betacyanin-producing species (B. vulgaris, A. hypochondriacus, and S. oleracea), we found that 82.01% (10 290/12 548) of the gene families in S. salsa were shared among all four genomes, and only 6.30% (791/12 548) were S. salsa specific (Fig. S4C). GO enrichment analysis showed significantly enriched GO terms including “cellular response to water deprivation and stimulus,” “serotonin metabolic process,” and “mitotic cell cycle” (Fig. S4D), and these biological processes are likely to be closely related to the salinity and alkali resistance of S. salsa.

Here we used MCScanX to assess synteny among F. tataricum, B. vulgaris and S. salsa within Caryophyllales, and found highly homologous sequences between S. salsa and B. vulgaris across nine chromosomes (Fig. 3B). There were about 11 699 syntenic gene pairs between S. salsa and B. vulgaris, distributed relatively uniformly across the chromosomes, with an average of about 1260 syntenic gene pairs per chromosome (Fig. S5A). There were about 6105 syntenic gene pairs between S. salsa and F. tataricum (Fig. S5B), and their distribution was highly heterogeneous; for example, there were almost no syntenic genes on chr2, chr4 and chr7 of the F. tataricum genome. Within the genus Suaeda, 14 015 syntenic gene pairs were identified between S. salsa and S. glauca (Fig. S5C), and 12 579 syntenic gene pairs were identified between S. salsa and S. aralocaspica (Fig. S5D). Although homologous genes for PPO, CYP76AD1, DODA and B5GT were identified in all three species, only some of the homologous genes for PPO, DODA and B5GT were located in collinear gene pairs.
There were 11 202 gene families in the most recent common ancestor (MRCA) of these 15 species, and about 962 gene families were significantly expanded while 1211 gene families were significantly contracted in S. salsa relative to the MRCA of S. salsa and S. aralocaspica (Fig. S6A). The KEGG enrichment of the expanded gene families is presented in Table S14 and Fig. S6B, while that of the contracted gene families is shown in Table S15 and Fig. S6C.
WGD is not only a process of genome doubling that increases genome complexity, but also a powerful force driving the evolution of plant genomes (Cui et al., 2006) and a prominent feature of plant genomes (Van de Peer et al., 2009). We analyzed the distributions of synonymous substitutions per site (Ks) and 4DTv using syntenic gene pairs among S. salsa, B. vulgaris and F. tataricum. The Ks distribution curves of B. vulgaris and S. salsa were essentially the same, and the combined Ks distributions (Figs. 3C, S7) showed that the common ancestor of B. vulgaris and S. salsa underwent the whole-genome triplication (γ-WGT) shared by core eudicots at a Ks value of 1.9 (about 146.15 Ma), and that the two species then diverged near a Ks value of 0.53 (about 40.77 Ma). F. tataricum first diverged from S. salsa and then underwent a WGD event at a Ks value of about 0.80 (about 61.54 Ma).
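The ages quoted above are consistent with the commonly used relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The text does not state the rate that was used; the value r = 6.5 × 10^-9 in the minimal sketch below is an assumption, chosen only because it reproduces all three reported ages, and the function and variable names are illustrative.

```python
# Assumed synonymous substitution rate (substitutions per site per year).
# This value is NOT stated in the text; it is inferred because it
# reproduces the three ages reported for the Ks peaks.
ASSUMED_RATE = 6.5e-9


def ks_to_ma(ks: float, rate: float = ASSUMED_RATE) -> float:
    """Convert a Ks peak to an age in million years (Ma) via T = Ks / (2r)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    years = ks / (2.0 * rate)
    return years / 1e6


# Ks peaks reported in the text, keyed by the event they are assigned to.
KS_PEAKS = {
    "gamma-WGT shared by core eudicots": 1.9,
    "S. salsa / B. vulgaris divergence": 0.53,
    "F. tataricum WGD": 0.80,
}

if __name__ == "__main__":
    for event, ks in KS_PEAKS.items():
        print(f"{event}: Ks = {ks:.2f} -> {ks_to_ma(ks):.2f} Ma")
```

Under this assumed rate, each increment of 0.1 in Ks corresponds to roughly 7.7 million years, which gives a sense of how strongly the estimated dates depend on where a Ks peak is placed.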
3.4 Identification of betacyanin biosynthetic pathway genes
The structural genes PPO, CYP76AD1, DODA, and B5GT have been reported to be vital for betacyanin biosynthesis (Polturak & Aharoni, 2018). In the S. salsa genome, Ssal001T0101700.1, Ssal001T0163500.1, and Ssal006T0211300.1 were identified as PPO. The DODA lineage experienced gene duplication that resulted in two main clades, termed DODA-α and DODA-β (Sheehan et al., 2020). In the S. salsa genome, Ssal002T0196300.1, Ssal003T0098400.2 and Ssal003T0145100.1 clustered in the DODA-α clade, and Ssal002T0077300.1 clustered in the DODA-β clade (Fig. 4A). The CYP76AD1 lineage can be divided into three paralogous lineages, namely, CYP76AD1-α, CYP76AD1-β, and CYP76AD1-γ (Brockington et al., 2015). In our study, 11 genes were annotated as CYP76AD1, of which Ssal003T0052000.1 and Ssal003T0191900.1 grouped in the CYP76AD1-α clade, Ssal008T0029300.1, Ssal008T0035500.1, and Ssal008T0122400.1 grouped in the CYP76AD1-β clade, and the other six genes (Ssal004T0001100.1, Ssal004T0009300.1, Ssal005T0111500.1, Ssal001T0174600.1, Ssal005T0169600.2 and Ssal004T0076600.1) fell into the CYP76AD1-γ clade (Fig. 4B). As for betacyanin-related GT enzymes, Ssal004T0181500.1 was identified as cDOPA5GT, while Ssal007T0038600.1, Ssal_Un018T0019100.1, Ssal007T0177300.1 and Ssal_Un018T0017800.1 were identified as B5GT.


To obtain detailed genetic information about betacyanin biosynthesis, we performed transcriptomics based on our high-quality genome and widely targeted metabolomics in GP and RP. The transcriptome data revealed 8345 DEIs between GP and RP, and 5344 isoforms were up-regulated while 3011 isoforms were down-regulated in RP compared with GP (Fig. S8). GO and KEGG enrichment of the DEIs between GP and RP are shown in Figs. S9A and S9B, respectively. The RT-qPCR validation also supported the accuracy and reliability of the transcriptome data (Fig. S10). Ssal001T0101700.1, Ssal001T0163500.1, and Ssal006T0211300.1, annotated as PPO, were significantly up-regulated in RP, with log2FC values of 8.93, 5.23, and 8.21, respectively; Ssal003T0098400.1, Ssal003T0098400.2, and Ssal002T0196300.1, identified as DODA, were significantly up-regulated in RP, with log2FC values of 5.52, 7.63, and 1.33, respectively. Five isoforms (Ssal003T0052000.1, Ssal008T0029300.1, Ssal003T0191900.1, Ssal004T0001100.2, and Ssal004T0076600.1) encoding CYP76AD1 were identified as DEIs in RP, and Ssal003T0052000.1, Ssal008T0029300.1 and Ssal004T0001100.2 were expressed only in RP. Ssal007T0038600.1, Ssal007T0177300.1, and Ssal_Un018T0017800.1, annotated as B5GT, were also significantly up-regulated in RP, with log2FC values of 1.67, 11.55, and 5.23, respectively.
3.5 Combined metabolic and gene expression analysis reveals betacyanin biosynthetic pathway
We performed widely targeted metabolic profiling using leaf tissue of GP and RP, and both principal component analysis and correlation analysis showed strong repeatability among the biological replicates of RP and GP (Fig. S11). A total of 630 metabolites were identified in both phenotypes, and 137 metabolites were significantly up-regulated and 165 were significantly down-regulated in RP (Fig. S12); these differential metabolites were enriched in “retrograde endocannabinoid signaling,” “lysine degradation,” “linoleic acid metabolism,” “glycerophospholipid metabolism,” “flavonoid biosynthesis,” “flavone and flavonol biosynthesis,” and so on. A total of four betacyanins, namely betanidin, celosianin II, amaranthin and 6′-O-malonyl-celosianin II, were detected in both phenotypes (Fig. S13). The significant decrease in the content of tyrosine, the precursor, might be related to the synthesis of betacyanins in RP. Among these betacyanins, amaranthin and celosianin II were significantly up-regulated in RP, with fold changes of 39.73 and 23.94, respectively, and these two betacyanins may be the main reason for the red color of RP (Fig. 4C). Almost all structural genes involved in betacyanin synthesis, from the upstream gene PPO to the downstream gene B5GT, were significantly up-regulated in RP, which was consistent with the increase in amaranthin and celosianin II, suggesting that these structural genes play important roles in betacyanin accumulation in S. salsa.
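Metabolite changes are reported here as fold changes, whereas isoform changes are reported as log2FC, so the two are not on the same scale. A minimal sketch of the conversion between them is given below; the function names are illustrative, and the only data used are the two betacyanin fold changes quoted above.

```python
import math


def log2_fold_change(fold_change: float) -> float:
    """Return log2 of a fold change (RP/GP ratio); the ratio must be positive."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fold_change)


def fold_change_from_log2(log2fc: float) -> float:
    """Return the linear fold change corresponding to a log2FC value."""
    return 2.0 ** log2fc


# Fold changes of the two betacyanins reported as up-regulated in RP.
REPORTED_FOLD_CHANGES = {"amaranthin": 39.73, "celosianin II": 23.94}

if __name__ == "__main__":
    for name, fc in REPORTED_FOLD_CHANGES.items():
        print(f"{name}: fold change {fc} -> log2FC {log2_fold_change(fc):.2f}")
```

On the log2 scale, the two betacyanin fold changes are about 5.3 and 4.6, which is the same order as the log2FC values reported for several of the up-regulated structural gene isoforms.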
For the combined metabolome and transcriptome analysis, we calculated the pairwise correlation coefficient between the identified metabolites and isoforms, which showed that a large number of DMs and DEIs in the samples were highly correlated (|r| > 0.8). Of all significant correlations, metabolite content and isoform expression changed in the same direction in 18.39% and in opposite directions in 18.27% (Fig. S14). The canonical correlation analysis of DEIs and DMs in betalain biosynthesis showed that the expression of isoforms Ssal003T0098400.1, Ssal003T0098400.2, and Ssal002T0196300.1 encoding the DODA enzyme (K15777) and Ssal008T0062300.1 encoding the DDC enzyme (K01593) was closely related to celosianin II (mws2114) and amaranthin (mws2535) (Fig. S15).
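The pairwise screening described above can be sketched as follows. The text does not specify which correlation coefficient was used, so Pearson's r is an assumption here, and the sample values and the names metabolite_A, isoform_up, isoform_down and isoform_flat are hypothetical toy data rather than measurements from this study.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def strong_pairs(
    metabolites: Dict[str, Sequence[float]],
    isoforms: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Return (metabolite, isoform, r) for every pair with |r| > threshold."""
    pairs = []
    for m_name, m_values in metabolites.items():
        for i_name, i_values in isoforms.items():
            r = pearson(m_values, i_values)
            if abs(r) > threshold:
                pairs.append((m_name, i_name, r))
    return pairs


# Hypothetical toy profiles: three GP samples followed by three RP samples.
TOY_METABOLITES = {"metabolite_A": [1.0, 1.2, 0.9, 10.0, 11.0, 9.5]}
TOY_ISOFORMS = {
    "isoform_up": [2.0, 2.1, 1.9, 20.0, 22.0, 19.0],
    "isoform_down": [20.0, 21.0, 19.0, 2.0, 2.2, 1.8],
    "isoform_flat": [5.0, 7.0, 4.0, 6.0, 5.0, 6.5],
}

if __name__ == "__main__":
    for m_name, i_name, r in strong_pairs(TOY_METABOLITES, TOY_ISOFORMS):
        direction = "same" if r > 0 else "opposite"
        print(f"{m_name} vs {i_name}: r = {r:.3f} ({direction} direction)")
```

The sign of r separates pairs that change in the same direction from pairs that change in opposite directions, which is the distinction summarized in Fig. S14.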
3.6 Identification of potential TFs involved in betacyanin biosynthesis
Plant secondary metabolism plays a key role in plant-environment interaction, and the related genes are regulated by a variety of TFs at the transcriptional level (Yang et al., 2012). The transcriptional regulation of anthocyanin synthesis has been studied in depth, and a series of TFs involved in anthocyanin synthesis have been identified, including MYB, bHLH, AP2, and AP2/ERF (Yan et al., 2021). More importantly, transcription of the anthocyanin biosynthesis pathway is regulated by the highly conserved MYB-bHLH-WD repeat (MBW) transcription complex (Andrea et al., 2020). By contrast, there have been few studies on the transcriptional regulation of betacyanin biosynthesis, and only a few TFs related to betacyanin synthesis have been identified. Intriguingly, the MBW complex has been implicated in betacyanin biosynthesis (Timoneda et al., 2019), but the regulatory mechanism remains largely unknown. Hatlestad et al. (2012) found that BvMYB1 could regulate the betacyanin pathway; however, this R2R3-MYB gene cannot induce anthocyanin in Arabidopsis, suggesting that it is specific to betacyanin regulation. Therefore, a comprehensive identification of TFs related to betacyanin production is necessary and could help us to better understand the molecular basis of betacyanin biosynthesis in plants. The expression patterns of PPO, CYP76AD1, DODA and B5GT were closely linked to the expression patterns of 326 TFs belonging to 44 families, mainly ABI3VP1, MYB, bHLH, AP2-EREBP, NAC, and WRKY, as well as other families that might play a major role in the production of betacyanin (Fig. 4D).
Among the 36 MYBs, Ssal002T0033400.2, Ssal007T0026200.1, Ssal005T0213200.1, Ssal002T0209400.3, and Ssal005T0217400.1 were highly expressed in RP, with log2FC values of 8.81, 2.47, 6.88, 6.17, and 6.01, respectively; among the 37 TFs belonging to AP2-EREBP, Ssal009T0149100.2, Ssal009T0091800.1, Ssal005T0125300.1, and Ssal003T0140900.1 were up-regulated DEIs, with log2FC values of 6.52, 6.09, 5.88, and 4.94, respectively. Moreover, 5 bHLHs (Ssal008T0077000.1, Ssal009T0031400.1, Ssal004T0178400.1, Ssal006T0209400.1, and Ssal004T0085200.1) and 12 isoforms annotated as ABI3VP1 (Ssal006T0067300.1, Ssal002T0057400.2, Ssal005T0145700.1, Ssal005T0145700.2, Ssal003T0180000.1, Ssal002T0065400.2, Ssal009T0015600.1, Ssal002T0065400.1, Ssal_Un018T0010900.4, Ssal005T0121700.1, Ssal006T0096500.1, and Ssal006T0142000.1) were also identified as up-regulated DEIs in RP compared with GP. The TFs and structural gene isoforms identified in our study will be of great value for clarifying the transcriptional regulatory network of betacyanin biosynthesis.
As far as we know, the S. salsa genome in this study is the first high-quality chromosome-level genome of the genus Suaeda. The genome of S. aralocaspica was published in 2019 (Wang et al., 2019); however, it was not assembled to the chromosome level, which greatly limited its applicability to research on the genus Suaeda. With our S. salsa genome, comparative analysis revealed findings such as the WGD event, the species divergence times and, in particular, the location of the betacyanin biosynthetic genes on different chromosomes. The comparative genomics showed that S. salsa and S. aralocaspica are close in evolutionary position at the molecular level, indicating a close genetic relationship between these two species. Therefore, the high-quality S. salsa genome assembled in this paper can provide detailed reference data for future research on the genus Suaeda and its evolution.
4 Conclusion
A high-quality, chromosome-level genome assembly of S. salsa was generated using PacBio, next-generation sequencing platforms and Hi-C technology, and the final genome size of S. salsa is 445.10 Mb with a scaffold N50 of 47.37 Mb. In our research, a total of 23 965 protein-coding genes were predicted, and 19 486 genes were annotated. Comparative genomic analysis showed that a WGD event occurred in S. salsa about 146.15 Ma. Here we constructed the betacyanin biosynthesis network and selected many candidate structural genes and TFs related to betacyanin synthesis by the joint analysis of the transcriptome and metabolome. The genome, transcriptome, and metabolome data in this study could contribute to a better understanding of the phenotypic differences of S. salsa and provide valuable gene resources for the genetic improvement of betacyanin by molecular breeding strategies.
Acknowledgements
Our research was financially supported by the Funds for International Cooperation and Exchange of the National Natural Science Foundation of China-CONICYT (51961125201), the Joint Funds of the National Natural Science Foundation of China (U2006215) and the Introduction and Cultivation Plan of Youth Innovative Talent in Shandong Province Colleges, Research and Innovation Team of Coastal Wetland Ecological Protection and Restoration in the Yellow River Delta. We thank WuXi AppTec (Tianjin, China) for providing betacyanins extracted from S. salsa. We thank Wuhan Metware Biotechnology Co. Ltd. for widely targeted metabolomics. We thank Dr. Xiao-Man Jiang for instructions in data analysis.
Author Contributions
X.W., J.B., and J.X. designed the experiment. X.W., C.W., and D.W. collected the samples. X.W., W.W., J.L., and S.Y. prepared the figures. X.W. and J.X. wrote the paper. X.W., X.Y., and S.D. analyzed the data.
Open Research
Data Availability Statement
The whole-genome sequence data of S. salsa reported in this article have been deposited in the Genome Warehouse of the National Genomics Data Center, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under the BioProject ID number PRJCA006671.