SCycDB: A curated functional gene database for metagenomic profiling of sulphur cycling pathways
Abstract
Microorganisms play important roles in the biogeochemical cycling of sulphur (S), an essential element in the Earth's biosphere. Shotgun metagenome sequencing has opened a new avenue to advance our understanding of S cycling microbial communities. However, accurate metagenomic profiling of S cycling microbial communities remains technically challenging, mainly due to low coverage and inaccurate definition of S cycling gene families in public orthology databases. Here we developed a manually curated S cycling database (SCycDB) to profile S cycling functional genes and taxonomic groups for shotgun metagenomes. The developed SCycDB contains 207 gene families and 585,055 representative sequences affiliated with 52 phyla and 2684 genera of bacteria/archaea, and 20,761 homologous orthology groups were also included to reduce false positive sequence assignments. SCycDB was applied for functional and taxonomic analysis of S cycling microbial communities from four habitats (freshwater, hot spring, marine sediment and soil). Gene families and microorganisms involved in S reduction were abundant in the marine sediment, while those of S oxidation and dimethylsulphoniopropionate transformation were abundant in the soil. SCycDB is expected to be a useful tool for fast and accurate metagenomic analysis of S cycling microbial communities in the environment.
1 INTRODUCTION
Sulphur (S) is an essential component of important biomolecules such as amino acids, vitamins and enzymes. S cycling is an important biogeochemical process in the Earth's biosphere (Fike et al., 2015; Moran & Durham, 2019; Muyzer & Stams, 2008; Wasmund et al., 2017), and is usually coupled with carbon (C), nitrogen (N) and metal cycling in natural ecosystems (Buongiorno et al., 2019; Landa et al., 2019; Zhu et al., 2018). Microorganisms play important roles in the biogeochemical cycling of S compounds, which are present in a large variety of chemical forms and redox states (Wasmund et al., 2017). S is abundant with active metabolism in diverse environments, such as marine sediments, hot springs, peatlands and coastal sediments (Baker et al., 2015; Hausmann et al., 2018; Lin et al., 2015; Wasmund et al., 2017). Characterizing the function and taxonomy of S cycling microbial communities is therefore of critical importance to understand microbially mediated S cycling processes and their regulatory mechanisms in the environment.
The S cycle consists of inorganic and organic S transformations. In inorganic S transformation, assimilatory sulphate reduction and dissimilatory sulphate reduction processes as well as their key functional genes such as sat, aprAB and dsrAB have been well studied (Müller et al., 2015), while other inorganic S forms, such as thiosulphate, tetrathionate, polysulphide and elemental S, need further clarification in terms of functional genes, pathways and associated microorganisms involved in biotransformation. Organic S transformation and the linkages between inorganic and organic S transformations are also important in the S cycle. As one of the most abundant organosulphur compounds in marine ecosystems, dimethylsulphoniopropionate (DMSP) is mainly produced by phytoplankton and degraded by cleavage and demethylation pathways, subsequently resulting in the generation of dimethyl sulphide (DMS), a climate-active gas, which may influence global warming (Curson et al., 2011, 2018; Li et al., 2014; Moran et al., 2012). In addition, a previous study found that inorganic S oxidation was linked to the biodegradation of volatile organosulphur compounds via hdr-like genes (Koch & Dahl, 2018), indicating the importance of linkages between inorganic and organic S transformations in S cycling. However, microbially mediated S cycling is complex in the environment, and much remains to be learned regarding the genes and pathways, especially for organic S transformation and linkages between inorganic and organic S transformation. Thus, it is critical to develop capabilities for the rapid and accurate analysis of S cycling microbial communities via advanced technologies.
Recently, high-throughput amplicon sequencing of functional genes, such as dsrA and dsrB, has expanded our knowledge of the diversity and composition of sulphite-/sulphate-reducing microorganisms (Pelikan et al., 2016; Vigneron et al., 2018). For example, amplicon sequencing analysis of dsrB genes revealed that the diversity of sulphate reducers could be underestimated, with approximately one-third of detected genes belonging to uncharacterized lineages (Vigneron et al., 2018). However, as universal primers are not available for many S cycling genes, characterization of S cycling gene families and pathways as well as associated microorganisms cannot be resolved by amplicon sequencing approaches. The development of shotgun metagenome sequencing approaches has provided new insights into our understanding of biogeochemical cycling in natural ecosystems (Knight et al., 2018; Nayfach & Pollard, 2016; Quince et al., 2017; Sharpton, 2014). For metagenome sequencing data analysis, comprehensive and reliable orthology databases are of critical importance for accurate metagenomic profiling of functional gene families. A known complication is that the results of metagenomic analysis are substantially affected by the choice of orthology database (Nayfach & Pollard, 2016).
To date, several orthology databases such as arCOG (Archaeal Clusters of Orthologous Genes) (Makarova et al., 2015), COG (Clusters of Orthologous Groups) (Galperin et al., 2015), eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) (Huerta-Cepas et al., 2019) and KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al., 2016) have been developed and widely used for functional annotation in both genomic and metagenomic studies. These databases have their own distinct features due to differences in the design concept, with arCOG for archaeal annotation (Makarova et al., 2015), COG and eggNOG for annotation of orthologous groups (Galperin et al., 2015; Huerta-Cepas et al., 2019), and KEGG for linking genes with pathways (Kanehisa et al., 2016). These databases present several limitations for analysis of S cycling microbial communities, such as low coverage of S cycling genes (Baker et al., 2015; Vavourakis et al., 2019), difficulties in distinguishing homologous genes (e.g., sat vs. cysC, sbp vs. cysP, or psrA vs. phsA) (Marietou et al., 2018; Rückert, 2016; Wasmund et al., 2017), and long database searching time (Kim et al., 2013; Scholz et al., 2012). Recently, a specific “small database” NCycDB was developed to facilitate shotgun metagenome sequencing data analysis of nitrogen cycling gene families (Tu et al., 2019). NCycDB has been applied to profile N cycling microbial communities from various environments (Anwar et al., 2019; Zhang et al., 2020), demonstrating its high coverage, accuracy and efficiency. Therefore, it is essential to develop a comprehensive and accurate database for fast functional and taxonomic analysis of S cycling microbial communities in metagenomic studies.
In the present study, to understand the microbial ecology of the S cycle in the environment, we present a curated sulphur cycling database (SCycDB) containing 207 S cycling gene families and associated homologous groups involved in eight pathways, including assimilatory sulphate reduction, dissimilatory S reduction and oxidation, S reduction, SOX systems, S oxidation, S disproportionation, organic S transformation, and the linkages between inorganic and organic S transformation. By integrating multiple orthology databases, SCycDB is characterized by high specificity, comprehensiveness, representativeness and accuracy for rapid profiling of S cycling microbial communities. SCycDB was applied to functionally and taxonomically analyse metagenome sequencing data from four different habitats (freshwater, hot spring, marine sediment and soil). The results demonstrate that SCycDB is a powerful tool for rapid and accurate profiling of S cycling microbial communities, and can be widely used to analyse microbially mediated S cycling processes and underlying mechanisms in the environment.
2 METHODS
2.1 Database construction
An improved pipeline built upon a previous study was used to construct SCycDB (Tu et al., 2019) (Figure 1). First, a core database was manually constructed based on current knowledge of and literature on S cycling processes (Hausmann et al., 2018; Moran & Durham, 2019; Rückert, 2016; Vavourakis et al., 2019; Wasmund et al., 2017). S metabolism pathways in KEGG were also referenced (Kanehisa et al., 2016). By creating and refining keywords for each gene family involved in S cycling, seed sequences for each gene family were downloaded from the Swiss-Prot database (The Uniprot Consortium, 2017). For gene families without reference sequences in Swiss-Prot, manually checked high-quality sequences were downloaded from TrEMBL (The Uniprot Consortium, 2017). To ensure the accuracy of SCycDB, seed sequences for each gene family were manually checked based on their annotation and similarities with other sequences, especially for those without reference sequences in Swiss-Prot. Sequences downloaded from TrEMBL with the same keywords sharing ≥30% identity with seed sequences were merged with seed sequences, forming the core database (Figure 1a). Second, sequences belonging to S cycling gene families and their orthologues in public databases were identified and merged with the core database, forming the full database (Figure 1b). Publicly available orthology databases including arCOG, COG, eggNOG and KEGG were recruited and searched against the core database. Gene families involved in S cycling and their homologues were identified. Corresponding sequences were extracted and included in SCycDB. By doing so, the comprehensiveness of SCycDB was expected to improve, while the “small database” issue that may lead to increased false positive assignments was expected to diminish or be eliminated (Tu et al., 2019). In addition, corresponding sequences (S cycling gene families and homologues) in the NCBI archaea and bacteria RefSeq databases were also identified, extracted and merged. 
Taxonomic coverage of S cycling genes and pathways in SCycDB was summarized from corresponding sequences in the NCBI RefSeq. Sequences of both S cycling gene families and homologous gene families were clustered by cd-hit (Fu et al., 2012) at 100% identity. All representative sequences and related information were checked and used to construct SCycDB. Finally, we included PERL scripts with three candidate database searching tools (usearch, blast and diamond) for both functional and taxonomic profiling of shotgun metagenomes (Figure 1c). Both functional and taxonomic profiles can be generated by searching raw reads, predicted genes or protein sequences against SCycDB. A random subsampling function is also provided in the PERL scripts to eliminate sequencing depth differences among different samples. Functional profiles of S cycling microbial communities are provided at the gene family level. Taxonomic profiles of S cycling microbial communities are provided at various taxonomic levels.
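The random subsampling and gene-family-level profiling steps described above can be illustrated with a minimal Python sketch. This is not the PERL implementation distributed with SCycDB; the function names, the read-to-gene-family mapping and the toy data are assumptions made purely for illustration.

```python
import random
from collections import Counter


def subsample_reads(read_ids, depth, seed=0):
    """Randomly draw `depth` reads without replacement."""
    read_ids = list(read_ids)
    if depth > len(read_ids):
        raise ValueError("requested depth exceeds the number of reads")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(read_ids, depth)


def gene_family_profile(read_ids, read_to_family):
    """Count reads per gene family; reads without a database hit are ignored."""
    return Counter(read_to_family[r] for r in read_ids if r in read_to_family)


# Hypothetical toy data: best database hit per read, and two samples
# sequenced to different depths.
hits = {"r1": "dsrAB", "r2": "dsrAB", "r3": "soxB", "r5": "sat", "r6": "sat"}
sample_a = ["r1", "r2", "r3", "r4", "r5", "r6"]
sample_b = ["r1", "r3", "r4", "r5"]

# Subsample every sample to the depth of the shallowest one before counting.
depth = min(len(sample_a), len(sample_b))
profile_a = gene_family_profile(subsample_reads(sample_a, depth), hits)
profile_b = gene_family_profile(subsample_reads(sample_b, depth), hits)
```

Subsampling before counting means that differences between the two profiles reflect composition rather than sequencing effort, which is the purpose of the subsampling function described in the text.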

2.2 Database sources
We used the UniProt database to retrieve seed sequences and construct the core database (The Uniprot Consortium, 2017). The orthology databases used for database merging and homologous gene identification in this study included arCOG (Makarova et al., 2015), COG (Galperin et al., 2015), eggNOG (Huerta-Cepas et al., 2019) and KEGG (Kanehisa et al., 2016). The NCBI RefSeq database (O’Leary et al., 2016) of archaea and bacteria was used for enriching SCycDB and for taxonomically classifying S cycling microbial communities.
2.3 Case study
We applied SCycDB to analyse S cycling microbial communities from four distinct habitats: freshwater, hot spring, marine sediment and soil. The metagenome sequencing data files were downloaded from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra) (Table S1) (Bahram et al., 2018; Dodsworth et al., 2013; Mitchell et al., 2018; Seitz et al., 2016; Tran et al., 2018). The forward and reverse reads were merged with the program pear (options: -q 30) (Zhang et al., 2014). Merged sequences were searched against the arCOG, COG, eggNOG, KEGG and SCycDB databases with the diamond program (options: -k 1 -e 0.0001 -p 20) (Buchfink et al., 2015). Sequences matched to SCycDB were extracted to generate functional profiles of S cycling microbial communities. These sequences were subsequently used to generate taxonomic profiles of S cycling microbial communities at different taxonomic levels using kraken2 (Wood et al., 2019). One-way analysis of variance (ANOVA), performed with IBM SPSS 22 (SPSS Inc.), was used to compare the abundances of gene families and taxonomic groups among habitats.
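The read merging and database search steps can be sketched as command construction in Python. Only the options quoted in the text (-q 30 for pear; -k 1, -e 0.0001 and -p 20 for diamond) come from the study. The `blastx` subcommand, the input/output flags and all file names are assumptions added to make the example complete, so this should be read as an illustration rather than the exact commands used.

```python
def pear_command(forward, reverse, out_prefix):
    """Argument list for merging paired-end reads with pear."""
    return [
        "pear",
        "-f", forward,      # forward reads (hypothetical file name)
        "-r", reverse,      # reverse reads (hypothetical file name)
        "-o", out_prefix,   # output prefix (hypothetical)
        "-q", "30",         # quality option quoted in the text
    ]


def diamond_command(database, query, output):
    """Argument list for searching merged reads against a protein database.

    `blastx` (translated nucleotide search) is an assumption; the text
    reports only the options -k 1 -e 0.0001 -p 20.
    """
    return [
        "diamond", "blastx",
        "-d", database,     # DIAMOND-formatted database (hypothetical path)
        "-q", query,        # merged reads
        "-o", output,       # tabular output file
        "-k", "1",          # report only the best target per query
        "-e", "0.0001",     # e-value cut-off
        "-p", "20",         # number of threads
    ]


merge_cmd = pear_command("sample_1.fastq", "sample_2.fastq", "sample")
search_cmd = diamond_command("SCycDB.dmnd", "sample.assembled.fastq", "sample.tsv")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; building the commands as lists avoids shell quoting problems with unusual file names.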
3 RESULTS
3.1 Summary of gene families and pathways in SCycDB
The constructed SCycDB contains 585,055 sequences covering 207 gene families involved in eight key S cycling pathways, including assimilatory sulphate reduction, dissimilatory S reduction and oxidation, S reduction, SOX systems, S oxidation, S disproportionation, organic S transformation, and linkages between inorganic and organic S transformation; S compound transporters are also included as “others” (Table 1; Table S2, Figures S1–S3).
Pathways | Gene | Annotation | Core database sequences | Full database sequences | arCOG orthology groups | COG orthology groups | eggNOG orthology groups | KEGG orthology groups |
---|---|---|---|---|---|---|---|---|
Assimilatory sulphate reduction | cysC | Adenylylsulphate kinase | 13,037 | 20,875 | 39 | 51 | 1025 | 135 |
Assimilatory sulphate reduction | cysND | Sulphate adenylyltransferase | 18,468 | 29,706 | 19 | 38 | 593 | 91 |
Assimilatory sulphate reduction | cysH | Phosphoadenosine phosphosulphate reductase | 7962 | 14,572 | 10 | 10 | 234 | 32 |
Assimilatory sulphate reduction | cysIJ | Sulphite reductase | 8030 | 19,497 | 11 | 21 | 608 | 81 |
Assimilatory sulphate reduction | cysNC | Bifunctional enzyme CysN/CysC | 4029 | 6383 | 12 | 12 | 371 | 43 |
Assimilatory sulphate reduction | cysQ | 3′(2′),5′-bisphosphate nucleotidase | 9548 | 14,916 | 15 | 28 | 534 | 72 |
Assimilatory sulphate reduction | nrnA | Bifunctional oligoribonuclease and PAP phosphatase | 308 | 2972 | 2 | 3 | 37 | 10 |
Assimilatory sulphate reduction | sat | Sulphate adenylyltransferase | 4307 | 6017 | 8 | 11 | 270 | 47 |
Assimilatory sulphate reduction | sir | Sulphite reductase (ferredoxin) | 418 | 2517 | 1 | 3 | 95 | 8 |
Dissimilatory sulphur reduction and oxidation | aprAB | Adenylylsulphate reductase | 92 | 558 | 9 | 8 | 36 | 18 |
Dissimilatory sulphur reduction and oxidation | dsrAB | Dissimilatory sulphite reductase | 11,869 | 11,741 | 11 | 13 | 186 | 31 |
Dissimilatory sulphur reduction and oxidation | dsrC | Dissimilatory sulphite reductase related protein | 35 | 247 | 0 | 0 | 1 | 0 |
Dissimilatory sulphur reduction and oxidation | dsrDNT | Protein DsrD DsrN DsrT | 19 | 167 | 2 | NA | 3 | 3 |
Dissimilatory sulphur reduction and oxidation | dsrEFH | Sulfurtransferase | 63 | 423 | NA | 1 | 5 | 1 |
Dissimilatory sulphur reduction and oxidation | dsrL | NADPH:acceptor oxidoreductase DsrL | 10 | 43 | 3 | 6 | 19 | 3 |
Dissimilatory sulphur reduction and oxidation | dsrMKJOP | Membrane-bound DsrMKJOP complex | 65 | 626 | 2 | 4 | 24 | 11 |
Dissimilatory sulphur reduction and oxidation | qmoABC | Quinone-modifying oxidoreductase | 33 | 417 | 6 | NA | 23 | 2 |
Dissimilatory sulphur reduction and oxidation | rdsr | Reverse dissimilatory sulphite reductase | 81 | 115 | NA | NA | 6 | 2 |
Dissimilatory sulphur reduction and oxidation | sat | Sulphate adenylyltransferase | 4307 | 6017 | 8 | 11 | 270 | 47 |
Sulphur reduction | asrABC | Anaerobic sulphite reductase | 716 | 1833 | 11 | 16 | 35 | 15 |
Sulphur reduction | fsr | Sulphite reductase (coenzyme F420) | 3 | 26 | 4 | 4 | 9 | 3 |
Sulphur reduction | hydABDG | Sulphhydrogenase | 7 | 161 | 2 | 2 | 2 | 2 |
Sulphur reduction | mccA | Dissimilatory sulphite reductase | 6 | 15 | NA | 1 | 3 | NA |
Sulphur reduction | otr | Octaheme tetrathionate reductase Otr | 24 | 248 | 2 | 1 | 12 | 2 |
Sulphur reduction | psrABC | Polysulphide reductase | 6 | 172 | NA | 0 | 6 | 3 |
Sulphur reduction | rdlA | Putative rhodanese-like protein | 8 | 54 | NA | NA | 2 | NA |
Sulphur reduction | shyABCD | Sulphhydrogenase 2 | 12 | 313 | 0 | 2 | 2 | 2 |
Sulphur reduction | sreABC | Sulphur reductase | 12 | 66 | 0 | 1 | 3 | 1 |
Sulphur reduction | sudAB | Sulphide dehydrogenase | 119 | 2574 | 9 | 20 | 54 | 16 |
Sulphur reduction | ttrABC | Tetrathionate reductase | 1792 | 6084 | 18 | 17 | 175 | 39 |
SOX systems | soxAX | l-cysteine S-thiosulphotransferase | 1734 | 4500 | 5 | 8 | 248 | 34 |
SOX systems | soxB | S-sulfosulfanyl-l-cysteine sulphohydrolase | 1357 | 2068 | 2 | 3 | 64 | 21 |
SOX systems | soxCD | Sulphite dehydrogenase | 1069 | 3491 | 4 | 14 | 240 | 27 |
SOX systems | soxYZ | Sulphur-oxidizing protein SoxYZ | 2065 | 4939 | 3 | 10 | 142 | 26 |
Sulphur oxidation | doxAD | Thiosulphate dehydrogenase [quinone] | 3 | 31 | 0 | NA | 1 | 0 |
Sulphur oxidation | fccAB | Sulphide dehydrogenase | 75 | 962 | 1 | 1 | 36 | 6 |
Sulphur oxidation | glpE | Thiosulphate sulfurtransferase | 3727 | 8073 | 1 | 11 | 74 | 22 |
Sulphur oxidation | soeABC | Sulphite dehydrogenase | 6 | 617 | NA | 2 | 6 | 2 |
Sulphur oxidation | sorAB | Sulphite cytochrome c oxidoreductase | 82 | 918 | 1 | 4 | 20 | 8 |
Sulphur oxidation | sqr | Sulphide:quinone oxidoreductase | 89 | 559 | 0 | 2 | 1 | 1 |
Sulphur oxidation | sseA | Thiosulphate sulfurtransferase | 30 | 3165 | 0 | 1 | 17 | 5 |
Sulphur oxidation | tsdAB | Thiosulphate dehydrogenase | 21 | 1047 | NA | 0 | 6 | 2 |
Sulphur disproportionation | phsABC | Thiosulphate reductase | 506 | 1343 | 4 | 7 | 39 | 14 |
Sulphur disproportionation | tetH | Tetrathionate hydrolase TetH | 3 | 22 | NA | NA | 0 | NA |
Sulphur disproportionation | sor | Sulphur oxygenase/reductase | 9 | 29 | 0 | NA | 0 | 0 |
Organic sulphur transformation | acuI | Acrylyl-CoA reductase AcuI | 458 | 4301 | 1 | 7 | 121 | 29 |
Organic sulphur transformation | acuNK | Acrylyl-CoA transferase and hydratase | 4 | 63 | 3 | 3 | 9 | 13 |
Organic sulphur transformation | betAB | Betaine biosynthesis protein | 9452 | 32,160 | 9 | 28 | 398 | 116 |
Organic sulphur transformation | betC | Choline-sulphatase | 1689 | 4490 | NA | 2 | 75 | 12 |
Organic sulphur transformation | comABCDE | Coenzyme M biosynthesis protein | 1933 | 4377 | 10 | 10 | 221 | 27 |
Organic sulphur transformation | dddAC | 3-Hydroxypropionate dehydrogenase | 8 | 67 | NA | 2 | 6 | 3 |
Organic sulphur transformation | dddDKLPQWY | Dimethylsulphoniopropionate lyase | 48 | 441 | 1 | 1 | 8 | 4 |
Organic sulphur transformation | dddT | Betaine/carnitine/choline transporter | 2 | 103 | NA | NA | 7 | 2 |
Organic sulphur transformation | ddhABC | Dimethylsulphide dehydrogenase | 11 | 51 | 1 | 2 | 4 | 4 |
Organic sulphur transformation | dmdABCD | Dimethylsulphoniopropionate demethylation protein | 156 | 6234 | 2 | 6 | 34 | 21 |
Organic sulphur transformation | dmoA | Dimethyl-sulphide monooxygenase | 213 | 1900 | 2 | 2 | 36 | 9 |
Organic sulphur transformation | dmsABC | Anaerobic dimethyl sulphoxide reductase | 2670 | 16,676 | 27 | 29 | 245 | 67 |
Organic sulphur transformation | dsyB | DsyB | 94 | 64 | NA | NA | 20 | 3 |
Organic sulphur transformation | gdh | Glutamate dehydrogenase (NADP+) | 32 | 3931 | 0 | 1 | 10 | 2 |
Organic sulphur transformation | hpsN | Sulphopropanediol 3-dehydrogenase | 35 | 976 | 1 | NA | 6 | 3 |
Organic sulphur transformation | hpsOP | R or S-dihydroxypropanesulphonate-2-dehydrogenase | 4 | 122 | 1 | 2 | 4 | 11 |
Organic sulphur transformation | iseJ | Isethionate dehydrogenase | 3 | 6 | NA | 1 | 1 | 3 |
Organic sulphur transformation | isfD | Sulphoacetaldehyde reductase | 143 | 2726 | 2 | 6 | 22 | 18 |
Organic sulphur transformation | mddA | Methanethiol S-methyltransferase | 6 | 315 | NA | 0 | 2 | 0 |
Organic sulphur transformation | mdh | Malate dehydrogenase | 28,829 | 30,400 | 38 | 67 | 858 | 155 |
Organic sulphur transformation | mtsAB | Methylthiol:coenzyme M methyltransferase | 5 | 37 | 0 | NA | 2 | 0 |
Organic sulphur transformation | prpE | Propionate–CoA ligase | 1295 | 7177 | 5 | 4 | 93 | 21 |
Organic sulphur transformation | pta | Phosphate acetyltransferase | 6697 | 20,805 | 27 | 33 | 736 | 110 |
Organic sulphur transformation | sfnG | Dimethylsulphone monooxygenase | 10 | 721 | 1 | 0 | 4 | 1 |
Organic sulphur transformation | slcCD | Sulpholactate dehydrogenase | 57 | 3216 | 5 | 4 | 43 | 29 |
Organic sulphur transformation | sqdBDX | Sulpholipid sulphoquinovosyl diacylglycerol biosynthesis protein | 204 | 2045 | 1 | 6 | 28 | 13 |
Organic sulphur transformation | tauXY | Taurine dehydrogenase | 24 | 140 | NA | 2 | 10 | 4 |
Organic sulphur transformation | tmm | Trimethylamine monooxygenase | 24 | 382 | NA | NA | 2 | 1 |
Organic sulphur transformation | toa | Taurine:2-oxoglutarate transaminase | 3 | 226 | 2 | 2 | 5 | 3 |
Organic sulphur transformation | tpa | Taurine-pyruvate aminotransferase | 128 | 2263 | 5 | 4 | 17 | 21 |
Organic sulphur transformation | yihQ | Sulphoquinovosidase | 35 | 816 | NA | 0 | 2 | 1 |
Linkages between inorganic and organic sulphur transformation | cuyA | l-cysteate sulpho-lyase | 62 | 1116 | 1 | 0 | 8 | 2 |
Linkages between inorganic and organic sulphur transformation | cysEKMO | Cysteine biosynthesis protein | 25,547 | 62,541 | 57 | 90 | 1389 | 269 |
Linkages between inorganic and organic sulphur transformation | hdrABCDE | Heterodisulphide reductase | 122 | 1348 | 39 | 35 | 94 | 26 |
Linkages between inorganic and organic sulphur transformation | mccB | Cystathionine gamma-lyase/homocysteine desulphydrase | 176 | 1238 | NA | 7 | 49 | 22 |
Linkages between inorganic and organic sulphur transformation | metABCXYZ | l-Cystathionine biosynthesis protein | 16,131 | 52,563 | 29 | 39 | 1103 | 171 |
Linkages between inorganic and organic sulphur transformation | msmAB | Methanesulphonate monooxygenase | 9 | 36 | NA | 1 | 3 | 1 |
Linkages between inorganic and organic sulphur transformation | mtoX | Methanethiol oxidase | 11 | 27 | 1 | 1 | 3 | 1 |
Linkages between inorganic and organic sulphur transformation | ssuDE | Alkanesulphonate monooxygenase | 7663 | 15,664 | 4 | 8 | 290 | 44 |
Linkages between inorganic and organic sulphur transformation | suyAB | (2R)-sulpholactate sulpho-lyase | 138 | 1182 | 3 | 1 | 16 | 11 |
Linkages between inorganic and organic sulphur transformation | tauD | Taurine dioxygenase | 1129 | 6531 | 1 | 1 | 34 | 14 |
Linkages between inorganic and organic sulphur transformation | tbuBC | Toluene-3-monooxygenase | 7 | 90 | NA | 2 | 2 | 1 |
Linkages between inorganic and organic sulphur transformation | tmoCF | Toluene-4-monooxygenase | 10 | 140 | NA | 1 | 4 | 4 |
Linkages between inorganic and organic sulphur transformation | touCF | Toluene o-xylene monooxygenase | 5 | 14 | NA | 2 | 4 | 1 |
Linkages between inorganic and organic sulphur transformation | xsc | Sulphoacetaldehyde acetyltransferase | 1026 | 2130 | 1 | 6 | 99 | 16 |
Others | cuyZ | Sulphite exporter | 1 | 6 | NA | 1 | 5 | 1 |
Others | cysAPUWZ | Sulphate/thiosulphate transporter | 18,458 | 38,216 | 93 | 138 | 1628 | 415 |
Others | hpsKLM | Dihydroxypropanesulphonate transporter | 5 | 172 | NA | NA | 4 | NA |
Others | iseKLM | Isethionate TRAP transporter | 3 | 153 | NA | NA | NA | 1 |
Others | sbp | Sulphate-binding protein | 855 | 5602 | 2 | 0 | 71 | 7 |
Others | sgpABC | Sulphur globule protein | 10 | 288 | 2 | NA | 189 | 25 |
Others | soxL | Sulphur transferase, periplasm | 2 | 38 | NA | 1 | 1 | 1 |
Others | ssuABC | Sulphonate transport system | 9787 | 34,648 | 43 | 104 | 970 | 347 |
Others | sulP | Sulphate permease | 542 | 4727 | 31 | 16 | 729 | 82 |
Others | tauABC | Taurine transport system | 4332 | 13,826 | 20 | 57 | 270 | 143 |
Others | tauE | Sulphite/organosulphonate exporter | 1 | 32 | NA | NA | 3 | NA |
Others | tauZ | Membrane protein TauZ | 7 | 236 | NA | 1 | 14 | 2 |
Others | tusA | Sulphur carrier protein TusA | 3431 | 3907 | 2 | 6 | 39 | 14 |
Others | tusBCDE | tRNA 2-thiouridine synthesizing protein | 5641 | 10,234 | 7 | 9 | 132 | 24 |
Note: The gene families responsible for identical reactions were combined. More detailed information is provided in Table S2. NA: not detected in the database.
3.1.1 Assimilatory sulphate reduction
A total of 11 gene families with 117,455 representative sequences and 4580 homologous orthology groups are included for this pathway (Table 1; Figure S1A). Among these, the gene families cysD, cysN and sat participate in sulphate activation to adenosine 5′-phosphosulphate (APS), and cysC converts APS to 3′-phosphoadenosine 5′-phosphosulphate (PAPS). The gene family cysNC encodes the bifunctional enzyme CysN/CysC responsible for sulphate conversion to PAPS, cysH is responsible for PAPS reduction to sulphite, and cysI, cysJ and sir are responsible for sulphite reduction to sulphide.
3.1.2 Dissimilatory sulphur reduction and oxidation
Twenty-two gene families with 20,354 sequences and 775 homologous orthology groups are covered for dissimilatory S reduction and oxidation (Table 1; Figure S1B). The gene family sat participates in the conversion between sulphate and APS, and aprAB and qmoABC participate in the transformation between APS and sulphite. The dsr gene families are involved in both dissimilatory S reduction and oxidation, with some members (e.g., dsrAB, dsrC, dsrD, dsrEFH, dsrL, dsrMKJOP) responsible for the transformation between sulphite and sulphide.
3.1.3 SOX systems
Seven gene families, including soxA, soxB, soxC, soxD, soxX, soxY and soxZ, are involved in SOX systems for thiosulphate oxidation to sulphate (Table 1; Figure S1C). The SOX system genes encode SoxAX, SoxYZ, SoxB and SoxCD proteins. A total of 14,998 sequences and 851 homologous orthology groups are included in SCycDB.
3.1.4 Sulphur reduction
The S reduction pathway contains 26 gene families encoding sulphite reductase, tetrathionate reductase, S reductase and polysulphide reductase with a total of 11,546 representative sequences and 496 homologous orthology groups (Table 1; Figure S1D). Among these, asrABC, fsr and mccA are responsible for sulphite reduction to sulphide, otr and ttrABC for tetrathionate reduction to thiosulphate, sreABC and psrABC for elemental S reduction and polysulphide reduction, respectively, and hydABDG, shyABCD and sudAB for the reduction of both elemental S and polysulphide to sulphide.
3.1.5 Sulphur oxidation
A total of 14 gene families are involved in S oxidation pathways with a total of 15,372 representative sequences and 231 homologous orthology groups (Table 1; Figure S1D). The fccAB and sqr gene families participate in sulphide oxidation, doxAD, glpE, sseA and tsdAB in thiosulphate oxidation, and soeABC and sorAB in sulphite oxidation.
3.1.6 Sulphur disproportionation
Gene families such as phsABC, tetH and sor are included for this pathway with 1394 sequences and 64 homologous orthology groups (Table 1; Figure S1D). Among these, phsABC gene families encode thiosulphate reductase responsible for the transformation of thiosulphate to sulphite and sulphide, tetH for the disproportionation of tetrathionate to elemental S, thiosulphate and sulphate, and sor for the conversion of elemental S to sulphite and sulphide.
3.1.7 Organic sulphur transformation
There are 57 gene families involved in organic S transformation with a total of 147,231 sequences and 4103 homologous orthology groups (Table 1; Figure S2). Among these, the gene family dsyB encodes a methyltransferase, a key enzyme for DMSP biosynthesis. DMSP is degraded through two pathways: the cleavage pathway, in which dddDKLPQWY encode DMSP lyases that convert DMSP to DMS and acrylate, and the demethylation pathway, in which dmdABCD convert DMSP to methylmercaptopropionate (MMPA), further generating methanethiol (MeSH) and acetaldehyde (Curson et al., 2011; Moran & Durham, 2019; Moran et al., 2012). The acuI, acuNK and prpE gene families participate in acrylate utilization and detoxification, while the dmsABC, ddhABC and tmm families are involved in the transformation between DMS and dimethyl sulphoxide (DMSO) (Bilous et al., 1988; Lidbury et al., 2016; Wang et al., 2017). Other diverse organic S compounds, such as sulpholipid, sulphonate and sulphate ester, are also involved in organic S metabolism. Two enzymes encoded by sqdB and sqdDX are related to the biosynthesis of sulpholipid sulphoquinovosyl diacylglycerides (SQDG), while sulphoquinovosidase encoded by yihQ subsequently converts SQDG to sulphoquinovose (SQ) (Moran & Durham, 2019; Speciale et al., 2016). The tauXY, toa, tpa and iseJ gene families are responsible for C2 sulphonate (taurine, isethionate) conversion to sulphoacetaldehyde, and xsc and pta for the transformation of sulphoacetaldehyde to acetyl-CoA (Durham et al., 2019; Landa et al., 2019). The hpsOPN and slcCD gene families are related to the transformation of the C3 sulphonate DHPS, and betABC is associated with the utilization of the sulphate ester choline-O-sulphate (Landa et al., 2019).
3.1.8 Linkages between inorganic and organic sulphur transformation
There are 35 gene families responsible for linking inorganic and organic S transformation with a total of 144,620 sequences and 4011 homologous orthology groups (Table 1; Figure S3). The hdrABCDE gene families encode a heterodisulphide reductase-like system, which links DMS oxidation with thiosulphate reduction (Koch & Dahl, 2018). Gene families including cuyA, msmAB, ssuDE, suyAB, tbuBC, tmoCF, touCF and xsc link the transformation between organic S compounds (such as alkanesulphonate, l-cysteate, methanesulphonate, sulpholactate, sulphoacetaldehyde and taurine) and sulphite, with other gene families, namely cysEKMO, mccB, metABCXYZ and mtoX, linking the transformation between organic S compounds (such as l-cysteine, l-homocysteine, l-serine and MeSH) and sulphide (Byrne et al., 1995; Landa et al., 2019; Moran & Durham, 2019; Wasmund et al., 2017).
3.1.9 Others
Thirty-one gene families encoding various transporters for sulphate, sulphite, thiosulphate and organic S compounds are also included in SCycDB with a total of 112,085 sequences and 5650 homologous orthology groups (Table 1).
3.2 A comparison of gene families detected by SCycDB and other orthology databases
To evaluate the coverage of S cycling gene families in SCycDB, the developed SCycDB was compared with other publicly available orthology databases including arCOG, COG, eggNOG and KEGG. Several critical issues affecting accurate functional assignments in metagenomics were noted. First, there are 207 gene families in SCycDB, while only 62, 130, 138 and 152 gene families are included in the arCOG, COG, eggNOG and KEGG orthology databases, respectively (Figure S4). Second, several key S cycling gene families are included in SCycDB but missing in these four public orthology databases, such as gene families for dissimilatory S reduction and oxidation (dsrMKJOP), sulphur reduction (mccA, otr, rdlA), sulphur oxidation (sorAB), sulphur disproportionation (tetH), organic sulphur transformation (dddAC, dddKQWY, dsyB) and others (Figure 2). Third, in the four public orthology databases, many different gene families defined by SCycDB were merged into one orthologous group; conversely, single gene families with distinct classification in SCycDB could be correctly found in multiple orthologous groups (Table S3). For instance, dsrAB, asrC and fsr, which belong to different S reduction pathways, are merged into one orthology group in COG and eggNOG (Table S3). Similarly, phsA and psrA are not clearly distinguished in COG, eggNOG and KEGG (Table S3), whereas they are phylogenetically separated in SCycDB (Figure S5). Therefore, SCycDB, specifically designed to target gene families involved in S metabolism, has advantages over existing orthology databases in terms of coverage, representativeness and accuracy.
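The third issue, gene families merged into one orthology group or split across several, can be checked mechanically once gene-family-to-group assignments are available. The sketch below assumes such assignments are given as (gene family, orthology group) pairs. The function name and the group identifiers are hypothetical placeholders, not real accessions, and the split example for soxB is invented for illustration.

```python
from collections import defaultdict


def group_conflicts(assignments):
    """Find merged orthology groups and split gene families.

    `assignments` is an iterable of (gene_family, orthology_group) pairs.
    Returns (merged, split): groups holding more than one gene family, and
    gene families assigned to more than one group.
    """
    families_in_group = defaultdict(set)
    groups_of_family = defaultdict(set)
    for family, group in assignments:
        families_in_group[group].add(family)
        groups_of_family[family].add(group)
    merged = {g: sorted(f) for g, f in families_in_group.items() if len(f) > 1}
    split = {f: sorted(g) for f, g in groups_of_family.items() if len(g) > 1}
    return merged, split


# Hypothetical toy assignments; OG_1 to OG_4 are placeholder identifiers.
toy_assignments = [
    ("dsrAB", "OG_1"), ("asrC", "OG_1"), ("fsr", "OG_1"),
    ("phsA", "OG_2"), ("psrA", "OG_2"),
    ("soxB", "OG_3"), ("soxB", "OG_4"),
]
merged, split = group_conflicts(toy_assignments)
```

In this toy input, `merged` flags the two groups that cannot separate functionally distinct gene families, and `split` flags the gene family whose sequences are scattered over two groups.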

3.3 Taxonomic composition of S cycling genes and pathways in SCycDB
To understand the taxonomic composition of S cycling genes and pathways in SCycDB, we mapped sequences targeting S cycling genes and pathways to their affiliated reference genomes from the NCBI RefSeq. In total, the developed SCycDB covers 47 phyla, 82 classes, 197 orders, 461 families and 2562 genera of bacteria, and five phyla, 12 classes, 22 orders, 37 families and 122 genera of archaea (Table 2). For bacteria, Proteobacteria (this phylum covers 91.3% of the genes), Firmicutes (67.6%), Actinobacteria (62.8%) and Bacteroidetes (44.0%) are the dominant phyla, with Pseudomonas (this genus covers 51.7% of the genes), Escherichia (45.9%), Bacillus (45.4%) and Vibrio (36.7%) representing the dominant genera in SCycDB (Table S4). Further analysis shows that organic S transformation has the highest bacterial coverage, with 42 phyla and 2289 genera, followed by the linkages between inorganic and organic S transformation (40 phyla and 2204 genera) and assimilatory sulphate reduction (40 phyla and 2059 genera) (Table 2). For archaea, Euryarchaeota, Crenarchaeota, Thaumarchaeota, Candidatus Bathyarchaeota and Candidatus Korarchaeota are the dominant phyla in SCycDB (Table S4). At the genus level, archaea involved in organic S transformation show the highest diversity (84 genera), followed by assimilatory sulphate reduction (81 genera) and linkages between inorganic and organic S transformation (76 genera) (Table 2). These results indicate that SCycDB covers a high diversity of microorganisms participating in the S cycle, providing a useful platform for the search and annotation of S cycling genes, pathways and associated key microorganisms in the environment.
Pathway | Phylum (Archaea) | Phylum (Bacteria) | Class (Archaea) | Class (Bacteria) | Order (Archaea) | Order (Bacteria) | Family (Archaea) | Family (Bacteria) | Genus (Archaea) | Genus (Bacteria) |
---|---|---|---|---|---|---|---|---|---|---|
Assimilatory sulphate reduction | 5 | 40 | 9 | 74 | 17 | 179 | 25 | 417 | 81 | 2059 |
Dissimilatory sulphur reduction and oxidation | 4 | 29 | 8 | 52 | 16 | 117 | 21 | 235 | 49 | 657 |
Sulphur reduction | 3 | 23 | 7 | 44 | 11 | 88 | 19 | 173 | 27 | 392 |
SOX systems | 1 | 19 | 4 | 35 | 6 | 91 | 8 | 192 | 12 | 683 |
Sulphur oxidation | 2 | 22 | 3 | 34 | 4 | 79 | 7 | 167 | 8 | 506 |
Sulphur disproportionation | 2 | 4 | 2 | 11 | 2 | 24 | 2 | 37 | 3 | 53 |
Organic sulphur transformation | 5 | 42 | 12 | 73 | 21 | 182 | 32 | 437 | 84 | 2289 |
Linkages between inorganic and organic sulphur transformation | 5 | 40 | 11 | 74 | 19 | 177 | 27 | 405 | 76 | 2204 |
Others | 5 | 33 | 9 | 66 | 13 | 163 | 19 | 378 | 52 | 1746 |
Total | 5 | 47 | 12 | 82 | 22 | 197 | 37 | 461 | 122 | 2562 |
NCBI RefSeq | 25 | 156 | 18 | 98 | 29 | 224 | 49 | 530 | 194 | 3815 |
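The totals in Table 2 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the SCycDB scripts: it copies the "Total" and "NCBI RefSeq" rows from the table, sums archaea and bacteria per rank, and expresses SCycDB coverage as a fraction of the RefSeq counts. Treating the "NCBI RefSeq" row as the denominator is our reading of the table, and the variable and function names are illustrative only.

```python
# (archaea, bacteria) counts per rank, copied from the "Total" and
# "NCBI RefSeq" rows of Table 2.
SCYCDB_TOTAL = {
    "phylum": (5, 47),
    "class": (12, 82),
    "order": (22, 197),
    "family": (37, 461),
    "genus": (122, 2562),
}
REFSEQ = {
    "phylum": (25, 156),
    "class": (18, 98),
    "order": (29, 224),
    "family": (49, 530),
    "genus": (194, 3815),
}

def combined_counts(table):
    """Sum archaeal and bacterial counts for each taxonomic rank."""
    return {rank: archaea + bacteria for rank, (archaea, bacteria) in table.items()}

def coverage_fraction(rank, domain_index):
    """Fraction of RefSeq taxa at `rank` covered by SCycDB.

    domain_index: 0 for archaea, 1 for bacteria (assumed column order).
    """
    return SCYCDB_TOTAL[rank][domain_index] / REFSEQ[rank][domain_index]

combined = combined_counts(SCYCDB_TOTAL)
bacterial_genus_coverage = coverage_fraction("genus", 1)
archaeal_genus_coverage = coverage_fraction("genus", 0)

print(combined["phylum"], combined["genus"])
print(f"{bacterial_genus_coverage:.1%} of bacterial genera")
print(f"{archaeal_genus_coverage:.1%} of archaeal genera")
```

The combined phylum and genus counts reproduce the 52 phyla and 2684 genera quoted in the Abstract.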
3.4 Application of SCycDB for functional and taxonomic profiling of environmental samples
We applied SCycDB and four other orthology databases (arCOG, COG, eggNOG and KEGG) to profile S cycling microbial communities from four habitats: freshwater, hot spring, marine sediment and soil (Figures 3 and 4; Figure S6). The number of S cycling gene families detected by searching against SCycDB ranged from 174 to 188 in the four habitats, which was significantly (p < .05) greater than the number detected with any of the other four databases (55–58 in arCOG, 125–128 in COG, 129–134 in eggNOG, 120–135 in KEGG). Notably, the run-time with SCycDB (418–2264 s) was much shorter than with eggNOG (3625–17,749 s) and KEGG (2243–11,161 s) (Table S5).


SCycDB profiles of S cycling microbial communities showed that the overall functional and taxonomic composition differed significantly (p < .05) among the four habitats profiled in this study (Table S6, Figure S7). Functional profiling of the microbial communities showed that S cycling functional genes and pathways were differentially enriched in different habitats (Figure 3). For example, the soil habitat exhibited the highest abundance of gene families involved in SOX systems (soxAX, soxCD, soxYZ) and S oxidation (sorAB, tsdAB), as well as DMSP biosynthesis and degradation (dsyB, dddDKLPQWY, dmdABCD, acuNK), whereas marine sediment had particularly high abundances of S reduction gene families (asrABC, shyABCD and ttrABC) and DMS transformation genes (ddhABC, dmsABC) (Figure 3). Taxonomic profiling of S cycling microbial communities showed that Proteobacteria was the dominant phylum of S cycling microbial communities in all four habitats (Figure S8), which is consistent with the copious representation of Proteobacteria in SCycDB (Table S4). At the genus level, the abundance of Desulfallas, Desulfobacter, Desulfococcus, Desulfomonile, Desulfotomaculum and Syntrophobacter for dissimilatory sulphate reduction, S reduction and disproportionation was higher in marine sediment than in the other three habitats (Figure 4). In contrast, the soil habitat showed high abundances of Halomonas, Pseudomonas, Rhodobacter, Roseobacter, Roseovarius, Ruegeria, Sagittula and Sulfitobacter, which are related to DMSP production and degradation (Figure 4). The above results show that SCycDB is a powerful tool to facilitate analysis of shotgun metagenome sequencing data, enabled by the capacity for fast, comprehensive and accurate functional and taxonomic profiling of S cycling microbial communities in various environments.
4 DISCUSSION
The S cycle is an important biogeochemical process largely driven by microorganisms, impacting the cycling of C and N as well as global change (Curson et al., 2011; Landa et al., 2019; Muyzer & Stams, 2008; Wasmund et al., 2017). Characterizing the function and taxonomy of microbial processes involved in S cycling is critical in providing a better understanding of the diversity of S cycling microbial populations and specific impacts on the environment. Here, we developed SCycDB for fast and accurate functional and taxonomic profiling of S cycling microbial communities, and subsequently applied this database for metagenome sequencing data analysis. The results demonstrate that SCycDB is a useful tool to profile S cycling microbial communities from different environments. To our knowledge, this is the first comprehensive, specific database for analysing both functional and taxonomic profiling of S cycling microbial communities.
Manually curated databases are of vital importance for improving the reliability and reproducibility of bioinformatics analysis of metagenomic data (Kanehisa et al., 2016; The Uniprot Consortium, 2017). Automatically generated orthology databases, including arCOG, COG, eggNOG and KEGG, cover 62–152 gene families involved in microbial S cycling (Galperin et al., 2015; Huerta-Cepas et al., 2019). In comparison, SCycDB is much more comprehensive, covering 207 gene families with 585,055 representative sequences. The gene families included in SCycDB are retrieved manually based on publicly available databases and the most up-to-date knowledge of S cycling. For instance, SCycDB covers gene families otherwise not included in existing databases, such as those involved in DMSP synthesis (dsyB) (Curson et al., 2017), acrylate utilization and detoxification (acuNK and dddAC) (Wang et al., 2017), and DMSP cleavage (dddKQTWY) (Li et al., 2017; Moran & Durham, 2019; Peng et al., 2019), enabling researchers to study these newly discovered gene families and metabolic pathways. These gene families have no clearly defined orthology groups in other publicly available databases, but play important roles in regulating marine S cycling and mediating the climate-active gas DMS (Moran & Durham, 2019). Also, SCycDB includes not only commonly known gene families including dsrAB, dsrC and dsrEFH, but also other poorly known dsr gene families (e.g., dsrMKJOP, dsrL, dsrN and dsrT) for dissimilatory S reduction and oxidation (Hausmann et al., 2018; Löffler et al., 2020; Pires et al., 2006). In addition, to facilitate both functional annotations and taxonomic assignments that require more accurate sequences with taxonomic information (Tu et al., 2019), the NCBI RefSeq database has been integrated into SCycDB to increase the coverage of functional gene sequences and their associated taxonomic information. Therefore, SCycDB provides the much-desired ability to explore questions of “who is there” and “what they are doing” in microbial ecology.
Accuracy is critical in metagenome sequencing data analysis, which is largely dependent on reference databases (Quince et al., 2017). SCycDB ensures annotation accuracy in three major respects. First, gene families and annotations have one-to-one corresponding relationships. As automatically generated orthology databases identify orthologous groups based on species-aware clustering algorithms (Huerta-Cepas et al., 2019), they cannot clearly distinguish different homologous genes. For example, the gene families psrA and phsA, which encode a polysulphide reductase and a thiosulphate reductase subunit, respectively, are highly homologous, and thus they are always mis-annotated as a single orthology group in automatically generated orthology databases. In SCycDB, we have carefully looked into this issue, and manually separated them into two orthology groups. Second, SCycDB reduces potential mis-annotations, which may occur in automatically generated orthology databases. For example, the cysC gene sequences are generally grouped with sat sequences, resulting in the possibility of mis-annotations. Such occurrences are not uncommon, as found in cysP vs. sbp, metB vs. mccB, and sreA vs. soeA. Particularly problematic is the observation that a sequence may be assigned to more than one orthologous group. Therefore, we have manually checked those sequences and carefully assigned them to the correct gene groups to reduce possible mis-annotations in SCycDB. Third, several databases for profiling specific gene families, such as ARDB (for antibiotic resistance genes) and NCycDB (for N cycling genes), were recently developed (Liu & Pop, 2009; Tu et al., 2019). False positives could be an issue arising from the relatively small size of these specialized databases (Tu et al., 2019). To solve such a “small database” issue, SCycDB deliberately includes S cycling-related homologous orthology groups identified from multiple publicly available orthology databases. Therefore, the accuracy of annotation has been considerably enhanced with the implementation of these features.
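One way to picture how homologous orthology groups can reduce false positives is the best-hit filter sketched below. This is a minimal illustration under stated assumptions, not the authors' scripts: we assume each read has already been reduced to its single best database hit, and that a read whose best hit is a homolog sequence is simply not counted. The reference IDs, the `SEQ_TO_FAMILY` map and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical reference-ID -> gene family map. None marks a sequence that
# belongs to a homologous orthology group, kept in the database only so that
# look-alike reads have somewhere better to land than a true S cycling family.
SEQ_TO_FAMILY = {
    "ref_psrA_1": "psrA",
    "ref_phsA_1": "phsA",
    "ref_homolog_1": None,
}

def count_gene_families(best_hits, seq_to_family):
    """Count reads per gene family from (read_id, subject_id) best hits.

    Reads whose best hit is a homolog sequence (mapped to None) or an
    unknown subject ID are skipped rather than assigned to a family.
    """
    counts = Counter()
    for _read_id, subject_id in best_hits:
        family = seq_to_family.get(subject_id)
        if family is None:
            continue
        counts[family] += 1
    return counts

# Toy input: four reads, one of which matches a homolog best.
best_hits = [
    ("read1", "ref_psrA_1"),
    ("read2", "ref_phsA_1"),
    ("read3", "ref_homolog_1"),
    ("read4", "ref_psrA_1"),
]
counts = count_gene_families(best_hits, SEQ_TO_FAMILY)
print(dict(counts))
```

Without the homolog entry in the database, `read3` would have been forced onto its next-best match, which could be a genuine S cycling family; that is the "small database" effect described above.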
Unlike other orthology databases, SCycDB is designed specifically to profile S cycling microbial communities, resulting in fast annotation of functional genes, pathways and taxonomy. As shotgun metagenome sequencing data increase exponentially, fast processing of metagenome data sets is critical for metagenomic studies (Kim et al., 2013; Scholz et al., 2012; Wood & Salzberg, 2014; Zhou et al., 2015). A study of the taxonomic classifier MetaPhyler showed that it was much faster than other tools (PhymmBL, MEGAN, WebCarma) because its reference database was smaller than a general reference database (Liu et al., 2011). Also, the specific database NCycDB provides a fast profiling platform to identify N cycling gene families (Tu et al., 2019). In our study, we processed 370 G metagenome data sets on 20 CPU threads, resulting in run times of ~8, 66 and 42 h for SCycDB, eggNOG and KEGG, respectively. Therefore, SCycDB is a much faster database for the annotation of S cycling microbial communities in metagenomic studies.
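The relative speed-up implied by these run times is simple arithmetic. The sketch below uses only the approximate totals quoted above (~8, 66 and 42 h), so the resulting ratios are approximate as well; the dictionary and variable names are illustrative.

```python
# Approximate total run times in hours, as quoted in the text.
RUN_TIME_H = {"SCycDB": 8, "eggNOG": 66, "KEGG": 42}

# Speed-up of SCycDB relative to each general-purpose database.
speedup = {
    name: hours / RUN_TIME_H["SCycDB"]
    for name, hours in RUN_TIME_H.items()
    if name != "SCycDB"
}

for name, factor in speedup.items():
    print(f"SCycDB is roughly {factor:.1f}x faster than {name}")
```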
Functional and taxonomic profiles are important objectives in shotgun metagenome sequencing data analysis to understand microbial communities from different environments (Knight et al., 2018; Quince et al., 2017). Accurate functional profiling requires comprehensive sequence databases for specific metabolic pathways, which are frequently unavailable. Using S metabolism as an example, several previous metagenomic studies only focused on inorganic S cycling, especially dissimilatory sulphate reduction (Baker et al., 2015; Hausmann et al., 2018; Vavourakis et al., 2019), probably due to a lack of organic S cycling gene families in the reference database. In this study, we included organic S cycling in SCycDB, and used it to analyse functional and taxonomic profiles of S cycling microbial communities from four types of environments, providing a full picture of microbial communities in natural ecosystems. Our results revealed a high diversity of S cycling gene families (154–193 gene families) and microorganisms (32–43 phyla and 692–1340 genera) in natural environments, especially for organic S transformation microbial communities. Also, we found significant variations in functional and taxonomic composition and structure of S cycling microbial communities among different environments. For instance, higher abundances of S reduction gene families and microorganisms in marine sediments were observed, probably linked to the importance of anaerobic respiration with S compounds as electron acceptors in the marine sediment (Jørgensen et al., 2019; Wasmund et al., 2017). Gene families and microorganisms involved in DMSP and DMS transformation were detected in all four environments, supporting the universal distribution of DMSP and DMS metabolism (Curson et al., 2011, 2017; Moran & Durham, 2019). Indeed, DMSP accounts for 10% of fixed carbon in marine environments and DMS plays an important role in S exchanges between the ocean and atmosphere (Curson et al., 2011; Landa et al., 2019; Todd et al., 2010). Consistently, the abundances of DMS transformation gene families were higher in marine sediment than in the other three environments. However, we identified a high abundance of DMSP biosynthesis and degradation gene families as well as associated microorganisms in the soil habitat, suggesting that DMSP transformation may also be an important process in soil. Therefore, these results demonstrate the vast diversity and importance of microbial S metabolisms in the environment that remain to be explored, which will be greatly facilitated by SCycDB developed in this study.
In summary, SCycDB is a manually curated, comprehensive database for fast and accurate functional and taxonomic analysis of S cycling microbial communities with shotgun metagenome sequencing data. By integrating multiple publicly available databases, the current SCycDB contains 207 gene families and 585,055 representative sequences as well as 20,761 homologous orthology groups to resolve the “small database” issue. Applied to profile S cycling microbial communities from various environments, SCycDB has demonstrated its utility for exploring the S cycling process and associated microbial communities in the environment. The SCycDB developed here provides a comprehensive and fast metagenomic analysis tool specialized for studying S metabolisms that will be periodically updated.
ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (Grant Nos. 91951207, 31770539, 31971446, 31700427, 92051110), National Key Research and Development Program of China (Grant Nos. 2019YFA0606700, 2020YFA0607600, 2017YFA0604300) and Natural Science Foundation of Shandong Province (Grant No. ZR201911110287).
CONFLICT OF INTEREST
The authors declare that they have no known competing interests.
AUTHOR CONTRIBUTIONS
Q.T. and Z.H. designed the database structure. X.Y., J.Z., W.S. and M.X. searched and manually collected the sequences. Q.T. wrote the scripts for database construction. X.Y. constructed the database and drafted the manuscript. Q.H., Y.P., Y.T., C.W., L.S., S.W., Q.Y., J.L., Q.T. and Z.H. revised the manuscript. All authors read and approved the final manuscript.
Open Research
DATA AVAILABILITY STATEMENT
SCycDB database files are available at https://github.com/qichao1984/SCycDB.