Volume 21, Issue 3 pp. 924-940
RESOURCE ARTICLE

SCycDB: A curated functional gene database for metagenomic profiling of sulphur cycling pathways

Xiaoli Yu

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Jiayin Zhou

Institute of Marine Science and Technology, Shandong University, Qingdao, China

Wen Song

Institute of Marine Science and Technology, Shandong University, Qingdao, China

Mengzhao Xu

Institute of Marine Science and Technology, Shandong University, Qingdao, China

Qiang He

Department of Civil and Environmental Engineering, The University of Tennessee, Knoxville, TN, USA

Yisheng Peng

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Yun Tian

Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, School of Life Sciences, Xiamen University, Xiamen, China

Cheng Wang

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Longfei Shu

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Shanquan Wang

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Qingyun Yan

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

Jihua Liu

Institute of Marine Science and Technology, Shandong University, Qingdao, China

Qichao Tu

Corresponding Author

Institute of Marine Science and Technology, Shandong University, Qingdao, China

Zhili He

Corresponding Author

Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China

College of Agronomy, Hunan Agricultural University, Changsha, China

Correspondence

Zhili He, Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Sun Yat-sen University, Guangzhou, China.

Email: [email protected]

Qichao Tu, Institute of Marine Science and Technology, Shandong University, Qingdao, China

Email: [email protected]

First published: 07 December 2020

Abstract

Microorganisms play important roles in the biogeochemical cycling of sulphur (S), an essential element in the Earth's biosphere. Shotgun metagenome sequencing has opened a new avenue to advance our understanding of S cycling microbial communities. However, accurate metagenomic profiling of S cycling microbial communities remains technically challenging, mainly due to low coverage and inaccurate definition of S cycling gene families in public orthology databases. Here we developed a manually curated S cycling database (SCycDB) to profile S cycling functional genes and taxonomic groups for shotgun metagenomes. The developed SCycDB contains 207 gene families and 585,055 representative sequences affiliated with 52 phyla and 2684 genera of bacteria/archaea, and 20,761 homologous orthology groups were also included to reduce false positive sequence assignments. SCycDB was applied for functional and taxonomic analysis of S cycling microbial communities from four habitats (freshwater, hot spring, marine sediment and soil). Gene families and microorganisms involved in S reduction were abundant in the marine sediment, while those of S oxidation and dimethylsulphoniopropionate transformation were abundant in the soil. SCycDB is expected to be a useful tool for fast and accurate metagenomic analysis of S cycling microbial communities in the environment.

1 INTRODUCTION

Sulphur (S) is an essential component of important biomolecules such as amino acids, vitamins and enzymes. S cycling is an important biogeochemical process in the Earth's biosphere (Fike et al., 2015; Moran & Durham, 2019; Muyzer & Stams, 2008; Wasmund et al., 2017), and is usually coupled with carbon (C), nitrogen (N) and metal cycling in natural ecosystems (Buongiorno et al., 2019; Landa et al., 2019; Zhu et al., 2018). Microorganisms play important roles in the biogeochemical cycling of S compounds, which are present in a large variety of chemical forms and redox states (Wasmund et al., 2017). S is abundant and actively metabolized in diverse environments, such as marine sediments, hot springs, peatlands and coastal sediments (Baker et al., 2015; Hausmann et al., 2018; Lin et al., 2015; Wasmund et al., 2017). Characterizing the function and taxonomy of S cycling microbial communities is therefore of critical importance for understanding microbially mediated S cycling processes and their regulatory mechanisms in the environment.

The S cycle consists of inorganic and organic S transformations. In inorganic S transformation, assimilatory sulphate reduction and dissimilatory sulphate reduction processes as well as their key functional genes such as sat, aprAB and dsrAB have been well studied (Müller et al., 2015), while other inorganic S forms, such as thiosulphate, tetrathionate, polysulphide and elemental S, need further clarification in terms of functional genes, pathways and associated microorganisms involved in biotransformation. Organic S transformation and the linkages between inorganic and organic S transformations are also important in the S cycle. As one of the most abundant organosulphur compounds in marine ecosystems, dimethylsulphoniopropionate (DMSP) is mainly produced by phytoplankton and degraded by cleavage and demethylation pathways, subsequently resulting in the generation of dimethyl sulphide (DMS), a climate-active gas, which may influence global warming (Curson et al., 2011, 2018; Li et al., 2014; Moran et al., 2012). In addition, a previous study found that inorganic S oxidation was linked to the biodegradation of volatile organosulphur compounds via hdr-like genes (Koch & Dahl, 2018), indicating the importance of linkages between inorganic and organic S transformations in S cycling. However, microbially mediated S cycling is complex in the environment, and much remains to be learned regarding the genes and pathways, especially for organic S transformation and linkages between inorganic and organic S transformation. Thus, it is critical to develop capabilities for the rapid and accurate analysis of S cycling microbial communities via advanced technologies.

Recently, high-throughput amplicon sequencing of functional genes, such as dsrA and dsrB, has expanded our knowledge on the diversity and composition of sulphite-/sulphate-reducing microorganisms (Pelikan et al., 2016; Vigneron et al., 2018). For example, amplicon sequencing analysis of dsrB genes revealed that the diversity of sulphate reducers could be underestimated, with approximately one-third of detected genes as uncharacterized lineages (Vigneron et al., 2018). However, as universal primers are not available for many S cycling genes, characterization of S cycling gene families and pathways as well as associated microorganisms cannot be resolved by amplicon sequencing approaches. The development of shotgun metagenome sequencing approaches has provided new insights into our understanding of biogeochemical cycling in natural ecosystems (Knight et al., 2018; Nayfach & Pollard, 2016; Quince et al., 2017; Sharpton, 2014). For metagenome sequencing data analysis, comprehensive and reliable orthology databases are of critical importance for accurate metagenomic profiling of functional gene families. An undesired observation is that results of metagenomic analysis are substantially affected by the selection of orthology databases (Nayfach & Pollard, 2016).

To date, several orthology databases such as arCOG (Archaeal Clusters of Orthologous Genes) (Makarova et al., 2015), COG (Clusters of Orthologous Groups) (Galperin et al., 2015), eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) (Huerta-Cepas et al., 2019) and KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al., 2016) have been developed and widely used for functional annotation in both genomic and metagenomic studies. These databases have their own distinct features due to differences in the design concept, with arCOG for archaeal annotation (Makarova et al., 2015), COG and eggNOG for annotation of orthologous groups (Galperin et al., 2015; Huerta-Cepas et al., 2019), and KEGG for linking genes with pathways (Kanehisa et al., 2016). These databases present several limitations for analysis of S cycling microbial communities, such as low coverage of S cycling genes (Baker et al., 2015; Vavourakis et al., 2019), difficulties in distinguishing homologous genes (e.g., sat vs. cysC, sbp vs. cysP, or psrA vs. phsA) (Marietou et al., 2018; Rückert, 2016; Wasmund et al., 2017), and long database searching time (Kim et al., 2013; Scholz et al., 2012). Recently, a specific “small database” NCycDB was developed to facilitate shotgun metagenome sequencing data analysis of nitrogen cycling gene families (Tu et al., 2019). NCycDB has been applied to profile N cycling microbial communities from various environments (Anwar et al., 2019; Zhang et al., 2020), demonstrating its high coverage, accuracy and efficiency. Therefore, it is essential to develop a comprehensive and accurate database for fast functional and taxonomic analysis of S cycling microbial communities in metagenomic studies.

In the present study, to understand the microbial ecology of the S cycle in the environment, we present a curated sulphur cycling database (SCycDB) containing 207 S cycling gene families and associated homologous groups involved in eight pathways, including assimilatory sulphate reduction, dissimilatory S reduction and oxidation, S reduction, SOX systems, S oxidation, S disproportionation, organic S transformation, and the linkages between inorganic and organic S transformation. By integrating multiple orthology databases, SCycDB is characterized by high specificity, comprehensiveness, representativeness and accuracy for rapid profiling of S cycling microbial communities. SCycDB was applied to functionally and taxonomically analyse metagenome sequencing data from four different habitats (freshwater, hot spring, marine sediment and soil). The results demonstrate that SCycDB is a powerful tool for rapid and accurate profiling of S cycling microbial communities, and can be widely used to analyse microbially mediated S cycling processes and underlying mechanisms in the environment.

2 METHODS

2.1 Database construction

An improved pipeline built upon a previous study was used to construct SCycDB (Tu et al., 2019) (Figure 1). First, a core database was manually constructed based on current knowledge of and literature on S cycling processes (Hausmann et al., 2018; Moran & Durham, 2019; Rückert, 2016; Vavourakis et al., 2019; Wasmund et al., 2017). S metabolism pathways in KEGG were also referenced (Kanehisa et al., 2016). By creating and refining keywords for each gene family involved in S cycling, seed sequences for each gene family were downloaded from the Swiss-Prot database (The Uniprot Consortium, 2017). For gene families without reference sequences in Swiss-Prot, manually checked high-quality sequences were downloaded from TrEMBL (The Uniprot Consortium, 2017). To ensure the accuracy of SCycDB, seed sequences for each gene family were manually checked based on their annotation and similarities with other sequences, especially for those without reference sequences in Swiss-Prot. Sequences downloaded from TrEMBL with the same keywords sharing ≥30% identity with seed sequences were merged with seed sequences, forming the core database (Figure 1a).

Second, sequences belonging to S cycling gene families and their orthologues in public databases were identified and merged with the core database, forming the full database (Figure 1b). Publicly available orthology databases including arCOG, COG, eggNOG and KEGG were recruited and searched against the core database. Gene families involved in S cycling and their homologues were identified. Corresponding sequences were extracted and included in SCycDB. By doing so, the comprehensiveness of SCycDB was expected to improve, while the “small database” issue that may lead to increased false positive assignments was expected to diminish or be eliminated (Tu et al., 2019). In addition, corresponding sequences (S cycling gene families and homologues) in the NCBI archaea and bacteria RefSeq databases were also identified, extracted and merged. Taxonomic coverage of S cycling genes and pathways in SCycDB was summarized from corresponding sequences in the NCBI RefSeq. Sequences of both S cycling gene families and homologous gene families were clustered by cd-hit (Fu et al., 2012) at 100% identity. All representative sequences and related information were checked and used to construct SCycDB.

Finally, we included PERL scripts with three candidate database searching tools (usearch, blast and diamond) for both functional and taxonomic profiling of shotgun metagenomes (Figure 1c). Both functional and taxonomic profiles can be generated by searching raw reads, predicted genes or protein sequences against SCycDB. A random subsampling function is also provided in the PERL scripts to eliminate sequencing depth differences among different samples. Functional profiles of S cycling microbial communities are provided at the gene family level. Taxonomic profiles of S cycling microbial communities are provided at various taxonomic levels.
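The profiling step described above can be illustrated with a minimal Python sketch. It is not the PERL code distributed with SCycDB: the function names, the read and sequence identifiers, and the decision to subsample reads (rather than hits) are all assumptions made for illustration. The sketch shows the two ideas in the text: random subsampling to a common depth, and counting only hits to S cycling gene families while discarding hits to homologous orthology groups.

```python
import random
from collections import Counter


def subsample(read_ids, depth, seed=0):
    """Randomly keep `depth` reads so that all samples share one depth.

    Assumption: subsampling is applied to read identifiers before counting.
    """
    read_ids = sorted(read_ids)
    if depth > len(read_ids):
        raise ValueError("requested depth exceeds the number of reads")
    rng = random.Random(seed)
    return set(rng.sample(read_ids, depth))


def functional_profile(best_hits, seq_to_family, s_cycling_families, kept_reads):
    """Count reads per S cycling gene family.

    best_hits: read id -> database sequence id (top hit only)
    seq_to_family: database sequence id -> gene family or homologous group
    s_cycling_families: gene families that belong to the S cycle
    kept_reads: read ids retained after subsampling
    """
    counts = Counter()
    for read_id, seq_id in best_hits.items():
        if read_id not in kept_reads:
            continue
        family = seq_to_family.get(seq_id)
        # Hits to homologous (non S cycling) groups are dropped here,
        # which is how the homologues reduce false positive assignments.
        if family in s_cycling_families:
            counts[family] += 1
    return counts


# Hypothetical toy data; none of these identifiers come from SCycDB.
all_reads = ["r1", "r2", "r3", "r4", "r5"]
best_hits = {"r1": "seqA", "r2": "seqB", "r4": "seqC"}
seq_to_family = {"seqA": "dsrA", "seqB": "homologous_group_1", "seqC": "dsrA"}
s_families = {"dsrA", "soxB"}

kept = subsample(all_reads, depth=5)
profile = functional_profile(best_hits, seq_to_family, s_families, kept)
print(dict(profile))
```

In this toy example, read r2 is assigned to a homologous group and is therefore excluded from the functional profile rather than being forced onto the nearest S cycling gene family.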

FIGURE 1. A framework of SCycDB construction. (a) Core database construction: seed sequences were retrieved from the Swiss-Prot database using manually refined keywords, and sequences retrieved from the TrEMBL database were merged with the seed sequences at a 30% identity cutoff, generating the core database. (b) Full database construction: S cycling gene families and homologous gene families were retrieved from the public orthology databases and NCBI RefSeq database, and representative sequences were extracted and included in the full database. (c) Metagenomic profiling: PERL scripts were provided to generate both functional and taxonomic profiles for shotgun metagenomes with selected searching tools

2.2 Database sources

We used the UniProt database to retrieve seed sequences and construct the core database (The Uniprot Consortium, 2017). The orthology databases used for database merging and homologous gene identification in this study included arCOG (Makarova et al., 2015), COG (Galperin et al., 2015), eggNOG (Huerta-Cepas et al., 2019) and KEGG (Kanehisa et al., 2016). The NCBI RefSeq database (O’Leary et al., 2016) of archaea and bacteria was used for enriching SCycDB and for taxonomically classifying S cycling microbial communities.

2.3 Case study

We applied SCycDB to analyse S cycling microbial communities from four distinct habitats: freshwater, hot spring, marine sediment and soil. The metagenome sequencing data files were downloaded from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra) (Table S1) (Bahram et al., 2018; Dodsworth et al., 2013; Mitchell et al., 2018; Seitz et al., 2016; Tran et al., 2018). The forward and reverse reads were merged by the program pear (options: -q 30) (Zhang et al., 2014). Merged sequences were searched against the arCOG, COG, eggNOG, KEGG and SCycDB databases with the diamond program (options: -k 1 -e 0.0001 -p 20) (Buchfink et al., 2015). Sequences matched to SCycDB were extracted to generate functional profiles of S cycling microbial communities. These sequences were subsequently used to generate taxonomic profiles of S cycling microbial communities at different taxonomic levels using kraken2 (Wood et al., 2019). One-way analysis of variance (ANOVA) was performed with IBM SPSS 22 (SPSS Inc.) to compare the abundances of gene families and taxonomic groups among different habitats.
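As an illustration of the search step, the sketch below assembles a diamond command line with the options reported above (-k 1 -e 0.0001 -p 20). Only those three options come from the text. The blastx subcommand is an assumption (a translated search, chosen because the merged reads are nucleotide sequences searched against protein databases), and the file names are hypothetical placeholders. The command is only built and printed, not executed.

```python
import shlex


def diamond_command(query, database, output, threads=20):
    """Build a diamond search command using the options given in the text."""
    return [
        "diamond", "blastx",  # assumption: translated search of nucleotide reads
        "-q", query,          # merged reads (hypothetical file name below)
        "-d", database,       # formatted diamond database (hypothetical name)
        "-o", output,         # tabular hits (hypothetical name)
        "-k", "1",            # report only the best hit per query
        "-e", "0.0001",       # e-value cutoff
        "-p", str(threads),   # number of CPU threads
    ]


cmd = diamond_command("merged_reads.fastq", "SCycDB.dmnd", "hits.tsv")
print(shlex.join(cmd))
```

Keeping only the single best hit per read means each read contributes to at most one gene family or homologous group in the resulting profile.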

3 RESULTS

3.1 Summary of gene families and pathways in SCycDB

The constructed SCycDB contains 585,055 sequences covering 207 gene families involved in eight key S cycling pathways, including assimilatory sulphate reduction, dissimilatory S reduction and oxidation, S reduction, SOX systems, S oxidation, S disproportionation, organic S transformation, and linkages between inorganic and organic S transformation; S compound transporters are also included as “others” (Table 1; Table S2, Figures S1–S3).

TABLE 1. Summary of sulphur (S) cycling gene families with representative sequences and orthology groups
Pathways | Gene | Annotation | Core database sequences | Full database sequences | Orthology groups (arCOG) | Orthology groups (COG) | Orthology groups (eggNOG) | Orthology groups (KEGG)
--- | --- | --- | --- | --- | --- | --- | --- | ---
Assimilatory sulphate reduction | cysC | Adenylylsulphate kinase | 13,037 | 20,875 | 39 | 51 | 1025 | 135
 | cysND | Sulphate adenylyltransferase | 18,468 | 29,706 | 19 | 38 | 593 | 91
 | cysH | Phosphoadenosine phosphosulphate reductase | 7962 | 14,572 | 10 | 10 | 234 | 32
 | cysIJ | Sulphite reductase | 8030 | 19,497 | 11 | 21 | 608 | 81
 | cysNC | Bifunctional enzyme CysN/CysC | 4029 | 6383 | 12 | 12 | 371 | 43
 | cysQ | 3′(2′), 5′-bisphosphate nucleotidase | 9548 | 14,916 | 15 | 28 | 534 | 72
 | nrnA | Bifunctional oligoribonuclease and PAP phosphatase | 308 | 2972 | 2 | 3 | 37 | 10
 | sat | Sulphate adenylyltransferase | 4307 | 6017 | 8 | 11 | 270 | 47
 | sir | Sulphite reductase (ferredoxin) | 418 | 2517 | 1 | 3 | 95 | 8
Dissimilatory sulphur reduction and oxidation | aprAB | Adenylylsulphate reductase | 92 | 558 | 9 | 8 | 36 | 18
 | dsrAB | Dissimilatory sulphite reductase | 11,869 | 11,741 | 11 | 13 | 186 | 31
 | dsrC | Dissimilatory sulphite reductase related protein | 35 | 247 | 0 | 0 | 1 | 0
 | dsrDNT | Protein DsrD DsrN DsrT | 19 | 167 | 2 | NA | 3 | 3
 | dsrEFH | Sulfurtransferase | 63 | 423 | NA | 1 | 5 | 1
 | dsrL | NADPH:acceptor oxidoreductase DsrL | 10 | 43 | 3 | 6 | 19 | 3
 | dsrMKJOP | Membrane-bound DsrMKJOP complex | 65 | 626 | 2 | 4 | 24 | 11
 | qmoABC | Quinone-modifying oxidoreductase | 33 | 417 | 6 | NA | 23 | 2
 | rdsr | Reverse dissimilatory sulphite reductase | 81 | 115 | NA | NA | 6 | 2
 | sat | Sulphate adenylyltransferase | 4307 | 6017 | 8 | 11 | 270 | 47
Sulphur reduction | asrABC | Anaerobic sulphite reductase | 716 | 1833 | 11 | 16 | 35 | 15
 | fsr | Sulphite reductase (coenzyme F420) | 3 | 26 | 4 | 4 | 9 | 3
 | hydABDG | Sulphhydrogenase | 7 | 161 | 2 | 2 | 2 | 2
 | mccA | Dissimilatory sulphite reductase | 6 | 15 | NA | 1 | 3 | NA
 | otr | Octaheme tetrathionate reductase Otr | 24 | 248 | 2 | 1 | 12 | 2
 | psrABC | Polysulphide reductase | 6 | 172 | NA | 0 | 6 | 3
 | rdlA | Putative rhodanese-like protein | 8 | 54 | NA | NA | 2 | NA
 | shyABCD | Sulphhydrogenase 2 | 12 | 313 | 0 | 2 | 2 | 2
 | sreABC | Sulphur reductase | 12 | 66 | 0 | 1 | 3 | 1
 | sudAB | Sulphide dehydrogenase | 119 | 2574 | 9 | 20 | 54 | 16
 | ttrABC | Tetrathionate reductase | 1792 | 6084 | 18 | 17 | 175 | 39
SOX systems | soxAX | L-cysteine S-thiosulphotransferase | 1734 | 4500 | 5 | 8 | 248 | 34
 | soxB | S-sulfosulfanyl-L-cysteine sulphohydrolase | 1357 | 2068 | 2 | 3 | 64 | 21
 | soxCD | Sulphite dehydrogenase | 1069 | 3491 | 4 | 14 | 240 | 27
 | soxYZ | Sulphur-oxidizing protein SoxYZ | 2065 | 4939 | 3 | 10 | 142 | 26
Sulphur oxidation | doxAD | Thiosulphate dehydrogenase [quinone] | 3 | 31 | 0 | NA | 1 | 0
 | fccAB | Sulphide dehydrogenase | 75 | 962 | 1 | 1 | 36 | 6
 | glpE | Thiosulphate sulfurtransferase | 3727 | 8073 | 1 | 11 | 74 | 22
 | soeABC | Sulphite dehydrogenase | 6 | 617 | NA | 2 | 6 | 2
 | sorAB | Sulphite cytochrome c oxidoreductase | 82 | 918 | 1 | 4 | 20 | 8
 | sqr | Sulphide:quinone oxidoreductase | 89 | 559 | 0 | 2 | 1 | 1
 | sseA | Thiosulphate sulfurtransferase | 30 | 3165 | 0 | 1 | 17 | 5
 | tsdAB | Thiosulphate dehydrogenase | 21 | 1047 | NA | 0 | 6 | 2
Sulphur disproportionation | phsABC | Thiosulphate reductase | 506 | 1343 | 4 | 7 | 39 | 14
 | tetH | Tetrathionate hydrolase TetH | 3 | 22 | NA | NA | 0 | NA
 | sor | Sulphur oxygenase/reductase | 9 | 29 | 0 | NA | 0 | 0
Organic sulphur transformation | acuI | Acrylyl-CoA reductase AcuI | 458 | 4301 | 1 | 7 | 121 | 29
 | acuNK | Acrylyl-CoA transferase and hydratase | 4 | 63 | 3 | 3 | 9 | 13
 | betAB | Betaine biosynthesis protein | 9452 | 32,160 | 9 | 28 | 398 | 116
 | betC | Choline-sulphatase | 1689 | 4490 | NA | 2 | 75 | 12
 | comABCDE | Coenzyme M biosynthesis protein | 1933 | 4377 | 10 | 10 | 221 | 27
 | dddAC | 3-Hydroxypropionate dehydrogenase | 8 | 67 | NA | 2 | 6 | 3
 | dddDKLPQWY | Dimethylsulphoniopropionate lyase | 48 | 441 | 1 | 1 | 8 | 4
 | dddT | Betaine/carnitine/choline transporter | 2 | 103 | NA | NA | 7 | 2
 | ddhABC | Dimethylsulphide dehydrogenase | 11 | 51 | 1 | 2 | 4 | 4
 | dmdABCD | Dimethylsulphoniopropionate demethylation protein | 156 | 6234 | 2 | 6 | 34 | 21
 | dmoA | Dimethyl-sulphide monooxygenase | 213 | 1900 | 2 | 2 | 36 | 9
 | dmsABC | Anaerobic dimethyl sulphoxide reductase | 2670 | 16,676 | 27 | 29 | 245 | 67
 | dsyB | DsyB | 94 | 64 | NA | NA | 20 | 3
 | gdh | Glutamate dehydrogenase (NADP+) | 32 | 3931 | 0 | 1 | 10 | 2
 | hpsN | Sulphopropanediol 3-dehydrogenase | 35 | 976 | 1 | NA | 6 | 3
 | hpsOP | R or S-dihydroxypropanesulphonate-2-dehydrogenase | 4 | 122 | 1 | 2 | 4 | 11
 | iseJ | Isethionate dehydrogenase | 3 | 6 | NA | 1 | 1 | 3
 | isfD | Sulphoacetaldehyde reductase | 143 | 2726 | 2 | 6 | 22 | 18
 | mddA | Methanethiol S-methyltransferase | 6 | 315 | NA | 0 | 2 | 0
 | mdh | Malate dehydrogenase | 28,829 | 30,400 | 38 | 67 | 858 | 155
 | mtsAB | Methylthiol:coenzyme M methyltransferase | 5 | 37 | 0 | NA | 2 | 0
 | prpE | Propionate–CoA ligase | 1295 | 7177 | 5 | 4 | 93 | 21
 | pta | Phosphate acetyltransferase | 6697 | 20,805 | 27 | 33 | 736 | 110
 | sfnG | Dimethylsulphone monooxygenase | 10 | 721 | 1 | 0 | 4 | 1
 | slcCD | Sulpholactate dehydrogenase | 57 | 3216 | 5 | 4 | 43 | 29
 | sqdBDX | Sulpholipid sulphoquinovosyl diacylglycerol biosynthesis protein | 204 | 2045 | 1 | 6 | 28 | 13
 | tauXY | Taurine dehydrogenase | 24 | 140 | NA | 2 | 10 | 4
 | tmm | Trimethylamine monooxygenase | 24 | 382 | NA | NA | 2 | 1
 | toa | Taurine:2-oxoglutarate transaminase | 3 | 226 | 2 | 2 | 5 | 3
 | tpa | Taurine-pyruvate aminotransferase | 128 | 2263 | 5 | 4 | 17 | 21
 | yihQ | Sulphoquinovosidase | 35 | 816 | NA | 0 | 2 | 1
Linkages between inorganic and organic sulphur transformation | cuyA | L-cysteate sulpho-lyase | 62 | 1116 | 1 | 0 | 8 | 2
 | cysEKMO | Cysteine biosynthesis protein | 25,547 | 62,541 | 57 | 90 | 1389 | 269
 | hdrABCDE | Heterodisulphide reductase | 122 | 1348 | 39 | 35 | 94 | 26
 | mccB | Cystathionine gamma-lyase/homocysteine desulphydrase | 176 | 1238 | NA | 7 | 49 | 22
 | metABCXYZ | L-Cystathionine biosynthesis protein | 16,131 | 52,563 | 29 | 39 | 1103 | 171
 | msmAB | Methanesulphonate monooxygenase | 9 | 36 | NA | 1 | 3 | 1
 | mtoX | Methanethiol oxidase | 11 | 27 | 1 | 1 | 3 | 1
 | ssuDE | Alkanesulphonate monooxygenase | 7663 | 15,664 | 4 | 8 | 290 | 44
 | suyAB | (2R)-sulpholactate sulpho-lyase | 138 | 1182 | 3 | 1 | 16 | 11
 | tauD | Taurine dioxygenase | 1129 | 6531 | 1 | 1 | 34 | 14
 | tbuBC | Toluene-3-monooxygenase | 7 | 90 | NA | 2 | 2 | 1
 | tmoCF | Toluene-4-monooxygenase | 10 | 140 | NA | 1 | 4 | 4
 | touCF | Toluene o-xylene monooxygenase | 5 | 14 | NA | 2 | 4 | 1
 | xsc | Sulphoacetaldehyde acetyltransferase | 1026 | 2130 | 1 | 6 | 99 | 16
Others | cuyZ | Sulphite exporter | 1 | 6 | NA | 1 | 5 | 1
 | cysAPUWZ | Sulphate/thiosulphate transporter | 18,458 | 38,216 | 93 | 138 | 1628 | 415
 | hpsKLM | Dihydroxypropanesulphonate transporter | 5 | 172 | NA | NA | 4 | NA
 | iseKLM | Isethionate TRAP transporter | 3 | 153 | NA | NA | NA | 1
 | sbp | Sulphate-binding protein | 855 | 5602 | 2 | 0 | 71 | 7
 | sgpABC | Sulphur globule protein | 10 | 288 | 2 | NA | 189 | 25
 | soxL | Sulphur transferase, periplasm | 2 | 38 | NA | 1 | 1 | 1
 | ssuABC | Sulphonate transport system | 9787 | 34,648 | 43 | 104 | 970 | 347
 | sulP | Sulphate permease | 542 | 4727 | 31 | 16 | 729 | 82
 | tauABC | Taurine transport system | 4332 | 13,826 | 20 | 57 | 270 | 143
 | tauE | Sulphite/organosulphonate exporter | 1 | 32 | NA | NA | 3 | NA
 | tauZ | Membrane protein TauZ | 7 | 236 | NA | 1 | 14 | 2
 | tusA | Sulphur carrier protein TusA | 3431 | 3907 | 2 | 6 | 39 | 14
 | tusBCDE | tRNA 2-thiouridine synthesizing protein | 5641 | 10,234 | 7 | 9 | 132 | 24

Note

  • The gene families responsible for identical reactions were combined together. More detailed information is provided in Table S2. NA: not detected in the database.

3.1.1 Assimilatory sulphate reduction

A total of 11 gene families with 117,455 representative sequences and 4580 homologous orthology groups are included for this pathway (Table 1; Figure S1A). Among these, gene families including cysD, cysN and sat participate in sulphate activation to adenosine 5′-phosphosulphate (APS), and cysC converts APS to phosphoadenosine 5′-phosphosulphate (PAPS). The gene family cysNC encodes the bifunctional enzyme CysN/CysC responsible for sulphate conversion to PAPS, cysH is responsible for PAPS reduction to sulphite, and cysI, cysJ and sir are responsible for sulphite reduction to sulphide.
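The pathway totals quoted above can be reproduced from the rows of Table 1. The short Python check below is only a worked example: the dictionary layout and function name are illustrative choices, while the numbers are transcribed from the assimilatory sulphate reduction rows of Table 1 (full database sequences, followed by arCOG, COG, eggNOG and KEGG orthology groups).

```python
# Values transcribed from Table 1 (assimilatory sulphate reduction).
# Each entry: full database sequences, (arCOG, COG, eggNOG, KEGG) groups.
ASSIMILATORY = {
    "cysC": (20875, (39, 51, 1025, 135)),
    "cysND": (29706, (19, 38, 593, 91)),
    "cysH": (14572, (10, 10, 234, 32)),
    "cysIJ": (19497, (11, 21, 608, 81)),
    "cysNC": (6383, (12, 12, 371, 43)),
    "cysQ": (14916, (15, 28, 534, 72)),
    "nrnA": (2972, (2, 3, 37, 10)),
    "sat": (6017, (8, 11, 270, 47)),
    "sir": (2517, (1, 3, 95, 8)),
}


def pathway_totals(rows):
    """Sum sequences and orthology groups over all rows of one pathway."""
    sequences = sum(full for full, _ in rows.values())
    groups = sum(sum(per_database) for _, per_database in rows.values())
    return sequences, groups


print(pathway_totals(ASSIMILATORY))  # (117455, 4580)
```

The nine table rows cover the 11 gene families because families responsible for identical reactions are combined in Table 1 (for example, cysND and cysIJ).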

3.1.2 Dissimilatory sulphur reduction and oxidation

Twenty-two gene families with 20,354 sequences and 775 homologous orthology groups are covered for dissimilatory S reduction and oxidation (Table 1; Figure S1B). The gene family sat participates in the conversion between sulphate and APS, while aprAB and qmoABC participate in the transformation between APS and sulphite. The dsr gene families are involved in both dissimilatory S reduction and oxidation, with some members of the gene families (e.g., dsrAB, dsrC, dsrD, dsrEFH, dsrL, dsrMKJOP) responsible for the transformation between sulphite and sulphide.

3.1.3 SOX systems

Seven gene families, including soxA, soxB, soxC, soxD, soxX, soxY and soxZ, are involved in SOX systems for thiosulphate oxidation to sulphate (Table 1; Figure S1C). The SOX system genes encode SoxAX, SoxYZ, SoxB and SoxCD proteins. A total of 14,998 sequences and 851 homologous orthology groups are included in SCycDB.

3.1.4 Sulphur reduction

The S reduction pathway contains 26 gene families encoding sulphite reductase, tetrathionate reductase, S reductase and polysulphide reductase with a total of 11,546 representative sequences and 496 homologous orthology groups (Table 1; Figure S1D). Among these, asrABC, fsr and mccA are responsible for sulphite reduction to sulphide, otr and ttrABC for tetrathionate reduction to thiosulphate, sreABC and psrABC for elemental S reduction and polysulphide reduction, respectively, and hydABDG, shyABCD and sudAB for the reduction of both elemental S and polysulphide to sulphide.

3.1.5 Sulphur oxidation

A total of 14 gene families are involved in S oxidation pathways with a total of 15,372 representative sequences and 231 homologous orthology groups (Table 1; Figure S1D). The fccAB and sqr gene families participate in sulphide oxidation, doxAD, glpE, sseA and tsdAB in thiosulphate oxidation, and soeABC and sorAB in sulphite oxidation.

3.1.6 Sulphur disproportionation

Gene families such as phsABC, tetH and sor are included for this pathway with 1394 sequences and 64 homologous orthology groups (Table 1; Figure S1D). Among these, phsABC gene families encode thiosulphate reductase responsible for the transformation of thiosulphate to sulphite and sulphide, tetH for the disproportionation of tetrathionate to elemental S, thiosulphate and sulphate, and sor for the conversion of elemental S to sulphite and sulphide.

3.1.7 Organic sulphur transformation

There are 57 gene families involved in organic S transformation with a total of 147,231 sequences and 4103 homologous orthology groups (Table 1; Figure S2). Among these, the gene family dsyB encodes a methyltransferase, a key enzyme for DMSP biosynthesis. For DMSP degradation, two pathways are involved, including the cleavage pathway with dddDKLPQWY encoding DMSP lyase for the conversion of DMSP to DMS and acrylate, and the demethylation pathway with dmdABCD for the conversion of DMSP to methylmercaptopropionate (MMPA), further generating methanethiol (MeSH) and acetaldehyde (Curson et al., 2011; Moran & Durham, 2019; Moran et al., 2012). The acuI, acuNK and prpE gene families participate in acrylate utilization and detoxification, while the dmsABC, ddhABC and tmm families are involved in the transformation between DMS and dimethyl sulphoxide (DMSO) (Bilous et al., 1988; Lidbury et al., 2016; Wang et al., 2017). Other diverse organic S compounds, such as sulpholipid, sulphonate and sulphate ester, are also involved in organic S metabolism. Two enzymes encoded by sqdB and sqdDX are related to the biosynthesis of sulpholipid sulphoquinovosyl diacylglycerides (SQDG), while sulphoquinovosidase encoded by yihQ subsequently converts SQDG to sulphoquinovose (SQ) (Moran & Durham, 2019; Speciale et al., 2016). The tauXY, toa, tpa and iseJ gene families are responsible for C2 sulphonate (taurine, isethionate) conversion to sulphoacetaldehyde, and xsc and pta for the transformation of sulphoacetaldehyde to acetyl-CoA (Durham et al., 2019; Landa et al., 2019). The hpsOPN and slcCD gene families are related to the transformation of C3 sulphonate DHPS, and betABC are associated with the utilization of the sulphate ester choline-O-sulphate (Landa et al., 2019).

3.1.8 Linkages between inorganic and organic sulphur transformation

There are 35 gene families responsible for linking inorganic and organic S transformation with a total of 144,620 sequences and 4011 homologous orthology groups (Table 1; Figure S3). The hdrABCDE gene families encode a heterodisulphide reductase-like system, which links DMS oxidation with thiosulphate reduction (Koch & Dahl, 2018). Gene families including cuyA, msmAB, ssuDE, suyAB, tbuBC, tmoCF, touCF and xsc link the transformation between organic S compounds (such as alkanesulphonate, L-cysteate, methanesulphonate, sulpholactate, sulphoacetaldehyde and taurine) and sulphite, with other gene families, namely cysEKMO, mccB, metABCXYZ and mtoX, linking the transformation between organic S compounds (such as L-cysteine, L-homocysteine, L-serine and MeSH) and sulphide (Byrne et al., 1995; Landa et al., 2019; Moran & Durham, 2019; Wasmund et al., 2017).

3.1.9 Others

Thirty-one gene families encoding various transporters for sulphate, sulphite, thiosulphate and organic S compounds are also included in SCycDB with a total of 112,085 sequences and 5650 homologous orthology groups (Table 1).

3.2 A comparison of gene families detected by SCycDB and other orthology databases

To evaluate the coverage of S cycling gene families in SCycDB, the developed SCycDB was compared with other publicly available orthology databases including arCOG, COG, eggNOG and KEGG. Several critical issues affecting accurate functional assignments in metagenomics were noted. First, there are 207 gene families in SCycDB, while only 62, 130, 138 and 152 gene families are included in the arCOG, COG, eggNOG and KEGG orthology databases, respectively (Figure S4). Second, several key S cycling gene families are included in SCycDB but missing in these four public orthology databases, such as gene families for dissimilatory S reduction and oxidation (dsrMKJOP), sulphur reduction (mccA, otr, rdlA), sulphur oxidation (sorAB), sulphur disproportionation (tetH), organic sulphur transformation (dddAC, dddKQWY, dsyB) and others (Figure 2). Third, in the four public orthology databases, many different gene families defined by SCycDB were merged into one orthologous group; conversely, single gene families with distinct classification in SCycDB could be correctly found in multiple orthologous groups (Table S3). For instance, dsrAB, asrC and fsr for different S reduction pathways are merged into one orthology group in COG and eggNOG (Table S3). Similarly, phsA and psrA are not clearly distinguished in COG, eggNOG and KEGG (Table S3), whereas they are phylogenetically separated in SCycDB (Figure S5). Therefore, SCycDB, specifically designed to target gene families involved in S metabolism, has advantages over existing orthology databases in terms of coverage, representativeness and accuracy.
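The first point of the comparison can be expressed as a coverage fraction relative to SCycDB. The sketch below uses only the gene family counts stated in the text. The percentages it prints are derived here for illustration and are not figures reported in the study.

```python
# Gene family counts taken from the text; SCycDB serves as the reference.
SCYCDB_FAMILIES = 207
FAMILIES_COVERED = {"arCOG": 62, "COG": 130, "eggNOG": 138, "KEGG": 152}

# Fraction of SCycDB gene families present in each public database.
coverage = {db: n / SCYCDB_FAMILIES for db, n in FAMILIES_COVERED.items()}

# Number of SCycDB gene families absent from each public database.
missing = {db: SCYCDB_FAMILIES - n for db, n in FAMILIES_COVERED.items()}

for db in FAMILIES_COVERED:
    print(f"{db}: {coverage[db]:.1%} covered, {missing[db]} families missing")
```

Even for KEGG, the database with the most S cycling gene families among the four, 55 of the 207 SCycDB gene families are absent.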

FIGURE 2. A comparison of S cycling gene families in SCycDB with other public orthology databases. Different colours in the heatmap represent coverage of the selected S cycling gene families in corresponding orthology databases. SCycDB was used as a reference for the comparison. Grey colour indicates the absence of this gene family in the public orthology databases [Colour figure can be viewed at wileyonlinelibrary.com]

3.3 Taxonomic composition of S cycling genes and pathways in SCycDB

To understand the taxonomic composition of S cycling genes and pathways in SCycDB, we mapped sequences targeting S cycling genes and pathways to their affiliated reference genomes from the NCBI RefSeq. In total, the developed SCycDB covers 47 phyla, 82 classes, 197 orders, 461 families and 2562 genera of bacteria, and five phyla, 12 classes, 22 orders, 37 families and 122 genera of archaea (Table 2). For bacteria, Proteobacteria (this phylum covers 91.3% of the genes), Firmicutes (67.6%), Actinobacteria (62.8%) and Bacteroidetes (44.0%) are the dominant phyla, with Pseudomonas (this genus covers 51.7% genes), Escherichia (45.9%), Bacillus (45.4%) and Vibrio (36.7%) representing the dominant genera in SCycDB (Table S4). Further analysis shows that organic S transformation has the highest coverage of microorganisms, containing 42 phyla and 2289 genera, and especially assimilatory sulphate reduction accounts for one of the largest coverage groups with 40 phyla and 2059 genera, while 40 phyla and 2204 genera are involved in the linkages between inorganic and organic S transformation (Table 2). For archaea, Euryarchaeota, Crenarchaeota, Thaumarchaeota, Candidatus Bathyarchaeota and Candidatus Korarchaeota are the dominant phyla in SCycDB (Table S4). At the genus level, organic S transformation has the highest diversity with the involvement of 84 genera, followed by assimilatory sulphate reduction (81 genera), and linkages between inorganic and organic S transformation (76 genera) (Table 2). These results indicate that SCycDB covers a high diversity of microorganisms participating in the S cycle, providing a useful platform for the search and annotation of S cycling genes, pathways and associated key microorganisms in the environment.

TABLE 2. Summary of S cycling pathways and the number of taxa covered at different taxonomic levels in SCycDB

| Pathway | Phylum (Archaea) | Phylum (Bacteria) | Class (Archaea) | Class (Bacteria) | Order (Archaea) | Order (Bacteria) | Family (Archaea) | Family (Bacteria) | Genus (Archaea) | Genus (Bacteria) |
|---|---|---|---|---|---|---|---|---|---|---|
| Assimilatory sulphate reduction | 5 | 40 | 9 | 74 | 17 | 179 | 25 | 417 | 81 | 2059 |
| Dissimilatory sulphur reduction and oxidation | 4 | 29 | 8 | 52 | 16 | 117 | 21 | 235 | 49 | 657 |
| Sulphur reduction | 3 | 23 | 7 | 44 | 11 | 88 | 19 | 173 | 27 | 392 |
| SOX systems | 1 | 19 | 4 | 35 | 6 | 91 | 8 | 192 | 12 | 683 |
| Sulphur oxidation | 2 | 22 | 3 | 34 | 4 | 79 | 7 | 167 | 8 | 506 |
| Sulphur disproportionation | 2 | 4 | 2 | 11 | 2 | 24 | 2 | 37 | 3 | 53 |
| Organic sulphur transformation | 5 | 42 | 12 | 73 | 21 | 182 | 32 | 437 | 84 | 2289 |
| Linkages between inorganic and organic sulphur transformation | 5 | 40 | 11 | 74 | 19 | 177 | 27 | 405 | 76 | 2204 |
| Others | 5 | 33 | 9 | 66 | 13 | 163 | 19 | 378 | 52 | 1746 |
| Total | 5 | 47 | 12 | 82 | 22 | 197 | 37 | 461 | 122 | 2562 |
| NCBI RefSeq | 25 | 156 | 18 | 98 | 29 | 224 | 49 | 530 | 194 | 3815 |
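
To put the genus-level counts in Table 2 in context, they can be expressed as a share of the genera listed in the "NCBI RefSeq" row. The sketch below uses only numbers copied from the table; the helper name `share_of_refseq` is hypothetical, and only the genus columns are used.

```python
# Genus counts (Archaea, Bacteria) copied from Table 2.
genera = {
    "Assimilatory sulphate reduction": (81, 2059),
    "Organic sulphur transformation": (84, 2289),
    "Linkages between inorganic and organic sulphur transformation": (76, 2204),
    "Total": (122, 2562),
}
REFSEQ_GENERA = (194, 3815)  # genus columns of the "NCBI RefSeq" row


def share_of_refseq(counts, refseq=REFSEQ_GENERA):
    """Return (archaeal %, bacterial %) of RefSeq genera covered."""
    archaea, bacteria = counts
    return (
        round(100 * archaea / refseq[0], 1),
        round(100 * bacteria / refseq[1], 1),
    )


shares = {pathway: share_of_refseq(c) for pathway, c in genera.items()}
for pathway, (arch_pct, bact_pct) in shares.items():
    print(f"{pathway}: {arch_pct}% of archaeal and {bact_pct}% of bacterial genera")
```

By this measure, the sequences in SCycDB are affiliated with roughly two thirds of the bacterial genera and a little under two thirds of the archaeal genera listed for RefSeq in Table 2.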

3.4 Application of SCycDB for functional and taxonomic profiling of environmental samples

We applied SCycDB and four other orthology databases (arCOG, COG, eggNOG and KEGG) to profile S cycling microbial communities from four habitats: freshwater, hot spring, marine sediment and soil (Figures 3 and 4; Figure S6). The number of S cycling gene families detected by searching against SCycDB ranged from 174 to 188 in the four habitats, which was significantly (p < .05) greater than the numbers detected with the other four databases (55–58 in arCOG, 125–128 in COG, 129–134 in eggNOG, 120–135 in KEGG). Notably, the run-time with SCycDB (418–2264 s) was much shorter than with eggNOG (3625–17,749 s) and KEGG (2243–11,161 s) (Table S5).

FIGURE 3. Relative abundances of S cycling gene families annotated by SCycDB in four habitats. (a) Assimilatory sulphate reduction and dissimilatory S reduction and oxidation; (b) SOX systems, S reduction, oxidation and disproportionation; (c) organic S transformation and linkages between inorganic and organic S transformation. Functional profiling of those microbial communities identified 154–193 gene families and 112,417–213,847 sequences in those four habitats at a random subsampling of 4,710,299 sequences per sample. Data are presented as mean ± SE (standard error, n = 6). Different letters (“a,” “b” or “c”) indicate a statistically significant difference (p < .05) of each gene family among four habitats. FW, freshwater; HS, hot spring; MS, marine sediment; S, soil

FIGURE 4. Relative abundances of S cycling microbial communities annotated by SCycDB at the genus level. Taxonomic profiling of S cycling microbial communities identified 32–43 phyla and 692–1340 genera in the four habitats at a random subsampling of 4,710,299 sequences per sample. Data are presented as mean ± SE (n = 6). Different letters (“a,” “b” or “c”) indicate a statistically significant difference (p < .05) of each genus among four habitats. FW, freshwater; HS, hot spring; MS, marine sediment; S, soil
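
The captions above refer to random subsampling to a common depth before relative abundances are compared. The sketch below shows one way such a step could look; it is not the SCycDB pipeline itself. It assumes that each read carries at most one gene-family label, that sampling is without replacement, and that relative abundance is the number of hits divided by the subsampling depth. The function name `subsample_profile` and the toy data are hypothetical.

```python
import random
from collections import Counter


def subsample_profile(read_annotations, depth, seed=0):
    """Randomly draw `depth` reads without replacement and count gene families.

    `read_annotations` holds one entry per read: a gene-family name, or None
    when the read has no S cycling hit.
    """
    if depth > len(read_annotations):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    drawn = rng.sample(read_annotations, depth)
    counts = Counter(a for a in drawn if a is not None)
    # Relative abundance is taken as hits per subsampled read (an assumption).
    return {family: n / depth for family, n in counts.items()}


# Toy sample: 1,000 reads, of which 30 hit dsrA and 10 hit soxB.
toy_reads = ["dsrA"] * 30 + ["soxB"] * 10 + [None] * 960
profile = subsample_profile(toy_reads, depth=500, seed=42)
print(profile)
```

Subsampling every sample to the same depth (4,710,299 sequences per sample in this study) keeps differences in sequencing effort from being mistaken for differences in gene-family richness or abundance.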

SCycDB profiles of S cycling microbial communities showed that the overall functional and taxonomic composition was significantly different (p < .05) among the four habitats profiled in this study (Table S6, Figure S7). Functional profiling of the microbial communities showed that S cycling functional genes and pathways were differentially enriched in different habitats (Figure 3). For example, the soil habitat exhibited the highest abundance of gene families involved in SOX systems (soxAX, soxCD, soxYZ) and S oxidation (sorAB, tsdAB) as well as DMSP biosynthesis and degradation (dsyB, dddDKLPQWY, dmdABCD, acuNK), whereas marine sediment had particularly high abundances of S reduction gene families (asrABC, shyABCD and ttrABC) and DMS transformation genes (ddhABC, dmsABC) (Figure 3). Taxonomic profiling of S cycling microbial communities showed that Proteobacteria was the dominant phylum of S cycling microbial communities in all four habitats (Figure S8), which is consistent with the copious representation of Proteobacteria in SCycDB (Table S4). At the genus level, the abundance of Desulfallas, Desulfobacter, Desulfococcus, Desulfomonile, Desulfotomaculum and Syntrophobacter for dissimilatory sulphate reduction, S reduction and disproportionation was higher in marine sediment than in the other three habitats (Figure 4). In contrast, the soil habitat showed high abundances of Halomonas, Pseudomonas, Rhodobacter, Roseobacter, Roseovarius, Ruegeria, Sagittula and Sulfitobacter, which are related to DMSP production and degradation (Figure 4). The above results show that SCycDB is a powerful tool to facilitate analysis of shotgun metagenome sequencing data, enabled by its capacity for fast, comprehensive and accurate functional and taxonomic profiling of S cycling microbial communities in various environments.

4 DISCUSSION

The S cycle is an important biogeochemical process largely driven by microorganisms, impacting the cycling of C and N as well as global change (Curson et al., 2011; Landa et al., 2019; Muyzer & Stams, 2008; Wasmund et al., 2017). Characterizing the function and taxonomy of microbial processes involved in S cycling is critical in providing a better understanding of the diversity of S cycling microbial populations and their specific impacts on the environment. Here, we developed SCycDB for fast and accurate functional and taxonomic profiling of S cycling microbial communities, and subsequently applied this database for metagenome sequencing data analysis. The results demonstrate that SCycDB is a useful tool to profile S cycling microbial communities from different environments. To our knowledge, this is the first comprehensive, specific database for both functional and taxonomic profiling of S cycling microbial communities.

Manually curated databases are of vital importance for improving the reliability and reproducibility of bioinformatics analysis of metagenomic data (Kanehisa et al., 2016; The Uniprot Consortium, 2017). Automatically generated orthology databases, including arCOG, COG, eggNOG and KEGG, cover 62–152 gene families involved in microbial S cycling (Galperin et al., 2015; Huerta-Cepas et al., 2019). In comparison, SCycDB is much more comprehensive, covering 207 gene families with 585,055 representative sequences. The gene families included in SCycDB are retrieved manually based on publicly available databases and the most up-to-date knowledge of S cycling. For instance, SCycDB covers gene families otherwise not included in existing databases, such as those involved in DMSP synthesis (dsyB) (Curson et al., 2017), acrylate utilization and detoxification (acuNK and dddAC) (Wang et al., 2017), and DMSP cleavage (dddKQTWY) (Li et al., 2017; Moran & Durham, 2019; Peng et al., 2019), enabling researchers to study these newly discovered gene families and metabolic pathways. These gene families have no clearly defined orthology groups in other publicly available databases, but play important roles in regulating marine S cycling and mediating the climate-active gas DMS (Moran & Durham, 2019). Also, SCycDB includes not only commonly known gene families including dsrAB, dsrC and dsrEFH, but also other poorly known dsr gene families (e.g., dsrMKJOP, dsrL, dsrN and dsrT) for dissimilatory S reduction and oxidation (Hausmann et al., 2018; Löffler et al., 2020; Pires et al., 2006). In addition, to facilitate both functional annotations and taxonomic assignments that require more accurate sequences with taxonomic information (Tu et al., 2019), the NCBI RefSeq database has been integrated into SCycDB to increase the coverage of functional gene sequences and their associated taxonomic information. Therefore, SCycDB provides the much desired ability to explore questions of “who is there” and “what they are doing” in microbial ecology.

Accuracy is critical in metagenome sequencing data analysis, and it is largely dependent on reference databases (Quince et al., 2017). SCycDB ensures its annotation accuracy in three major aspects. First, gene families and annotations have one-to-one corresponding relationships. As automatically generated orthology databases identify orthologous groups based on species-aware clustering algorithms (Huerta-Cepas et al., 2019), they cannot clearly distinguish different homologous genes. For example, the gene families psrA and phsA, which respectively encode polysulphide reductase and thiosulphate reductase subunits, are highly homologous, and thus they are always mis-annotated as a single orthology group in automatically generated orthology databases. In SCycDB, we have carefully looked into this issue, and manually separated them into two orthology groups. Second, SCycDB reduces potential mis-annotations, which may occur in automatically generated orthology databases. For example, the cysC gene sequences are generally grouped with sat sequences, resulting in the possibility of mis-annotations. Such occurrences are not uncommon, as found in cysP vs. sbp, metB vs. mccB, and sreA vs. soeA. Particularly problematic is the observation that a sequence may be assigned to more than one orthologous group. Therefore, we have manually checked those sequences and carefully assigned them to the correct gene groups to reduce possible mis-annotations in SCycDB. Third, several databases for profiling specific gene families, such as ARDB (for antibiotic resistance genes) and NCycDB (for N cycling genes), were recently developed (Liu & Pop, 2009; Tu et al., 2019). False positives could be an issue arising from the relatively small size of these specialized databases (Tu et al., 2019). To solve such a “small database” issue, SCycDB deliberately includes S cycling-related homologous orthology groups identified from multiple publicly available orthology databases. Therefore, the accuracy of annotation has been considerably enhanced with the implementation of these features.
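
The role of the homologous orthology groups in addressing the “small database” issue can be sketched as follows. This is an illustration under stated assumptions, not the scripts distributed with SCycDB: it assumes that each read is assigned by its best-scoring hit, that each target sequence maps to exactly one gene family, and that a read whose best hit is a homologous (non-target) sequence is not counted. All sequence IDs, scores and the function name `assign_gene_families` are hypothetical.

```python
def assign_gene_families(hits, subject_to_family, homolog_subjects):
    """Keep each read's best hit; drop reads whose best hit is a homolog.

    hits: iterable of (read_id, subject_id, bit_score) tuples.
    subject_to_family: maps target sequence IDs to a single gene family.
    homolog_subjects: IDs of homologous (non-target) sequences.
    """
    best = {}
    for read_id, subject_id, score in hits:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (subject_id, score)

    assignments = {}
    for read_id, (subject_id, _) in best.items():
        if subject_id in homolog_subjects:
            continue  # best match is a homolog, so the read is not counted
        family = subject_to_family.get(subject_id)
        if family is not None:
            assignments[read_id] = family
    return assignments


# Hypothetical IDs and scores, for illustration only.
subject_to_family = {"seq_psrA_1": "psrA", "seq_phsA_1": "phsA"}
homolog_subjects = {"seq_homolog_1"}
hits = [
    ("read1", "seq_psrA_1", 210.0),
    ("read1", "seq_phsA_1", 180.0),
    ("read2", "seq_phsA_1", 95.0),
    ("read2", "seq_homolog_1", 140.0),
]
result = assign_gene_families(hits, subject_to_family, homolog_subjects)
print(result)  # {'read1': 'psrA'}
```

In this toy case, read2 would have been reported as phsA if the homolog were absent from the database, which is the kind of false positive that including homologous groups is meant to suppress. The one-to-one mapping from sequence to gene family also means that read1 is assigned to psrA alone rather than to a merged psrA/phsA group.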

Unlike other orthology databases, SCycDB is designed specifically to profile S cycling microbial communities, resulting in fast annotation of functional genes, pathways and taxonomy. As shotgun metagenome sequencing data increase exponentially, fast processing of metagenome data sets is critical for metagenomic studies (Kim et al., 2013; Scholz et al., 2012; Wood & Salzberg, 2014; Zhou et al., 2015). A study of the taxonomic classifier MetaPhyler showed that it was much faster than other tools (PhymmBL, MEGAN, WebCarma) as its reference database was smaller than a general reference database (Liu et al., 2011). Also, a specific database, NCycDB, provides a fast profiling platform to identify N cycling gene families (Tu et al., 2019). In our study, annotating 370 G of metagenome data sets on 20 CPU threads took ~8, 66 and 42 h with SCycDB, eggNOG and KEGG, respectively. Therefore, annotation of S cycling microbial communities in metagenomic studies is much faster with SCycDB.
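
The approximate run times quoted above translate into the following speed-ups. This is simple arithmetic on the rounded figures in the text (~8, 66 and 42 h), so the ratios are indicative only; the variable names are ours.

```python
# Approximate total run times (hours) reported in the text for the same
# data sets on 20 CPU threads.
run_time_hours = {"SCycDB": 8, "eggNOG": 66, "KEGG": 42}

# Speed-up of SCycDB relative to each of the other databases.
speedups = {
    name: round(hours / run_time_hours["SCycDB"], 2)
    for name, hours in run_time_hours.items()
    if name != "SCycDB"
}

for name, factor in speedups.items():
    print(f"SCycDB is about {factor}x faster than {name}")
```

On these figures, annotation with SCycDB is roughly eight times faster than with eggNOG and roughly five times faster than with KEGG.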

Functional and taxonomic profiles are important objectives in shotgun metagenome sequencing data analysis to understand microbial communities from different environments (Knight et al., 2018; Quince et al., 2017). Accurate functional profiling requires comprehensive sequence databases for specific metabolic pathways, which are frequently unavailable. Using S metabolism as an example, several previous metagenomic studies only focused on inorganic S cycling, especially dissimilatory sulphate reduction (Baker et al., 2015; Hausmann et al., 2018; Vavourakis et al., 2019), probably due to a lack of organic S cycling gene families in the reference database. In this study, we included organic S cycling in SCycDB, and used it to analyse functional and taxonomic profiles of S cycling microbial communities from four types of environments, providing a full picture of microbial communities in natural ecosystems. Our results revealed a high diversity of S cycling gene families (154–193 gene families) and microorganisms (32–43 phyla and 692–1340 genera) in natural environments, especially for organic S transformation microbial communities. Also, we found significant variations in functional and taxonomic composition and structure of S cycling microbial communities among different environments. For instance, higher abundances of S reduction gene families and microorganisms in marine sediments were observed, probably linked to the importance of anaerobic respiration with S compounds as electron acceptors in the marine sediment (Jørgensen et al., 2019; Wasmund et al., 2017). Gene families and microorganisms involved in DMSP and DMS transformation were detected in all four environments, supporting the universal distribution of DMSP and DMS metabolism (Curson et al., 2011, 2017; Moran & Durham, 2019). Indeed, DMSP accounts for 10% of fixed carbon in marine environments and DMS plays an important role in S exchanges between the ocean and atmosphere (Curson et al., 2011; Landa et al., 2019; Todd et al., 2010). Consistently, the abundances of DMS transformation gene families were higher in marine sediment than in the other three environments. However, we identified a high abundance of DMSP biosynthesis and degradation gene families as well as associated microorganisms in the soil habitat, suggesting that DMSP transformation may also be an important process in soil. Therefore, these results demonstrate the vast diversity and importance of microbial S metabolisms in the environment that remain to be explored, which will be greatly facilitated by SCycDB developed in this study.

In summary, SCycDB is a manually curated, comprehensive database for fast and accurate functional and taxonomic analysis of S cycling microbial communities with shotgun metagenome sequencing data. By integrating multiple publicly available databases, the current SCycDB contains 207 gene families and 585,055 representative sequences as well as 20,761 homologous orthology groups to resolve the “small database” issue. Applied to profile S cycling microbial communities from various environments, SCycDB has demonstrated its utility for exploring the S cycling process and associated microbial communities in the environment. The SCycDB developed here provides a comprehensive and fast metagenomic analysis tool specialized for studying S metabolisms that will be periodically updated.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (Grant Nos. 91951207, 31770539, 31971446, 31700427, 92051110), the National Key Research and Development Program of China (Grant Nos. 2019YFA0606700, 2020YFA0607600, 2017YFA0604300) and the Natural Science Foundation of Shandong Province (Grant No. ZR201911110287).

CONFLICT OF INTEREST

The authors declare that they have no known competing interests.

AUTHOR CONTRIBUTIONS

Q.T. and Z.H. designed the database structure. X.Y., J.Z., W.S. and M.X. searched and manually collected the sequences. Q.T. wrote the scripts for database construction. X.Y. constructed the database and drafted the manuscript. Q.H., Y.P., Y.T., C.W., L.S., S.W., Q.Y., J.L., Q.T. and Z.H. revised the manuscript. All authors read and approved the final manuscript.

DATA AVAILABILITY STATEMENT

SCycDB database files are available at https://github.com/qichao1984/SCycDB.
