Volume 2025, Issue 1 9676659
Research Article
Open Access

Improved Key Microbial Biomarker Discovery Using Ensemble Statistical Methods

Walter Pirovano

Corresponding Author

Product Development Department, BaseClear B.V., Leiden, the Netherlands

Department of Complex Trait Genetics, Centre for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit, Amsterdam, the Netherlands

Yashjit Gangopadhyay

Product Development Department, BaseClear B.V., Leiden, the Netherlands

Mirna Lilian Baak

Bioinformatics Department, BaseClear B.V., Leiden, the Netherlands

Christiaan Arie de Leeuw

Department of Complex Trait Genetics, Centre for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit, Amsterdam, the Netherlands

Radhika Bongoni

Product Development Department, BaseClear B.V., Leiden, the Netherlands

Eline Suzanne Klaassens

Product Development Department, BaseClear B.V., Leiden, the Netherlands

First published: 20 May 2025
Academic Editor: Jiong Yu

Abstract

In recent years, there has been a growing awareness of the importance of the microbiome in health and disease. Consequently, the number of large microbiome-related clinical trials has also significantly increased. However, advanced biostatistical analysis is required to properly combine microbiome taxonomic abundance data with phenotypical metadata and reliably predict disease states. While differential abundance analysis and machine-learning techniques are widely used to perform such analyses, selecting the best method is not trivial due to the complexity and specific characteristics of both the data and the algorithms. Here, we present a consensus-based key microbial biomarker (KMB) biostatistical analysis framework that links microbial abundance obtained from amplicon-based or shotgun metagenome sequencing with metadata. The framework integrates machine learning (ML) algorithms and statistical methods to determine the most relevant microbial biomarkers and signatures that explain variation in the microbial abundance counts and metadata classes based on predefined metrics. We evaluated the performance of our framework on publicly available case-control datasets of colorectal cancer, Alzheimer’s disease, and Parkinson’s disease and show that, compared to individually run methods, the combined approach is better able to detect KMB species and signatures associated with health and disease conditions. We conclude that our proposed KMB framework provides an innovative and robust strategy that can contribute to the further development of improved diagnostic tools for early disease detection, personalized medicine design, patient stratification, and a better general understanding of the mechanisms behind observed results in pre- and postclinical trials.

1. Introduction

The gut microbiome comprises a wealth of bacteria, archaea, viruses, and eukaryotes. Together, these microbiota play a crucial role in protecting the human host against diseases and in maintaining a ‘healthy’ gut microbiome [1]. In recent years, an increasing number of studies have indicated that numerous human diseases and disorders are associated with disruptions in the microbiome, leading to an imbalanced composition. These include colorectal cancer (CRC) [2, 3], inflammatory bowel disease (IBD) [4, 5], hypertension [6], diabetes and obesity [7, 8], Alzheimer’s disease (AD) [9, 10], and Parkinson’s disease (PD) [11, 12]. This phenomenon, known as dysbiosis [13], is typically correlated with individual lifestyle and clinical biomarkers such as diet [14], cholesterol [15], phosphorylated tau [16], and alpha-synuclein [17].

While a healthy gut is often associated with high microbial diversity [18], dysbiosis is characterized by reduced microbial diversity and compositional shifts. Detecting microbial biomarkers indicative of disease states is of crucial importance for the development of novel diagnostic tools [19] and personalized medicine, as some drugs are metabolized by key microbes before absorption [20]. Also, biomarker analysis deepens our understanding of the general mechanisms underlying host-microbiome interactions [21]. However, given the complexity and high dimensionality of microbiome data, advanced algorithms and statistical methods are essential to explore host-microbiome interactions and reliably predict disease states.

In this context, machine learning (ML) algorithms such as random forest (RF), logistic regression (LR), and linear discriminant analysis (LDA) have demonstrated significant value in associating taxonomic features of microbiomes with phenotypical observations [22, 23]. Methods like least absolute shrinkage and selection operator (LASSO) [24], Boruta [25], and linear discriminant analysis effect size (LEfSe) [26] are widely used to discover microbial biomarkers, though each has its strengths and limitations. Importantly, their predictive power depends on factors such as sample size, sequencing data quality, and provided features. The variability in experimental and analysis setups across studies highlights the importance of employing multiple methods of biomarker analysis [23].

Here, we present a novel, consensus-based key microbial biomarker (KMB) prediction framework that integrates microbial taxonomic abundance tables (generated from either amplicon-based or shotgun metagenomics data) with clinical metadata. The approach combines state-of-the-art ML techniques and statistical models with complementary characteristics. In the first step, linear regression with LASSO (elastic net approach), feature selection with Boruta (combined with RF), LEfSe, and differential abundance analysis (DAA) are used to identify an initial set of biomarkers. In the second step, the identified markers are filtered and prioritized based on their significance using statistical measures including relative weight, Gini impurity, LDA score, p value, and q value.

Using four well-characterized, publicly available datasets (covering CRC, AD, and PD; see Table 1), we demonstrate that integrating multiple (and diverse) strategies into a single framework, combined with an effective filtering and selection of candidate biomarkers, results in more consistent and reliable predictions. These predictions are less influenced by the specific dataset and features used. Consequently, the enhanced robustness of identified biomarkers improves our understanding of host-microbiome interactions in health and disease conditions. Furthermore, our findings highlight the clear advantages of shotgun metagenomics data over amplicon-based data for identifying biomarkers at higher taxonomic resolution (i.e., species or strain level).

Table 1. Description of the datasets used for the evaluation of the KMB framework. Note that the taxonomic profiles of the Kostic et al. dataset were retrieved from the R package Phyloseq (https://bioconductor.org/packages/release/bioc/html/phyloseq.html, [55]). The taxonomic profiles of the Zeller et al. dataset were retrieved from the Supporting Information of the original publication [28].
Dataset | Disease | Cohort | Sequencing method | Data archive | KMB input
Kostic et al. [46] | Colorectal cancer (CRC) | N = 185 (95 CRC vs. 90 C) | 16S rRNA (V3-V5 region); 454 GS FLX Titanium | NIH SRA archive (SRP000383) | Taxonomic profiles
Zeller et al. [28] | Colorectal cancer (CRC) | N = 141 (53 CRC vs. 88 C) | Metagenomics shotgun; Illumina HiSeq 2000/2500 | ENA database (ERP005534) | Taxonomic profiles
Ling et al. [44] | Alzheimer’s disease (AD) | N = 171 (100 AD vs. 71 C) | 16S rRNA (V3-V4 region); Illumina MiSeq | NIH SRA archive (SRP262626) | Illumina sequences
Qian et al. [48] | Parkinson’s disease (PD) | N = 80 (40 PD vs. 40 C) | Metagenomic shotgun; Illumina X Ten | NCBI BioProject (PRJNA433459) | Illumina sequences

Given that microbial species and strains often have very specific roles in disease progression and medication efficacy, high-resolution genomic data provide substantial insights into underlying functional mechanisms. This is crucial for the development of effective and safe microbiome-based therapies. Indeed, our investigation shows that the beneficial properties of bacteria observed in one condition may not be generalized to others, as their genomes may harbor functions that specifically influence a disease-related mechanism.

Finally, we discuss the effects and contributions of various ML methods evaluated in our study, and outline perspectives for future methodological development in microbiome analysis. We argue that the adoption of more reproducible and robust approaches is essential to advance our understanding of compositional and functional changes associated with health and disease.

2. Materials and Methods

2.1. KMB Analysis Framework

Our KMB analysis framework employs a multistep approach comprising data preprocessing (including filtering, normalization, and transformation), biomarker prediction (using DAA and varied ML methods), and finally the selection of microbial markers associated with specific conditions (typically health and disease). Depending on the input data, predictions can be made at the genus or at the species (or strain) level. In addition to identifying biomarker signatures, the framework also provides fold change values and statistical significance scores, including median relative weight, Gini index, LDA, p value and q value. An overview of the framework is shown in Figure 1.

Figure 1. Flowchart of our key microbial biomarker (KMB) analysis framework. After preprocessing of the taxonomic abundance counts and metadata, an ensemble of different methods (differential abundance analysis, LASSO (elastic net), Boruta (Random Forest), and LEfSe (LDA)) is used to predict biomarkers. Through a rigorous filtering and prioritization step, including different statistical measures, the final biomarker signature is defined.

2.1.1. Data Preprocessing

In the first step, the raw genus or species abundance counts obtained from 16S rRNA gene amplicon or shotgun metagenomics sequences are converted into relative abundances. Features are then filtered to retain only those with a frequency above 0.001%. Subsequently, the filtered data are transformed into relative abundances based on the total sum per sample. Concurrently, the metadata label column (e.g., disease or control) is converted into a factor in R, so that the class labels are stored as factor levels.
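As an illustration only, the preprocessing steps can be sketched in a few lines of Python (the framework itself is implemented in R). The function names and the toy counts are hypothetical, and the filtering rule is an assumption: a taxon is kept when its maximum relative abundance across samples exceeds the cutoff, which the text does not specify.

```python
def to_relative(counts):
    """Convert one sample (dict: taxon -> count) to relative abundances."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: c / total for taxon, c in counts.items()}


def preprocess(samples, cutoff=1e-5):
    """Relative abundance -> abundance filter -> per-sample renormalization.

    `samples` is a list of dicts (one per sample) mapping taxon -> raw count.
    `cutoff` is a fraction; 1e-5 corresponds to 0.001%.
    Assumption: a taxon is kept if its maximum relative abundance across
    all samples exceeds the cutoff.
    """
    rel = [to_relative(s) for s in samples]
    taxa = sorted({t for s in rel for t in s})
    keep = [t for t in taxa if max(s.get(t, 0.0) for s in rel) > cutoff]
    filtered = [{t: s.get(t, 0.0) for t in keep} for s in rel]
    # Renormalize so that every sample sums to 1 again after filtering.
    return [to_relative(s) for s in filtered]


# Toy counts, for illustration only.
raw = [
    {"Bacteroides": 600_000, "Faecalibacterium": 399_999, "RareTaxon": 1},
    {"Bacteroides": 500_000, "Faecalibacterium": 500_000, "RareTaxon": 0},
]
profiles = preprocess(raw)
```

In this toy example the rare taxon never exceeds 0.001% in any sample and is therefore dropped before renormalization.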

2.1.2. DAA (Association Testing)

In the second step, the dependencies between the preprocessed 16S rRNA gene (or shotgun) sequence features and the metadata factors are analyzed through association testing. For this, we utilize the SIAMCAT (statistical inference of associations between microbial communities and host phenotypes) R-toolbox developed by Wirbel et al. [27], building on the earlier work by Zeller et al. [28].

SIAMCAT performs DAA across two classes using the nonparametric Wilcoxon test with multiple hypothesis correction to assess the significance of associations. This analysis yields model metrics such as the p value and the FDR-adjusted p value (q value), which can be used to evaluate the importance of selected taxa. Additionally, SIAMCAT calculates a nonparametric effect size measure for features using area under the receiver operating characteristic (AU-ROC) curve scores. For our analyses, we used the check.associations function of the SIAMCAT package with default parameters.
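To make the two derived quantities concrete, the following Python sketch shows one common way to obtain q values (Benjamini-Hochberg adjustment) and an AU-ROC effect size for a single feature. It is not the SIAMCAT code; the choice of Benjamini-Hochberg as the FDR procedure, the function names, and the input values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonic q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def auroc(cases, controls):
    """Probability that a random case value exceeds a random control value.

    Ties count as 0.5; a value of 0.5 means no separation between groups.
    """
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical p values for four taxa and abundances for one taxon.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
effect = auroc([0.30, 0.25, 0.40], [0.10, 0.20, 0.05])
```

A feature would then pass the DAA criterion used later in the framework when its q value is at most 0.05.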

While numerous differential abundance testing methods exist, we selected SIAMCAT due to its robust statistical approach and its extensive adoption in the field (see [29–31]).

2.1.3. ML Methods

In parallel to the DAA, we applied three state-of-the-art ML methods to predict biomarkers: LASSO, Boruta, and LEfSe. Each method adopts a different approach for feature selection, as detailed below.

2.1.3.1. LASSO

LASSO [24] is a regression analysis method used for feature selection and regularization. Within our KMB framework, we utilized the SIAMCAT implementation of LASSO, which integrates the original LASSO method with ridge regression to form an elastic net.

In brief, we normalized features using the log.clr normalization method, followed by cross-validation using the create.data.split function (with num.folds = 10 and num.resample = 5). Practically, this means that each dataset was randomly split into 10 subsets, followed by 10 tests for which each time nine subsets were used for training and one subset for testing. The whole process was repeated five times. The resulting model was subsequently trained using the “enet” elastic net method (using the train.model function), followed by the extraction of the median relative feature (or operational taxonomic unit (OTU)) weights across the trained models, along with the percentage of models in which each feature is selected.
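The resampling scheme described above (10 folds, repeated five times, giving 50 train/test splits) can be sketched as follows. This is a simplified Python illustration with a hypothetical function name; it shuffles sample indices without stratifying by class label, which a real pipeline may handle differently.

```python
import random


def repeated_kfold(n_samples, num_folds=10, num_resample=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(num_resample):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into num_folds roughly equal subsets.
        folds = [indices[k::num_folds] for k in range(num_folds)]
        for k, test_idx in enumerate(folds):
            # Train on the other num_folds - 1 subsets, test on subset k.
            train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
            yield repeat, k, train_idx, test_idx


# 141 samples, as in the Zeller et al. cohort.
splits = list(repeated_kfold(n_samples=141))
```

Each of the resulting splits yields one trained model, and feature weights are then summarized (for example by their median) across all models.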

The key advantage of the elastic net implementation lies in its ability to linearly combine L1 (LASSO) and L2 (ridge) penalties. This penalizes the coefficients of less important features, resulting in more accurate predictions. This is particularly valuable in cases where the number of features (p) exceeds the number of samples (n), that is, p > n. Specifically,
  • LASSO selects at most n features as nonzero, thus effectively reducing the number of features in the model.

  • Ridge regression, on the other hand, minimizes coefficients but does not reduce the feature set, as no coefficient is forced to zero.

Consequently, LASSO is ideal for scenarios where only a limited number of microbial genes or taxa (features) impact the dysbiosis, whereas ridge regression is better suited for cases where many features have a similar impact on the dysbiosis. By combining these approaches, elastic net implementation models are less dependent on the specific characteristics of the dataset studied, enhancing their robustness.
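The way the two penalties are combined can be written down compactly. The sketch below uses the widely used parametrization in which a mixing weight alpha interpolates between ridge (alpha = 0) and LASSO (alpha = 1); the symbol names, the default values, and the coefficients are assumptions for illustration and are not taken from the study.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Penalty lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficient vector with one feature already shrunk to zero.
coefs = [0.5, -0.25, 0.0]
lasso_only = elastic_net_penalty(coefs, lam=1.0, alpha=1.0)  # pure L1 penalty
ridge_only = elastic_net_penalty(coefs, lam=1.0, alpha=0.0)  # pure L2 penalty
mixed = elastic_net_penalty(coefs, lam=1.0, alpha=0.5)       # elastic net
```

The penalty is added to the model's loss during fitting; the L1 part can drive coefficients exactly to zero, whereas the L2 part only shrinks them.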

LASSO, either standalone or combined with ridge regression, has been widely and successfully used to measure associations between microbiome composition and metadata (see [29, 32]). Additionally, LASSO was recently combined with sparse canonical correlation analysis to identify host gene-microbiome associations that influence host pathophysiology [33].

2.1.3.2. Boruta

Boruta [25] is a feature selection algorithm that extends the RF approach developed by Breiman [34]. It was specifically built to address the limitations of RF when the number of feature variables significantly exceeds the number of samples, a common scenario in biological datasets, where a large number of genes or microbial taxa (features) must be associated with a smaller number of samples.

The primary advantage of Boruta over traditional RF feature selection algorithms is its ability to capture all features that may potentially be relevant to the phenotype. Boruta operates as a wrapper around RF, using a built-in randomization function that randomly shuffles the datasets in multiple iterations to generate “shadow features.” These shadow features serve as a baseline for comparison, allowing the algorithm to iteratively remove features proven to be less relevant.
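The shadow-feature idea can be illustrated with a short Python sketch. It is a deliberate simplification: the actual algorithm repeats the comparison over many RF runs and applies a statistical test, whereas this sketch performs a single comparison against the best shadow feature. All names and the importance scores are hypothetical.

```python
import random


def add_shadow_features(table, seed=0):
    """Return a copy of `table` with a shuffled 'shadow_' copy of every column.

    `table` maps feature name -> list of values (one per sample).
    """
    rng = random.Random(seed)
    extended = {name: list(values) for name, values in table.items()}
    for name, values in table.items():
        shadow = list(values)
        rng.shuffle(shadow)  # breaks any link between this column and the labels
        extended["shadow_" + name] = shadow
    return extended


def confirmed_features(importance):
    """Keep real features whose importance beats the best shadow feature."""
    best_shadow = max(v for k, v in importance.items() if k.startswith("shadow_"))
    return sorted(k for k, v in importance.items()
                  if not k.startswith("shadow_") and v > best_shadow)


table = {"Fusobacterium": [0.1, 0.4, 0.0, 0.3], "Bacteroides": [0.5, 0.2, 0.6, 0.3]}
extended = add_shadow_features(table)
# Hypothetical importance scores, as a trained random forest might report them.
importance = {"Fusobacterium": 4.2, "Bacteroides": 0.8,
              "shadow_Fusobacterium": 1.1, "shadow_Bacteroides": 0.9}
hits = confirmed_features(importance)
```

Because a shadow column holds the same values in random order, any importance it receives reflects chance alone, which is what makes it a usable baseline.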

To train the RF model available within the Boruta R package and to evaluate its predictive performance, we followed the approach presented in Gong et al. [35]. This approach facilitates hyperparameter tuning while also selecting features for building the learning model. To summarize, we created a series of random test and training partitions using the createDataPartition function in R. The parameter p was set such that 70% of the data (samples) were used for training, and the remaining 30% were used for testing and model performance evaluation. Subsequently, a repeated 5-fold cross-validation was applied for feature selection.
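A 70/30 partition that preserves the class balance can be sketched in Python as follows. This is an illustrative stand-in rather than the R function used in the study; the function name and the label vector are hypothetical.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction within each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# 40 cases and 40 controls, as in the Qian et al. cohort.
labels = ["PD"] * 40 + ["C"] * 40
train_idx, test_idx = stratified_split(labels)
```

Sampling within each class keeps the case-to-control ratio the same in the training and test partitions.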

While Boruta relies on Z-scores to quantify the accuracy loss (a measure of feature importance), RF uses the Gini index (also known as Gini impurity) as an indicator of feature importance. The Gini index measures the probability that a randomly selected sample would be incorrectly classified if it were labeled at random according to the class distribution of its group. Gini impurity reaches zero when all records in a group fall into a single category. This property is particularly useful in decision trees for determining the importance of a variable in classifying the label. Features with a higher mean decrease in Gini are considered more important [36].
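The Gini impurity and the decrease obtained by a split follow directly from the class proportions. The short Python sketch below illustrates both on toy labels; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2): chance of mislabeling a random sample drawn from `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = ((len(left) / n) * gini_impurity(left)
                + (len(right) / n) * gini_impurity(right))
    return gini_impurity(parent) - weighted


pure = gini_impurity(["CRC"] * 4)                  # single class in the node
mixed = gini_impurity(["CRC", "CRC", "C", "C"])    # balanced two-class node
gain = gini_decrease(["CRC", "CRC", "C", "C"], ["CRC", "CRC"], ["C", "C"])
```

Averaging such decreases over all splits that use a given feature, across all trees, gives the mean decrease in Gini referred to above.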

One limitation of Boruta is that while it identifies feature importance, it does not directly indicate how features are associated with a specific state (e.g., health or disease). Despite this limitation, Boruta is a robust and efficient feature selection method capable of handling high-dimensional datasets. It has been shown to outperform the standard RF approaches in various microbial biomarker classification studies [37, 38].

2.1.3.3. LEfSe

The ML algorithm LEfSe was developed by Segata et al. [26] and is specifically designed for biomarker discovery. It is currently one of the most frequently used methods in the field. In brief, LEfSe identifies microbial features that best explain differences among classes (phenotypes). First, statistically different features between classes are determined using the nonparametric factorial Kruskal–Wallis sum-rank test [39]. Subsequently, the (unpaired) Wilcoxon rank-sum test [40, 41] is applied to assess the biological consistency. Finally, LDA [42] is used to estimate the effect relevance. The method also offers an option to incorporate prior knowledge to constrain the (high) dimensionality of the data. As a result, the importance of the biomarkers found is ranked according to their log-LDA scores, derived from the effect size analysis, and the p values for microbial feature significance. LEfSe was previously validated on different metagenomic microbiome datasets (human, mouse, and environmental), highlighting the general applicability of the method (see [43–45]).

2.1.4. Biomarker Selection

The ensemble of the DAA and the ML approaches (LASSO, Boruta, and LEfSe) generates a list of biomarker features (OTUs). These features are subsequently ranked by their degree of importance (effect sizes) based on the following criteria: associated q value (DAA), median relative weight (LASSO), Gini index (Boruta), and LDA scores and p value (LEfSe). Additionally, the DAA method calculates the log fold-change in taxa. We made use of conventional significance thresholds applied in the field. The p value and q value cutoffs chosen for DAA and LEfSe are standard significance thresholds. In addition, for LEfSe, the LDA score of 2 is the default value of the package. The AU-ROC threshold of 0.75 used for LASSO and Boruta is a frequently used value that indicates an acceptable model. The value is also halfway between 0.5 and 1, which indicate a random and a perfect model, respectively. For models that did not meet the 0.75 threshold, we used the Top 10 hits to still allow for comparison with other methods. The final consensus of important biomarker features is then defined as follows:
  • Select biomarkers found with:

    a. DAA (q value ≤ 0.05);

    b. LASSO (when AU-ROC ≥ 0.75, select features with percentage ≥ 0.9 and median relative weight ≥ 0.05; when AU-ROC is between 0.5 and 0.75, select Top 10 most important features and mark with (§) to indicate lower accuracy of the prediction model);

    c. Boruta (Top 10 contributing features based on the highest Gini index; when AU-ROC is between 0.5 and 0.75, mark features with (§) to indicate lower accuracy of the prediction model);

    d. LEfSe (LDA score ≥ 2 and p value ≤ 0.05 or ≤ 0.001, when number of features is < 50, mark features with LDA score < 3 as less important (#)).

  • Keep biomarker signatures that are supported by DAA and at least one ML method.

  • Prioritize biomarkers based on the consensus between multiple methods and the significance levels. Note that the final list of biomarkers (as also presented in Tables 2, 3, 4, and 5) is sorted by the fold change as this is an interpretable and universal metric.
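The consensus rule in the list above (support from DAA plus at least one ML method) can be expressed as a small set operation. The Python sketch below is illustrative only: the function name and the per-method selections are hypothetical, and it orders taxa by the number of supporting methods rather than by fold change.

```python
def consensus_biomarkers(daa_hits, ml_hits):
    """Keep taxa found by DAA and by at least one ML method.

    `daa_hits` is a set of taxa passing the DAA threshold.
    `ml_hits` maps method name -> set of taxa selected by that method.
    Returns a list of (taxon, supporting_methods), best supported first.
    """
    result = []
    for taxon in daa_hits:
        support = sorted(m for m, hits in ml_hits.items() if taxon in hits)
        if support:
            result.append((taxon, ["DAA"] + support))
    # Most widely supported taxa first; ties broken alphabetically.
    return sorted(result, key=lambda item: (-len(item[1]), item[0]))


# Hypothetical per-method selections, for illustration only.
daa = {"Fusobacterium", "Campylobacter", "Bilophila", "Dorea"}
ml = {
    "LASSO": {"Fusobacterium", "Collinsella"},
    "Boruta": {"Bilophila", "Fusobacterium"},
    "LEfSe": {"Campylobacter", "Bilophila"},
}
signature = consensus_biomarkers(daa, ml)
```

In this toy input, a taxon found only by DAA and a taxon found only by an ML method are both excluded from the signature.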

Table 2. Biomarker prediction results obtained on the colorectal cancer (CRC) dataset of Kostic et al. [46]. In bold are the genera that are found increased in CRC patients; in underline are the genera that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the genera that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Genera indicated with   were also identified in the original analysis by Kostic et al.
OTU (genus) | LASSO (mdn rel wt) | Boruta (Gini ind) | LEfSe (p value) | DAA (q value) | Fold change
Campylobacter 0.00048 0.00916 −0.9186
Fusobacterium 5.57e−05 0.00248 −0.55353
Unclassified Clostridiales 0.01688 § 4.521 0.01112 0.25932
Bacteroides 0.01397 § 0.00875 0.28736
Ruminococcus 5.45715 0.03190 0.30229
Unclassified Ruminococcaceae 4.5878 0.00026 0.00361 0.32102
Eubacterium 4.1848 3.48e−06 0.01112 0.35281
Faecalibacterium 0.00075 0.00047 0.4359
Parabacteroides 0.00011 0.00686 0.53516
Alistipes 3.56422 0.03167 0.57941
Bilophila 3.41084 0.00039 0.04964 0.67633
Unclassified Rikenellaceae 0.01402 § 0.00072 0.02788 0.71139
Collinsella 0.01856 § 1.36e−05 0.00091 1.10529
Table 3. Biomarker prediction results obtained on the colorectal cancer (CRC) dataset of Zeller et al. [28]. In bold are the species/strains that are found increased in CRC patients; in underline are the species/strains that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the taxa that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Taxa indicated with   were also identified in the original analysis by Zeller et al.
OTU (genus) | LASSO (mdn rel wt) | Boruta (Gini ind) | LEfSe (p value) | DAA (q value) | Fold change
Bacteroides fragilis [1090]  0.0069 0.03025 −0.89583
Fusobacterium nucleatum subsp. animalis [1481]  0.01915 2.71736 4.97e−08 # 7.58e−03 −0.87419
Peptostreptococcus stomatis [1530]  0.0182 2.74931 5.94e−05 0.00301 −0.86725
Clostridium symbiosum [1600]  0.01231 0.00073 # 0.01576 −0.80188
Parvimonas micra [1505] 3.04311 0.02234 −0.65777
Fusobacterium nucleatum subsp. vincentii [1482]  0.03107 3.29898 1.58e−04 −0.5973
Unclassified Parvimonas sp. [1507] 0.01161 0.00048 # 0.01219 −0.52338
Unclassified Parvimonas sp. [1506] 0.00071 # 0.01576 −0.51558
Fusobacterium nucleatum subsp. polymorphum [1480]  0.00964 0.00365 −0.47909
Porphyromonas asaccharolytica [1056]  0.01619 5.72e−06 # 0.00044 −0.40618
Fusobacterium nucleatum subsp. nucleatum [1479]  0.01349 0.00044 −0.35996
Unclassified Ruminococcaceae bacterium [u:1580] 1.9081 0.02058 −0.33357
Pseudoflavonifractor capillosus [1579] 0.02118 2.39158 1.26e−05 # 0.00076 −0.32359
Clostridium hylemonae [1607]  0.01719 0.02176 −0.31261
Prevotella nigrescens [1069] 0.00013 # 0.0049 −0.25569
Porphyromonas uenonis [1057] 0.00945 0.00845 −0.17249
Campylobacter rectus [1720] 0.00893 0.03124 −0.06126
Campylobacter gracilis [1724] 0.01175 0.03025 −0.06126
Leptotrichia hofstadii [1488] 0.01188 0.0381 0
Methanosphaera stadtmanae [90]  0.01664 0.03869 0.30127
Unclassified Ruminococcus [1620]  0.0101 0.02176 0.42972
Unclassified Ruminococcus [1621]  0.00047 0.01219 0.44361
Streptococcus salivarius [1377]  0.01566 3.63078 0.02176 0.51372
Eubacterium hallii [1597] 0.00087 0.01653 0.62221
Eubacterium rectale [1630]  1.90125 0.00048 0.01219 0.66217
Eubacterium ventriosum [1629]  0.0008 0.01622 0.90023
Table 4. Biomarker prediction results obtained on the Alzheimer’s disease (AD) dataset of Ling et al. [44]. In bold are the genera that are found increased in AD patients; in underline are the genera that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the genera that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Genera indicated with   were also identified in the original analysis by Ling et al.
OTU (genus) | LASSO (mdn rel wt) | Boruta (Gini ind) | LEfSe (p value) | DAA (q value) | Fold change
Akkermansia 1.15e−07 3.56e−08 −1.4734
Enterococcus 0.00859 5.7e−08 0.00011 −0.99303
Sellimonas 0.01926 3.01e−05 # 0.00038 −0.91614
Unclassified Clostridiales Family XIII. Incertae Sedis 0.00011 # 0.00116 −0.78301
Lysobacter 0.01745 0.00878 −0.74891
Eggerthella 0.00018 0.00146 −0.70743
Bifidobacterium 8.27538 § 0.00028 0.00225 −0.60623
Pelagibacterium 0.01945 3.46263 § 0.01134 −0.60251
Erysipelatoclostridium 0.01306 3.18431 § 0.02804 −0.28292
Eubacterium −0.02397 0.00013 0.00116 0.41253
Acetivibrio 4.83916 § 0.03658 0.41552
Anaerotaenia 0.00041 # 0.00321 0.59836
Murimonas 0.0005 # 0.00368 0.66313
Parasporobacterium 0.00058 # 0.0041 0.70144
Unclassified Clostridiales −0.00849 0.02175 0.70757
Lutispora 0.00013 # 0.00116 0.72373
Denitrobacterium −0.02897 0.00067 # 0.00459 0.73651
Intestinibacter −0.00945 0.01244 0.75557
Anaerobium −0.01814 0.00012 # 0.00116 0.80947
Coprococcus 8.83e−09 0.00105 0.83777
Lachnoanaerobaculum 9.52e−06 # 0.00015 0.87453
Falcatimonas 1.79e−06 # 3.99e−09 0.89331
Haemophilus 9.02e−06 # 0.00015 0.8996
Pseudobutyrivibrio −0.01143 2.16e−05 # 0.0003 0.91829
Dialister −0.00994 0.01134 0.96135
Butyricicoccus −0.01305 0.00011 # 0.00116 0.98736
Unclassified Lachnospiraceae −0.00981 2.63e−06 1.25e−08 1.00242
Roseburia 4.66922 § 1.09e−06 9.78e−07 1.00381
Romboutsia −0.01967 1.57e−09 0.00023 1.05482
Fusicatenibacter −0.00950 1.2e−06 3.56e−08 1.07552
Butyrivibrio −0.12439 2.81e−08 # 1.25e−08 1.08997
Lachnospira −0.00903 6.26e−07 # 1.6e−09 1.09979
Faecalibacterium −0.01020 3.94238 § 7.89e−03 1.41e−07 1.13233
Table 5. Biomarker prediction results obtained on the Parkinson’s disease (PD) dataset of Qian et al. [48]. In bold are the species that are found increased in PD patients; in underline are the species that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the species that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Biomarkers indicated with ↑ were found increased in other (independent) case-control studies included in the review by Tan et al. [49]. Only the biomarker Gordonibacter pamelaeae agreed with the original analysis by Qian et al., whereas Megasphaera elsdenii and Akkermansia muciniphila were also identified in one or more other (independent) case-control studies. Of note, none of the 12 species predicted by Qian and collaborators could be confirmed by any of the 30 studies reviewed by Tan et al.
OTU (species) | LASSO (mdn rel wt) | Boruta (Gini ind) | LEfSe (LDA) | LEfSe (p value) | DAA (q value) | Fold change | Qian [48] | Bedarf [50] | Baldini [51] | Cirstea [52] | Tan [49]
Megasphaera elsdenii 3.64077 2.96e−08 0.00385 −0.99623
Ligilactobacillus salivarius 0.01488 § 2.6761 # 0.00036 # 0.02659 −0.89106
Akkermansia muciniphila 3.40865 0.00034 0.02659 −0.80375
Gordonibacter pamelaeae 0.01329 § 0.02145 −0.72737
Gordonibacter urolithinfaciens 0.0151 § 0.0177 −0.68152
Vescimonas coprocola 3.72433 0.0001 0.03147 −0.65045
Acidaminococcus massiliensis 3.06366 0.00027 0.026354 −0.64975
Megasphaera hexanoica 0.01164 § 2.24826 # 0.00016 # 0.0237 −0.62302
Streptococcus gordonii 0.01633 § 0.02145 −0.578
Intestinimonas butyriciproducens 2.98681 # 0.00041 # 0.02659 −0.57421
Slackia isoflavoniconvertens 2.16649 # 0.00039 # 0.02659 −0.5664
Bittarella massiliensis 0.01555 § 2.00191 # 0.00029 # 0.02635 −0.51399
Pusillimonas faecalis 2.53823 # 0.00065 # 0.02966 −0.48288
Fervidicella metallireducens 0.01514 § 0.02966 −0.47635
Angelakisella massiliensis 2.57424 # 0.00017 # 0.0237 −0.4655
Pseudoflavonifractor gallinarum 2.2313 # 0.00022 # 0.02371 −0.42042
Amycolatopsis thailandensis 0.01469 § 0.02371 −0.39584
Clostridium amylolyticum 0.01497 § 0.0237 −0.38074
Pseudoflavonifractor phocaeensis 2.16752 # 0.00081 # 0.02966 −0.37963
Pseudoflavonifractor capillosus 2.39011 # 0.00098 # 0.03147 −0.37271
Paenibacillus mucilaginosus 0.01663 § 0.02145 −0.3601
Desulfovibrio legallii 2.82445 # 0.00041 # 0.02659 −0.30997

2.2. Diversity Analysis

To determine alterations in microbiota diversity between healthy and disease subjects, we performed alpha and beta diversity analyses. Alpha diversity is a local measure that refers to the average species diversity in a specific habitat or area. We measured alpha diversity as the observed richness (number of taxa) or evenness (relative abundances of those taxa) of an average sample within a habitat type. In total, we included three alpha diversity metrics: observed OTUs, Shannon index [53], and Simpson index [54].
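For reference, the three alpha diversity metrics can be computed from a vector of per-taxon counts as in the following Python sketch. The function names and example counts are hypothetical, and the Simpson index is expressed here as 1 − Σp², which is one of several conventions in use and is an assumption on our part.

```python
import math


def observed_richness(counts):
    """Number of taxa with a nonzero count."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson diversity expressed as 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Two toy samples with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
```

Both toy samples contain four taxa, but the evenly distributed sample scores higher on the Shannon and Simpson indices, which is why richness and evenness are reported separately.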

Furthermore, we quantified beta diversity, which is defined as the variability in community composition (i.e., the identity of taxa observed) among samples within a habitat. To assess beta diversity, we performed a redundancy analysis (also known as principal components analysis of instrumental variables), a statistical technique designed to relate two sets of variables, where one set is dependent on the other. The aim of redundancy analysis is to maximize the explained variance of the dependent variables through a linear combination of explanatory variables. The principal components of a collection of points in a real coordinate space are defined as a sequence of p unit vectors, wherein the ith vector represents the direction of a line that best fits the data while being orthogonal to the first i − 1 vectors. In our analysis, the best-fitting line is the one that minimizes the average squared distance from the points to the line. These principal component directions form an orthonormal basis in which the individual dimensions of the data are linearly uncorrelated.
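The notion of a best-fitting direction can be made concrete with a small Python sketch that finds the first principal component by power iteration on the covariance matrix. It illustrates the unconstrained principal component step only, not the full redundancy analysis with explanatory variables; the function name and data points are hypothetical.

```python
import math


def first_principal_component(points, iterations=200):
    """Unit vector along the direction of maximal variance (power iteration)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Toy points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
pc1 = first_principal_component(points)
```

For these points the recovered direction has a slope close to 2, matching the line around which the data are scattered.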

2.3. Method Evaluation

2.3.1. Datasets Used for Evaluation

To evaluate our KMB consensus approach and to compare its performance with individual methods, four datasets were used. An overview of these datasets, their sources, and the results from the original studies is given in Table 1. Further details about each dataset are provided below.

2.3.1.1. CRC Datasets

The first dataset analyzed was obtained from a study by Kostic et al. [46], which investigated microbiota changes associated with the development of CRC. Using a combination of whole genome sequencing, 16S rRNA sequencing, and quantitative PCR data, the authors found that the genus Fusobacterium (most likely the species F. nucleatum) was enriched in tumor samples, whereas the relative abundance of the phyla Bacteroidetes and Firmicutes was reduced.

For the evaluation of our KMB framework, we reanalyzed the 16S rRNA-based taxonomic profiles obtained from the original study, which included N = 185 subjects with assigned diagnostic attributes (95 CRC patients and 90 controls). This dataset represents one of the largest 16S rRNA gene datasets available for CRC. Although the authors also performed shotgun metagenome sequencing, that analysis was limited to nine tumor/control pairs. We therefore decided not to include the shotgun dataset in our KMB evaluation; instead, we utilized the taxonomic abundance profiles generated by Zeller et al. [28], who performed shotgun metagenome sequencing on a substantially larger cohort of N = 141 subjects (53 CRC patients and 88 controls), all resident in France. The CRC group comprised patients with AJCC stage I-IV tumors, while the control group comprised 61 healthy individuals and 27 individuals with small adenomas (< 1 cm).

In line with the findings of Kostic et al., Zeller et al. also observed an increase in the phylum Fusobacteria in CRC patients, along with a depletion of the phylum Firmicutes. In contrast, Zeller et al. reported an enrichment of the phylum Bacteroidetes in tumor samples. At the species level, Zeller et al. identified 22 microbial markers collectively associated with CRC. Among these, two F. nucleatum subspecies (vincentii and animalis) were highlighted as the most important biomarkers promoting carcinogenesis. However, the inclusion of two additional species, Porphyromonas asaccharolytica and Peptostreptococcus stomatis, was necessary to create a signature capable of differentiating between health and disease. Neither Kostic et al. nor Zeller et al. found significant differences in microbial diversity between control and tumor samples. To evaluate the Kostic et al. dataset, we retrieved the taxonomic profiles from the R package Phyloseq (available at https://bioconductor.org/packages/release/bioc/html/phyloseq.html [55]). To analyze the Zeller et al. dataset, we retrieved the taxonomic profiles from the Supporting Information of the original publication [28].

2.3.1.2. AD Dataset

The third dataset used was obtained from a study by Ling et al. [44], which investigated the alterations in gut microbiota in a cohort of Chinese AD patients with N = 171 subjects (100 AD patients and 71 controls). To date, this remains one of the largest gut-microbiome cohorts available for AD. Using a targeted sequencing approach of the 16S rRNA gene V3-V4 region, the authors identified 24 bacterial genera that differed significantly between the two groups. In particular, the study reported a decreased presence of several butyrate-producing genera, such as Faecalibacterium, Roseburia, Coprococcus, and Butyricicoccus, in the feces of AD patients. Concurrently, a significant increase in lactic acid-producing bacteria, such as Bifidobacterium and Enterococcus, and propionate-producing bacteria, such as Akkermansia, was observed in the stool samples of AD patients. At the phylum level, Ling et al. observed that fecal samples from AD patients were enriched for the phyla Actinobacteria and Verrucomicrobia, whereas the phylum Firmicutes was depleted. In addition, the study noted a strong reduction in the overall bacterial diversity of AD-associated fecal microbiota.

To evaluate this dataset with our KMB consensus prediction framework, we retrieved the Illumina MiSeq sequence data of the study from the Sequence Read Archive (SRA accession SRP262626). The paired-end reads were subsampled to 10 Mbp per sample and merged into pseudoreads (based on overlapping nucleotides) using USEARCH Version 9.2 [56]. The resulting pseudoreads were clustered for uniqueness, filtered for chimera sequences, and taxonomically classified with SNAP (Version 1.0.23) through alignment against the RDP database (Version 11.5) [57]. Pseudoreads were clustered into OTUs at 97% similarity using the USEARCH “cluster_otus” function. Reads that could not be classified were discarded and excluded from further analysis.

2.3.1.3. PD Dataset

The fourth evaluated dataset was generated by Qian et al. [48], who investigated the association of gut microbiota with PD in a cohort of N = 80 subjects (40 PD patients and 40 controls). Using a shotgun metagenomic sequencing approach, the authors compared the two groups from both taxonomic and functional perspectives. From the phylogenetic comparative analysis performed using MetaPhlAn [58], Qian et al. found significant differences at the phylum level for Synergistetes, Verrucomicrobia, and “viruses with no name”, as well as at the genus level for Alistipes, Holdemania, Streptococcus, Granulicatella, Gordonibacter, Lactobacillus, and Enterobacter (all of which were increased in PD). In addition, 13 species, 10 of which belonged to the phylum Firmicutes, were found to be enriched in PD patients. The authors further performed metagenomic species (MGS) identification following the method presented by Nielsen et al. [59] and clustered 174,964 genes that were significantly differentially abundant between the controls and PD patients. Of the identified 153 MGS, 40 could be assigned to a specific genus, particularly Bacteroides (14 MGS) and Alistipes (5 MGS). Interestingly, while most of the identified MGS enriched in the healthy controls were associated with Bacteroides (at the genus and species level), 5 of the 14 Bacteroides genus MGS were increased in PD patients. Moreover, of the 5 Bacteroides species MGS, 4 were increased in healthy controls, and 1 was elevated in PD patients. This observation aligns with the general consensus that taxonomic identification at the genus or higher level resolution is often insufficient to explain microbiota dysbiosis, emphasizing the need for analysis at the strain or, at least, species level resolution [60].

To further evaluate the findings from our KMB framework, we included the results of a comprehensive literature review by Tan et al. [49] of microbiome studies in PD. The authors screened a total of 32 case-control studies (including the study of Qian et al.), comprising 2 shotgun metagenomics sequencing, 28 16S rRNA gene sequencing, and 2 quantitative RT-PCR studies. Across all the studies, Tan et al. identified a total of 102 genera and 44 species potentially associated with PD dysbiosis. However, the overall consistency between these studies was low, with only about one-quarter of the results being replicable. At the genus level, contradictory results were also observed, with some studies reporting certain biomarkers as increased in PD while others found the same biomarkers enriched in healthy controls. Nevertheless, the genera Akkermansia and Bifidobacterium, as well as the species Akkermansia muciniphila, were most consistently predicted as potential biomarkers, showing an increased abundance in PD. In healthy subjects, multiple studies reported an increase of the genera Faecalibacterium and Blautia and the species Akkermansia muciniphila.

Concerning alpha diversity, Qian et al. observed a strong increase in diversity in the gut microbiome of PD patients compared to healthy controls. However, this observation was confirmed by only 9 of the 24 studies reviewed by Tan et al. The majority of the other studies (N = 14) did not report significant differences between groups, while one study reported a decrease in alpha diversity in PD patients. All studies reviewed by Tan et al., including the study by Qian et al., consistently observed significantly different beta diversity between PD patients and healthy controls.

For our evaluation, we downloaded the Illumina HiSeq X Ten data from the Data Repository for Human Gut Microbiota (Project PRJNA433459). The paired-end sequences were subsampled to 5 Gbp per sample and taxonomically classified using Kraken2 (Version 2.1.1 [61]) and Bracken (Version 2.6.0 [62]) to estimate the species-level abundances. Our reference sequences consisted of bacterial, fungal, archaeal, and viral genomes, which were retrieved from the NCBI RefSeq database (release 209; downloaded February 22, 2022). The resulting taxonomic profiles were used as input for the KMB analysis framework.

2.3.2. Performance Assessment

The individual models and the consensus-based KMB model were evaluated based on their ability to accurately predict biomarkers (or biomarker signatures) identified in the original studies and corroborated by other literature. Additionally, we evaluated the statistical significance associated with the predictions using the metrics employed by the individual methods.

For the DAA with SIAMCAT, the significance of the log fold changes was assessed using p values and q values (or adjusted p values). These values were derived from two-tailed Z-tests, as described by Cardoso et al. [63] and Wirbel et al. [27]. A significance threshold of p ≤ 0.05 and q ≤ 0.05 was applied. For the LR method (which includes LASSO and ridge regression), we used the median relative feature weights as a performance metric. For the RF model (included in Boruta), we used Gini impurity scores to assess the feature performance. Note that there is no fixed threshold or acceptance criterion for Gini impurity scores; hence, features were ranked in decreasing order of Gini impurity score. For the LEfSe method, we used LDA scores to rank the importance of taxa (features) in discriminating between groups, and p values to assess the significance. Note that we used the default settings for the p value cut-off (≤ 0.05) and log LDA score (≥ 2.0).
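The joint p and q thresholding described above can be sketched as follows. This is a minimal illustration, not the framework's code: the text does not state which multiple-testing correction produced the q values, so the Benjamini-Hochberg procedure is assumed here, and the function names and example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order.

    Assumption: BH is used for illustration; the source does not name the
    correction method behind its q values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def significant_features(p_by_feature, alpha=0.05):
    """Keep features with both p <= alpha and q <= alpha."""
    names = list(p_by_feature)
    p = [p_by_feature[name] for name in names]
    q = benjamini_hochberg(p)
    return {
        name: (pv, qv)
        for name, pv, qv in zip(names, p, q)
        if pv <= alpha and qv <= alpha
    }


# Hypothetical p values for four placeholder taxa (not taken from the study).
example = {"taxon_a": 0.01, "taxon_b": 0.04, "taxon_c": 0.20, "taxon_d": 0.005}
print(significant_features(example))
```

Requiring both thresholds means a feature with a small raw p value is still dropped if it does not survive the correction for the number of taxa tested.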

The overall performance characteristics of the LR (LASSO/elastic net) and RF (Boruta) models were further assessed using the AU-ROC and the area under the precision-recall curve (AU-PRC) plots. ROC curves were obtained by plotting the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, equal to 1 − specificity). While ROC is typically used for plotting the diagnostic ability of a binary classifier, it can also be applied to a multiclass classification problem, as found in biological datasets, with the help of pairwise comparison analysis. As there is no generic optimal AU-ROC threshold value, we used 0.75 as a cut-off, which lies midway between 0.5 (random performance of the model) and 1 (perfect model performance). Precision–recall (PR) curves were obtained by plotting the precision (calculated by dividing the number of true positive predictions by the total number of positive predictions) against the recall (calculated by dividing the number of true positive predictions by the total number of actual positives, i.e., true positives plus false negatives). Similarly, for PR curves there is no generic optimal threshold value; however, the closer the curve is to the upper right corner, the better the model’s performance.
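To make the ROC and precision-recall definitions concrete, the following sketch computes them from scratch for a binary classifier. It is an illustration under stated assumptions rather than the framework's implementation: the function names, labels, and scores are hypothetical, labels are coded 1 (disease) and 0 (control), and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Samples with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        # TPR = TP / (TP + FN); FPR = FP / (FP + TN).
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and classifier scores for six samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
au_roc = trapezoid_area(roc_points(labels, scores))
print(f"AU-ROC = {au_roc:.3f}; passes 0.75 cut-off: {au_roc >= 0.75}")
print(precision_recall(labels, scores, threshold=0.5))
```

In this toy example, 8 of the 9 disease/control pairs are ranked correctly, which is exactly what the area under the ROC curve measures.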

2.3.3. Confounding Effects

All datasets included in this study were analyzed for potential confounding factors (e.g., patient and other metadata) by the authors of the original studies. To summarize, Kostic et al. [46] observed a minor correlation between patient geographic location and microbial diversity, likely due to differences in sample collection methods. An association between higher tumor grades and increased microbial diversity was also noted. Zeller et al. [28] indicated that patient age differed significantly between control and CRC cases. Ling et al. [44] analyzed different clinical indicators and found that the Mini-Mental State Examination (MMSE), Wechsler Adult Intelligence Scale (WAIS), and Barthel scores were significantly lower in AD patients compared to controls. Finally, Qian et al. [48] found that only PD status influenced the PD Index in their cohort.

Since the potential effect of confounding factors was thoroughly examined by the original authors, no additional confounder analysis was performed in this study.

3. Results and Discussion

For our evaluations, we used published 16S rRNA gene and metagenomics shotgun datasets from four case-control studies investigating associations between microbiome and CRC [28, 46], AD [44], and PD [48], as described in the Materials and Methods section.

For all datasets, we first assessed the relative microbial abundances in the disease and control samples and subsequently compared the microbial richness between the two groups using different alpha diversity metrics (observed OTUs, Shannon index, and Simpson index). Then, we assessed the performance of our consensus-based KMB approach and the individual methods (LR-LASSO/elastic net, RF-Boruta, LEfSe, and DAA) by comparing their biomarker predictions with the original study results and other literature sources. The evaluation parameters include significance thresholds (p value and the adjusted p value or q value), feature importance scores (median relative weight, Gini impurity, and LDA scores), and model accuracy metrics (AU-ROC and AU-PRC). Evaluations of the 16S rRNA gene datasets were performed at genus level, whereas evaluations of the shotgun datasets were performed at species and strain level.

3.1. Evaluations of the CRC 16S rRNA Gene Dataset

3.1.1. Abundance and Diversity Analysis

In line with the findings of Kostic et al. [46], the phyla Bacteroidetes and Firmicutes were significantly more abundant in healthy control samples, whereas the relative abundance of the phylum Fusobacteria was enriched in tumor samples (Figure 2a). Next, we assessed microbial richness in healthy controls and CRC patients using three alpha diversity indices: observed OTUs, Shannon, and Simpson. While Kostic et al. did not find significant differences using cumulative OTU counts and the Chao1 diversity index, our evaluation consistently indicated a significant reduction (p value ≤ 0.05) in microbial diversity in CRC patient samples across all three indices (Figure 2b). Importantly, two other major studies that investigated the association between CRC and gut microbiome [64, 65] also reported decreased bacterial diversity in tumor samples, using the Shannon diversity and evenness indices.
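For reference, the three alpha diversity indices named above can be computed from a vector of per-taxon counts as in the sketch below. This is illustrative only: the function names and counts are hypothetical, and the Simpson index is assumed to be reported in its Gini-Simpson form (1 − Σ pᵢ²), which may differ from the variant used in the analysis.

```python
import math


def observed_richness(counts):
    """Number of taxa (e.g., OTUs) with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2) (assumed variant)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical count vectors: an even community and a dominated one.
even_sample = [10, 10, 10, 10]
dominated_sample = [37, 1, 1, 1, 0]
for name, sample in [("even", even_sample), ("dominated", dominated_sample)]:
    print(
        name,
        observed_richness(sample),
        round(shannon_index(sample), 3),
        round(simpson_index(sample), 3),
    )
```

Both toy samples contain four observed taxa, yet the dominated sample scores lower on Shannon and Simpson, which is why richness and evenness-weighted indices can disagree on the same dataset.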

Abundance and diversity analysis of the colorectal cancer (CRC) dataset of Kostic et al. [46]. (a) Abundance of the 10 most present phyla in control and disease samples. In line with the findings of Kostic and collaborators, the relative abundance of Bacteroidetes and Firmicutes was enriched in the control samples, whereas the abundance of Fusobacterium was enriched in the disease samples. (b) Alpha diversity measure using three different indices: observed OTUs, Shannon, and Simpson. All indices indicated a significant reduction (p value ≤ 0.05) of the diversity in the disease samples, which is in contrast with the original findings by the authors but in agreement with several other independent case-control studies.

3.1.2. Performance of the KMB Framework

We further performed biomarker predictions on the Kostic et al. dataset using our KMB framework. In total, 13 genera were detected by KMB, 10 of which were also identified by Kostic et al. using LEfSe for their analysis (LDA significance threshold of 4.2) (Table 2). Notably, our KMB framework confirmed the enriched presence of Fusobacterium in CRC-associated metagenomes, a finding validated by quantitative PCR and metagenomic shotgun analyses and corroborated by subsequent studies (particularly involving Fusobacterium nucleatum). This genus ranked among the top three biomarkers in our analysis (LEfSe p = 5.57e-05; DAA q = 0.00248). Other genera identified by both KMB and Kostic et al. and frequently associated with CRC in the literature include Faecalibacterium, Bacteroides, and Alistipes. Our analysis also predicted a significant enrichment of Eubacterium and Campylobacter, which were not reported by Kostic et al. Both genera had significant predictions via LEfSe (p < 0.01) and DAA (q ≤ 0.05) and at least one additional method (LASSO or Boruta). The role of Eubacterium in the proliferation of CRC has drawn considerable attention due to its production of butyrate, a short-chain fatty acid (SCFA) shown to slow tumor progression and to have novel therapeutic effects [66]. In line with our results, Zhang et al. [67] reported decreased levels of Eubacterium (p = 0.009) in the gut microbiota of healthy subjects. Specific species such as E. hallii [65] and E. rectale (p ≤ 0.05) [68] have been identified as potential CRC “driver” bacteria. Recent in vitro and in vivo studies by Ryu et al. [69] demonstrated anti-CRC effects of E. callanderi KGMB02377, which also contains pathway genes for γ-aminobutyric acid (GABA) synthesis, thus potentially explaining the inhibition of CRC progression. Similarly, studies have linked an increased abundance of Campylobacter spp. to CRC proliferation, particularly the oral pathogenic strain C. jejuni in a mouse study [70]. Importantly, C. jejuni produces cytolethal distending toxin (CDT), which has been shown to promote intestinal inflammation.

3.1.3. Performance of the Individual Biomarker Prediction Methods

At the method level, DAA and LEfSe demonstrated a strong and consistent performance, agreeing on the top nine biomarkers, albeit with varying order of significance. By contrast, LASSO and Boruta predicted fewer biomarkers (four and six, respectively). Notably, the LASSO model fell below the AU-ROC threshold (AU-ROC = 0.68 < 0.75), potentially leading to incorrect predictions. For instance, LASSO detected an increased abundance of an unclassified Clostridiales genus in tumor samples, whereas both LEfSe and DAA detected its enrichment in healthy controls. Boruta achieved higher accuracy (AU-ROC = 0.89 ≥ 0.75), but its top biomarkers (based on Gini index values) were less significant according to LEfSe and DAA (based on p and q values, respectively).

3.2. Evaluations of the CRC Shotgun Metagenomics Dataset

3.2.1. Abundance and Diversity Analysis

In line with the original study by Zeller et al. [28], our KMB analysis confirmed an enrichment of the phyla Fusobacteriota (synonym Fusobacteria), Pseudomonadota (synonym Proteobacteria), and Bacteroidota (synonym Bacteroidetes), alongside a depletion of the phyla Actinomycetota (synonym Actinobacteria) and Bacillota (synonym Firmicutes) in CRC samples (Supporting Information 1: Figure S1A). These results partially align with Kostic et al., who reported an enrichment of Fusobacterium and depletion of Firmicutes in CRC samples but observed a depletion of the phylum Bacteroidetes, which contrasts with the findings of Zeller et al. and our own. Alpha diversity analysis using the Shannon and Simpson indices (Supporting Information 1: Figure S1B) revealed a slight but nonsignificant decrease (p > 0.05) in community diversity in tumor samples, which corresponds with our previous analysis of 16S rRNA taxonomic profiles. However, the observed OTUs index showed a small but significant (p ≤ 0.05) increase in richness in tumor samples.

3.2.2. Performance of the KMB Framework

Next, we evaluated the performance of the KMB framework and its individual methods against the original biomarker predictions by Zeller et al. Our approach predicted a total of 26 bacterial species as differentially abundant between healthy and tumor groups (see Table 3). Of these, 15 species were also part of the consensus signature proposed by Zeller et al., which consisted of 22 marker species collectively associated with CRC. Notably, the KMB framework accurately predicted the enriched presence of Fusobacterium nucleatum in tumor samples. The role of this periodontal pathogen in CRC progression has been extensively studied in recent years, leading to improved insights into its functional mechanisms. For instance, Zhu et al. [71] recently performed experiments in mice and human subjects, demonstrating the increased ability of F. nucleatum to invade tumor cells and to bind to a specific tumor-expressed protein, DHX15. These findings further support the hypothesis of a distinct interaction between host genotype and gut microbiome in CRC. Furthermore, several bacterial species consistently associated with CRC tumors were confirmed by our KMB framework and are supported by epidemiological evidence as well as mouse model experiments (see [72] for an overview). Among these species were Porphyromonas asaccharolytica [73], Peptostreptococcus stomatis [74], and Bacteroides fragilis [75], all of which were predicted by KMB and were included in Zeller et al.’s consensus signature. Conversely, Parvimonas micra, known to promote CRC tumorigenesis by upregulating cell-associated cytokines [76], was predicted by KMB among the biomarkers but was not part of Zeller et al.’s consensus signature. However, Zeller et al. did report a significantly increased abundance of P. micra in CRC patients compared to controls (p = 2.06e-02).
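The agreement between a predicted biomarker list and a reference signature can be summarized with simple set arithmetic, as sketched below. The function name and the placeholder taxon labels are hypothetical; only the set sizes (26 predicted species, 22 reference species, 15 shared) are taken from the text above.

```python
def overlap_summary(predicted, reference):
    """Summarize agreement between two non-empty collections of taxa."""
    predicted, reference = set(predicted), set(reference)
    shared = predicted & reference
    return {
        "shared": len(shared),
        "fraction_of_predicted": len(shared) / len(predicted),
        "fraction_of_reference": len(shared) / len(reference),
        "jaccard": len(shared) / len(predicted | reference),
    }


# Placeholder labels standing in for species names; only the counts
# (26 predicted, 22 reference, 15 shared) come from the text.
shared = {f"shared_{i}" for i in range(15)}
kmb_predictions = shared | {f"kmb_only_{i}" for i in range(11)}
reference_signature = shared | {f"reference_only_{i}" for i in range(7)}

summary = overlap_summary(kmb_predictions, reference_signature)
for key, value in summary.items():
    print(key, round(value, 3))
```

With these counts, roughly 58% of the predictions fall within the reference signature and roughly 68% of the reference signature is recovered.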

3.2.3. Contributions of the Individual Biomarker Prediction Methods to the KMB Model

Regarding the contribution of individual biomarker prediction methods to the consensus model, all 26 predictions were supported by the DAA method. The LASSO method, however, demonstrated the highest overlap with Zeller et al.’s findings, with 12 out of 18 predictions aligning with their results. While this consistency with Zeller et al. is partly expected, given that they used LASSO as well (albeit with a different procedure), it is worth noting that the LASSO model in the KMB framework exceeded the set AU-ROC threshold value (AU-ROC = 0.81 ≥ 0.75). The model’s high performance is also reflected by the high AU-PRC score (Figure 3). These results highlight LASSO’s significant contribution to our consensus approach on the shotgun-based taxonomic profiles. Notably, LASSO and DAA were the only methods to predict the enriched presence of F. nucleatum and B. fragilis in CRC samples. Additionally, both methods identified significant shifts in four other species (Porphyromonas uenonis, Campylobacter rectus, Campylobacter gracilis, and Leptotrichia hofstadii) that were neither reported by the Boruta and LEfSe methods nor part of the consensus signature of Zeller et al. However, Zeller et al. reported significant differences in abundance (p < 0.01) of L. hofstadii and C. rectus between CRC and healthy individuals.

Performance comparison of the LASSO and Boruta model on the colorectal cancer (CRC) dataset of Zeller et al. [28]. (a) AU-ROC plot indicating that both the LASSO and Boruta (random forest) models are above the set threshold (AU-ROC ≥ 0.75). (b) AU-PRC demonstrating a good model performance of both methods (in black Boruta, in gray LASSO).

Interestingly, P. uenonis was found to be significantly decreased in CRC in the recent study by Zhang et al. [77], which analyzed 705 fecal samples across six metagenomic sequencing cohorts with diverse geographical and ethnic backgrounds. Additionally, a meta-transcriptome study by Warren et al. [78] reported coenrichment of Fusobacterium, Leptotrichia, and Campylobacter spp. in CRC tissues. At the species level, the largest number of uniquely mapping sequences corresponded to F. nucleatum, L. hofstadii, and C. showae (phenotypically similar to C. rectus). The authors did not find unique sequence matches to C. jejuni, which was also absent in KMB’s predictions.

The LEfSe method demonstrated lower consistency with DAA on shotgun-based profiles compared to the 16S rRNA gene-based profiles of the Kostic et al. study. LEfSe detected only 12 biomarkers, 6 of which were found in Zeller et al.’s study. However, 5 of LEfSe’s top 6 predicted biomarkers (LDA value > 3) corresponded to Zeller et al.’s biomarkers. Similarly, the Boruta method contributed significantly, with 8 out of its Top 10 OTUs included in the final KMB selection, 5 of which were also found in Zeller et al.’s study. Unlike LASSO, Boruta accurately predicted the depletion of E. rectale and the enrichment of P. micra in CRC samples. Importantly, the accuracy of the Boruta model exceeded the set AU-ROC threshold value (AU-ROC = 0.91 ≥ 0.75), and the high AU-PRC indicates a strong model performance (Figure 3b). These findings highlight the robustness of the method ensemble implemented in the KMB framework in accurately predicting microbial signatures associated with disease.

3.3. Evaluations of the AD 16S rRNA Gene Dataset

3.3.1. Abundance and Diversity Analysis

At the phylum level, our taxonomic classifications (abundances) mirrored those reported by Ling et al. [44]. Of the 10 phyla classified by the authors (using a minimum sequence occurrence threshold of > 0.005%), 8 were consistent with the Top 10 most abundant phyla classified in our analysis (Supporting Information 2: Figure S2A). However, direct comparison of exact percentages was not possible due to the absence of numerical data in the Ling et al. study. The most abundant phyla were Firmicutes, Proteobacteria, Actinobacteria, Bacteroidetes, and Verrucomicrobia, which aligns with earlier findings by Zhuang et al. [79].

Our analysis additionally detected a significant presence of the phylum Euryarchaeota, which was not found in the Ling et al. study due to the removal of sequences identified as archaea. Similarly, the phylum Streptophyta was absent from the Top 10 list of the Ling et al. study, although our analysis reported its presence at very low abundance. Other phyla, such as Candidatus Saccharibacteria and Lentisphaerae, were classified as the 11th and 12th most abundant phyla in our study, whereas they were ranked 8th and 10th in the Ling et al. study. At the genus level, our classifications also largely corresponded with those reported by Ling et al., confirming the high inter-individual variation observed by the authors (Figure 4a). While the relative proportions of bacterial genera could not be directly compared (see above), 9 out of the Top 10 most abundant genera in our analysis were included in the Top 19 most abundant genera reported by Ling et al. Additionally, our analysis found a high relative abundance of the genus Gemmiger. This is likely attributable to the difference in the reference databases used: Ling et al. relied on the Greengenes database [80], while our classifications relied on the RDP database. The Greengenes database has not been updated since 2013, and recent research by Campos et al. [81] on chicken cecal microbiota composition highlighted a potential issue with the outdated Greengenes classification of Faecalibacterium, specifically of F. prausnitzii, which was reclassified as distinct from the closely related Gemmiger/Subdoligranulum cluster [82]. Consequently, Campos et al. reported that sequences classified as Faecalibacterium using Greengenes were instead classified as Gemmiger with the RDP database. Interestingly, Ling et al. classified Subdoligranulum among their Top 19 abundant genera but also identified the genus Gemmiger as a biomarker using LEfSe. This finding is unsurprising, though, as LEfSe relies on the RDP taxonomy. Overall, these results underscore the significant impact a reference database can have on analysis outcomes.

Diversity analysis of the Alzheimer’s disease (AD) dataset of Ling et al. [44]. (a) Relative abundance bar plot of the 10 most present genera in AD and Control samples. In line with the findings of Ling and collaborators, samples show substantial inter-individual variations. (b) RDA plot showing the beta diversity among the AD patients and control group based on the compositional distribution of the microbiota at genus level.

We also assessed species richness using three alpha diversity indices (observed OTUs, Shannon, and Simpson), which were also employed by Ling et al. In our analysis, all three indices indicated a reduction in bacterial diversity among AD patients (Supporting Information 2: Figure S2B), corroborating the results of Ling et al., who also reported higher diversity in control samples. However, none of the indices in our analysis was statistically significant (p > 0.05), whereas Ling et al. reported significant differences for all three indices (p ≤ 0.05). This discrepancy aligns with a recent review by Zhu et al. [83], which evaluated 14 AD case-control studies and found that microbial diversity changes were often not significant. To illustrate, eight studies using the Shannon index reported no significant differences between groups, while four studies reported a significant decrease in diversity among AD patients. The beta diversity analysis, in our study and in that of Ling et al., identified a significant variability in community composition between controls and AD patients (Figure 4b). This finding aligns with the broader consensus; 12 out of 14 studies in the Zhu et al. review also reported significant differences in beta diversity between these groups.

3.3.2. Performance of the KMB Framework: Enriched Microbial Genera in Healthy Controls

Subsequently, we compared the biomarker predictions of the KMB analysis framework against those of the original study. Consistent with the findings of Ling et al., a substantial number of biomarkers were associated with microbiota dysbiosis in AD (Table 4). At the genus level, KMB predicted 33 biomarkers. Of these, 11 were supported by LEfSe with LDA scores > 3, of which 9 overlapped with the predictions of Ling et al., who applied the same LDA threshold in their LEfSe analysis. Among the overlapping genera enriched in healthy controls were several butyrate-producing bacteria, such as Faecalibacterium, Roseburia, and Coprococcus, all members of the family Lachnospiraceae. Butyrate is an SCFA produced during the fermentation of fibers and resistant starch, and it has garnered increasing attention for its role in gut-brain axis communication [84, 85]. Specifically, SCFAs influence two major signaling pathways: binding to G protein-coupled receptors on enteroendocrine cells and inhibition of histone deacetylases. These pathways can directly or indirectly affect central nervous system functioning. Numerous animal and human studies have demonstrated a reduced abundance of SCFA-producing bacteria in AD patients [79, 86]. Butyrate in particular has been shown to promote gastrointestinal health and to reduce inflammation by inhibiting proinflammatory cytokines [87]. Ling et al. observed a negative correlation between these three butyrate-producing genera and the levels of proinflammatory cytokine TNF-α and chemokine IP-10. The KMB framework identified Fusicatenibacter and an unclassified Lachnospiraceae as biomarkers, supported by LEfSe (LDA scores > 3), LASSO, and DAA, though these were not detected in the original study. Notably, Fusicatenibacter and other members of Lachnospiraceae are also butyrate producers, with evidence of their depleted levels in AD patients reported across multiple studies. For instance, a meta-analysis by Hung et al. [88] encompassing 11 studies (N = 805; with 427 AD patients and 378 controls from United States and Chinese cohorts) and a study by Yildirim et al. [89] in a Turkish cohort (N = 127; with 47 AD patients, 27 mild cognitive impairment patients, and 51 nondemented controls) both observed a negative association between Fusicatenibacter and AD. A study of a Thai population by Wanapaisan et al. [90] reported an enriched presence of Lachnospiraceae (p = 0.001) and Fusicatenibacter (p = 0.0007) in healthy controls compared to AD and MCI patients, though the cohort size was relatively small (N = 52).

Furthermore, KMB identified additional butyrate-producing bacteria enriched in healthy controls, supported by LEfSe, albeit with lower LDA scores between 2 and 3. These include Lachnospira, Butyrivibrio, Pseudobutyrivibrio, and Anaerobium (all Lachnospiraceae) and Butyricicoccus (family Oscillospiraceae). While these genera were identified as significant biomarkers by DAA and LASSO, only Butyricicoccus was reported in the Ling et al. study. KMB also confirmed a significant depletion of Eubacterium in AD patients (supported by DAA, LASSO, and LEfSe with LDA > 3). Notably, a study by Cattaneo et al. [91] reported a lower abundance of the anti-inflammatory species Eubacterium rectale (p < 0.001) and a reduction of anti-inflammatory cytokine IL-10 in patients with cognitive impairment and brain amyloidosis. A more recent study by Haran et al. [92], with shotgun metagenomics analysis on a cohort of N = 76 (25 AD patients and 51 controls), similarly confirmed the decrease of key butyrate-producing bacteria (two members of Butyrivibrio and three members of Eubacterium, including E. rectale). Interestingly, in patients without dementia, the study observed an increase in butyrate-encoding enzyme genes, correlating with the induction of the anti-inflammatory P-glycoprotein pathway. However, the positive correlation involving the pro-inflammatory taxon Escherichia/Shigella reported by Cattaneo et al. could not be confirmed by the KMB analysis, nor by the studies of Ling et al. and Haran et al.

3.3.3. Performance of the KMB Framework: Enriched Microbial Genera in AD Patients

Regarding the identification of enriched microbial genera in AD patients, the KMB framework confirmed four genera that were identified as such by Ling et al. Two of these genera, Akkermansia and Bifidobacterium, are typically regarded as health-promoting microbes and are known butyrate-producers. The authors, however, associated the increased abundance of these genera in AD patients with their capacity to produce lactate and the SCFA propionate. The literature reports conflicting results on the association of the genus Bifidobacterium with the AD spectrum. For instance, Vogt et al. [10], in a study of N = 119 subjects (94 controls and 25 AD patients), identified Bifidobacterium as a driver of gut dysbiosis and found it significantly depleted in AD patients from the United States. Conversely, the meta-analysis by Hung et al., encompassing a larger combined cohort, found significantly increased levels of Bifidobacterium in AD patients (p < 0.001). As Ling et al. suggested, it is plausible that different Bifidobacterium species are involved, with different effects on AD pathology. Supporting this, Haran et al., in a shotgun metagenomics sequencing study, observed a decrease in two Bifidobacterium species (B. bifidum and B. longum) and an increase in Akkermansia muciniphila in AD patients. While these findings corroborate Ling et al.’s hypothesis, a substantially larger number of shotgun metagenomics samples would be required for definitive conclusions.

The KMB framework also confirmed the two other genera identified by Ling et al.: Enterococcus and Eggerthella. Both have been implicated by several studies as potential biomarkers for AD diagnosis. For example, Underly et al. [93] demonstrated in an in vitro experiment with rat cell cultures that Enterococcus faecalis could generate early neurofibrillary epitopes, leading to abnormal tau phosphorylation. Tau pathology and β-amyloid deposition are the two major hallmarks of AD. After infecting rat cortical neuron cell cultures with E. faecalis, the authors observed a strong increase in the reactivity of monoclonal antibodies Alz-50 and CP13, which specifically target tau phosphorylation. These findings align with earlier research linking oral health and AD, and with the potential role of E. faecalis in the development of chronic periodontal disease [94] and its capacity to translocate to the brain, potentially causing abscesses [95]. In contrast, a recent study by Hou et al. [96] involving a cohort of N = 77 (30 AD patients and 47 healthy controls) indicated a higher abundance of Enterococcus in controls (p ≤ 0.05), attributing this to the genus’s ability to produce SCFAs. While most Enterococcus species produce acetate as the main SCFA, butyrate production is species and strain specific. For instance, strain Enterococcus durans M4-5 can produce butyrate [97], whereas Enterococcus faecalis cannot. However, some E. faecalis strains can produce propionate in addition to acetate. Haran et al. did not report significant variations in Enterococcus species, but reported three SCFA-producing bacteria (Odoribacter splanchnicus, Eubacterium eligens, and Eubacterium rectale) as the most discriminative markers for AD. Importantly, O. splanchnicus primarily produces acetate and propionate, whereas the two Eubacterium species are primarily acetate producers and generally lack the ability to produce butyrate. 
The KMB framework also confirmed increased levels of Eggerthella lenta in AD patients, a finding supported by Balakrishnan et al. [47], who demonstrated that gavaging mice with E. lenta led to reduced fecal butyrate levels. These observations support Ling et al.’s hypothesis that gut dysbiosis in AD patients is characterized by a shift from butyrate producing to lactate and propionate-producing bacteria, potentially extending to acetate-producing bacteria. However, since the studies of Ling et al. and Hou et al., as well as most other AD gut microbiome studies, were based on 16S rRNA gene V3-V4 sequencing data, species level resolution remained unattainable. Finally, the KMB analysis also identified a significant increase in Erysipelatoclostridium in AD patients, a genus not detected by Ling et al. This genus has recently been proposed as a biomarker for AD by Xi et al. [98], though with contrasting findings (i.e., a significant increase in healthy controls). The genus is hypothesized to influence cytokine IFN-γ levels, potentially impacting AD pathogenesis. While there is limited evidence linking Erysipelatoclostridium dysbiosis to AD, some strains of species E. ramosum (previously Clostridium ramosum) are known to cause invasive infections, particularly in immunocompromised patients [99].

3.3.4. Summarizing Results on the AD Dataset

In summary, the KMB framework confirmed several biomarkers identified by Ling et al., demonstrating an analogous shift from butyrate producers to lactate, propionate as well as acetate producers in AD samples. The KMB framework also identified some interesting novel genera as potential biomarkers, many of which have been reported in other studies, albeit with contrasting findings. These discrepancies are likely to stem from species- and strain-specific characteristics in metabolite-production, underscoring the need to further investigate using shotgun metagenomics sequencing data and larger cohorts.

3.4. Evaluations of the PD Shotgun Metagenomics Dataset

3.4.1. Abundance and Diversity Analysis

At the phylum level, our abundance analysis confirmed the findings of Qian et al. [48], who identified Bacteroidetes, Firmicutes, Proteobacteria, and Actinobacteria as the most abundant bacterial phyla in both the control and disease groups (Supporting Information 3: Figure S3A). Moreover, healthy controls were characterized by a higher abundance of the phylum Bacteroidetes and lower abundances of the phyla Firmicutes, Proteobacteria, and Actinobacteria compared to the PD group, in line with the findings of Qian et al. Furthermore, we observed a high abundance of the viral phylum Uroviricota, with an increased level in the PD group. The original study by Qian et al. also identified an enrichment of viruses in PD patients, although the authors could not assign a more specific taxonomy and only reported an unclassified/no name phylum for viruses. This could be related to the fact that the authors used MetaPhlAn [58] to assign taxonomies based on a specific set of marker genes identified from whole genomes. Until recently, these genomes included only a limited number of viruses (approximately 3500 viral genomes for MetaPhlAn 2.0 and MetaPhlAn 3.0). In contrast, our analysis used Kraken2-Bracken software for taxonomic classification, combined with a more recent export of whole genome sequences from the NCBI RefSeq database including 11,562 viral genomes.

We then measured alpha diversity using the observed OTU richness, Shannon, and Simpson indices. All three indices indicated increased species diversity in PD samples, with significant p values for the observed OTUs and Shannon indices (p = 0.00586 and p = 0.00447, respectively, both ≤ 0.05; Supporting Information 3: Figure S3B). Additionally, Qian et al. found a significantly increased Shannon index in PD patients (p = 0.0084) based on a profile of 1,118,355 gut microbial genes. These findings are consistent with the current consensus in the field. In a recent review by Tan et al. [49], which compared the results of 30 independent case-control studies, 14 studies reported no significant difference, while nine studies observed an increase in the diversity in PD patients. Six other studies did not report on diversity measures, and one study indicated a decreased diversity in PD patients.
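As an illustration of the three alpha diversity measures named above, the following minimal sketch computes observed richness, the Shannon index, and the Simpson index from a single sample's taxon counts. It rests on assumptions not stated in this section: the Shannon index uses the natural logarithm, and the Simpson index is taken in its Gini-Simpson form (1 minus the sum of squared proportions). The function names are hypothetical, and the software actually used for the analysis may apply different variants.

```python
import math
from collections.abc import Sequence


def observed_richness(counts: Sequence[float]) -> int:
    """Number of taxa with a non-zero count in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon index H = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson_index(counts: Sequence[float]) -> float:
    """Gini-Simpson index 1 - sum(p_i^2); other Simpson variants exist."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Toy example: four equally abundant taxa (illustrative counts, not study data).
even_sample = [10, 10, 10, 10]
print(observed_richness(even_sample))
print(round(shannon_index(even_sample), 4))
print(simpson_index(even_sample))
```

In a case-control setting, these per-sample values would then be compared between groups with a suitable statistical test; the choice of test is outside the scope of this sketch.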

3.4.2. Performance of the KMB Framework: Comparison With the Original Study

Next, we predicted microbial markers using our KMB framework and compared the results to the original findings by Qian et al. In total, 22 species were identified by KMB, all of which were enriched in PD patients (Figure 4). Of these, 10 were supported by LASSO, although the model quality was just below the set AU-ROC threshold value (AU-ROC = 0.74 < 0.75). In contrast, the accuracy of the Boruta model was above the threshold value (AU-ROC = 0.89 ≥ 0.75), yet none of the 10 most contributing species predicted by Boruta was found to be significant by another method. As such, the Boruta markers were not included in the KMB list. Furthermore, 15 biomarkers were supported by significant LEfSe predictions, of which 4 had an LDA score > 3. Surprisingly, only one species identified by KMB (Gordonibacter pamelaeae) matched with the 12 biomarkers identified by Qian et al. Additionally, KMB identified a second species from the same genus, G. urolithinfaciens, as a significant biomarker. Both species were supported by DAA and LASSO, although the features of the latter model were just below the significance level. While the specific role of Gordonibacter in the development of PD is not well understood, some negative correlations with PD have been suggested based on the capacity of G. pamelaeae (strain DSM 19378T) and G. urolithinfaciens (strain DSM 27213T) to produce intermediary urolithins. Urolithins are anti-inflammatory molecules produced by specific gut microbes upon the intake of dietary polyphenols [100] and may possibly enhance gut-barrier integrity. Consequently, urolithin-producing metabotypes are generally considered health-promoting bacteria. Moreover, there is some evidence for a neuroprotective role of urolithins, particularly Urolithin A, though the mechanisms of action have mostly been shown in rodent-based research [101]. 
Nonetheless, Romo-Vaquero et al. [102], who recently investigated the association between urolithin metabotypes, gut dysbiosis, and disease severity in PD patients (N = 169, 52 PD and 117 HC), did not find a significant difference in Gordonibacter abundance between HC and PD. The genus was found to be decreased in Severe PD compared to Mild PD, although at low significance (Gordonibacter had the lowest LDA score of the genera enriched in PD). In contrast, Lubomski et al. [103] found decreased levels of Gordonibacter in PD patients (N = 128, 74 PD and 74 HC) 12 months after initiation of levodopa–carbidopa intestinal gel (LCIG) therapy. Although the number of LCIG patients was small (N = 10), this finding aligns with earlier studies that detected associations between gut dysbiosis in PD and the use of levodopa–carbidopa, the primary medication to treat PD. Importantly, Maini Rekdal et al. [104] and Kessel et al. [105] demonstrated that bacteria harboring tyrosine decarboxylase genes, particularly Lactobacillus and Enterococcus faecalis, decreased the efficacy of levodopa–carbidopa. It may, therefore, be hypothesized that species of the genus Gordonibacter could also interact with levodopa metabolism. In support of this hypothesis, Maini Rekdal and collaborators found that a specific strain from the family Eggerthellaceae, Eggerthella lenta A2, is involved in the reduced efficacy of levodopa–carbidopa. The authors showed that E. lenta A2 harbors a molybdenum-dependent dehydroxylase (Dadh) which catalyzes the conversion of dopamine to m-tyramine. Notably, Gordonibacter belongs to the same family, and, from a BLASTP search of close family relatives, Maini Rekdal et al. found a homologous dopamine dehydroxylase in two unclassified Gordonibacter strains, An232A and An230, with amino acid identities of 94% and 93%, respectively. However, the same study indicated that two other strains (Gordonibacter sp. 28C and Gordonibacter pamelaeae 3A) were unable to convert dopamine to m-tyramine. Clearly, deeper investigations are required to understand the association between decreased levels of Gordonibacter, PD, and levodopa–carbidopa administration. Nonetheless, current evidence seems to favor a possible role of molybdenum-dependent dehydroxylases, rather than a beneficial effect due to the capacity of Gordonibacter sp. to produce urolithin. Additionally, the KMB framework identified another member of the family Eggerthellaceae, Slackia isoflavoniconvertens, as significantly enriched in PD patients. This strain encodes several molybdopterin-oxidoreductases (UniProt accessions A0A3N0IGL9, A0A3N0IK48, and A0A3N0I7E1), which belong to the same family as Dadh (molybdopterin-containing oxidoreductases). Finally, Qian et al. found a negative correlation between Streptococcus salivarius and levodopa equivalent dose, along with an overall decreased presence in PD patients. This association could not, however, be reproduced by the KMB framework, and this species was not mentioned in the studies of Maini Rekdal et al. [104] and Kessel et al. [105].

3.4.3. Performance of the KMB Framework: Comparison With Other Shotgun Metagenomics Studies

Next, we extended our comparison to eight other studies included in the review by Tan et al. [49], which investigated dysbiosis at the species level. Of these, only the study by Bedarf et al. [50] used shotgun metagenomics sequencing, while the other seven studies [51, 52, 106–110] based their analysis on 16S rRNA gene data, albeit at a higher resolution. As shown in Table 5, the species Akkermansia muciniphila and Megasphaera elsdenii were confirmed by one or more other studies reviewed by Tan et al. [49]. Notably, these species exhibit low fold changes and are supported by both DAA and LEfSe within the KMB framework (with LDA scores > 3). The detected enrichment of Akkermansia muciniphila in PD is consistent with findings from four other biomarker studies [50–52, 110]. At the genus level, Akkermansia is also largely associated with PD, as illustrated by Tan et al. [49], who noted that 14 studies reported the same result. Although Qian et al. did not identify Akkermansia spp. (nor A. muciniphila) as a significant microbial biomarker based on MetaPhlAn taxonomic analysis, they did find two MGS clusters annotated to the genus Akkermansia and one specifically to the strain A. muciniphila ATCC BAA-835. In summary, Akkermansia spp. and particularly A. muciniphila are considered among the most important biomarkers for PD diagnosis. However, as previously noted, this finding is quite remarkable given that Akkermansia is generally considered a health-promoting bacterium. In particular, A. muciniphila promotes the integrity of the gut barrier (mucin layer) and is involved in immune response modulation [111, 112]. The beneficial effect of A. muciniphila has also been demonstrated in metabolic disorders such as obesity [113] and type 2 diabetes [114]. At the same time, increased abundance of Akkermansia spp. has been reported in several neurological diseases, including multiple sclerosis [115] and AD (see above). 
These findings suggest the potential existence of a common mechanism underlying the progression of various neurological diseases. To support this hypothesis, Duvallet et al. [13] showed that many genera are typically associated with multiple diseases, indicating that involvement of bacteria is often not disease specific. Although several studies link the increased presence of Akkermansia to a modified immune response (Zhai et al. [116]) and constipation (Vandeputte et al. [117]), more investigations are required to understand the exact role of the different Akkermansia (sub)species in the progression of PD and other neurological disorders. Moreover, caution must be taken when interpreting these results, as confounding factors such as drug treatment and stool frequency can introduce specific microbial signatures [118].

Megasphaera elsdenii, which was identified by KMB and Tan et al. [110] as a potential biomarker, is also a commensal human gut microbe, although typically found in low abundance. This species is known to ferment lactate into SCFAs, including butyrate (Shetty et al. [119]), and with this capability, the strain has been used as a probiotic to prevent ruminal acidosis [120]. Another important characteristic of M. elsdenii is its ability to produce high levels of hexanoic acid [121]. Notably, a recent study by Abdik and Çakır [122] predicted hexanoic acid as a candidate metabolite biomarker for PD. Specifically, the authors aligned 13 postmortem PD transcriptome datasets from the substantia nigra (the brain region most affected by PD) against the Human-GEM metabolism model using the popular TIMBR (transcriptionally inferred metabolic biomarker response) algorithm, originally developed by Blais et al. [123]. Based on a cohort of N = 263 (141 PD and 112 HC), the authors found increased production of hexanoic acid (hexanoylcarnitine) in more than 75% of the comparisons performed. In addition to M. elsdenii, the genus Megasphaera also includes other hexanoic-acid-producing species, such as M. hexanoica, which was also predicted as a biomarker by our KMB framework. The stronger association of M. elsdenii with PD compared to M. hexanoica, as indicated by the lower p and q values in LEfSe and DAA analyses, may be explained by the fact that M. elsdenii produces higher amounts of hexanoic acid and can utilize a broader variety of carbon sources (e.g., sucrose, glucose, maltose, and fructose) than M. hexanoica, which primarily ferments fructose. In contrast, a recent study by Ren et al. [108], included in the review by Tan et al. [49], identified increased levels of the species Megasphaera micronuciformis in healthy controls. However, the LDA score with LEfSe was notably low (close to 2). Moreover, M. micronuciformis is incapable of producing acids from carbon sources, nor can it ferment the nonconventional carbon source gluconate with the production of gas [124], which is otherwise a typical characteristic of Megasphaera species.

Of the 17 other species that were identified by the KMB framework, six belonged to the family Oscillospiraceae (Vescimonas coprocola, Bittarella massiliensis, and Angelakisella massiliensis and three Pseudoflavonifractor species: P. gallinarum, P. phocaeensis, and P. capillosus). All of these biomarkers were supported by significant DAA and LEfSe predictions. Notably, a meta-analysis conducted by Romano et al. [125] reanalyzed 10 available 16S rRNA gene-based gut microbiome datasets and found a significant increase of various Oscillospiraceae genera and species in PD patients. While relatively little is known about the role of these species in the human gut microbiome, partly due to the difficulty of cultivating these bacteria, substantial evidence suggests a link with low BMI and constipation [126], both key indicators in PD. The biomarker identified with the highest LEfSe LDA score was Vescimonas coprocola (LDA = 3.742), a species recently isolated from the human gut [127]. Unsurprisingly, the literature provides limited information about its possible associations with PD, although one possible link might be an increased plasmid abundance in PD patients [128]. The Pseudoflavonifractor genus was found to be elevated in PD patients undergoing LCIG therapy [125] based on a small cohort of N = 31 (21 PD and 10 HC). Despite the small sample size, this finding suggests a possible link between Pseudoflavonifractor spp. and levodopa metabolism, as previously observed for members of the family Eggerthellaceae. For example, the genome of P. capillosus ATCC 29799 encodes a molybdopterin oxidoreductase enzyme (UniProt accession A6NTH0), which shows a strong sequence similarity to Dadh and is classified as part of the molybdopterin-containing oxidoreductase family. Regarding the other species identified by the KMB framework, while the studies reviewed by Tan et al. [49] did not report specific evidence of their involvement in PD-related gut microbial dysbiosis, the families they belong to (Streptococcaceae, Clostridiaceae, Lactobacillaceae, and Desulfovibrionaceae) were found to be significantly increased in PD patients. An exception was Pusillimonas faecalis, for which an opposite trend was observed by Vascellari et al. [109], who reported depleted levels of the family Alcaligenaceae in PD patients. Of note, P. faecalis was very recently isolated from human feces [127]. For the remaining families, Pseudonocardiaceae, Paenibacillaceae, Acidaminococcaceae, Lactobacillaceae, and Eubacteriales incertae sedis, Tan et al. [49] found no literature evidence supporting decreased or increased abundances in PD patients compared to controls. However, as remarked earlier, gut dysbiosis is often characterized by changes in very specific species and/or strains. To better understand the contribution of individual species and strains and the functional mechanisms underlying disease progression, analysis of datasets with higher taxonomic resolution (e.g., from shotgun metagenomics) and of larger cohorts for statistical rigor is, therefore, recommended.

4. Conclusions

With this study, we introduced and assessed the performance of our KMB prediction framework. The prediction framework combines various ML and statistical methods into a unified strategy. Using four published microbiome datasets from case-control studies of CRC, AD and PD patients, we demonstrated that, in general, our findings were consistent with those of the original studies. These include microbial abundances, diversity, and global biomarker signatures associated with health and disease conditions. In instances where our results deviated from the original studies or indicated different significance levels, our findings are typically more in line with the broader consensus in the field. For example, while Kostic et al. did not observe significant differences in richness between healthy and CRC tumor groups, our analysis observed a reduced diversity in the CRC tumor group. This observation aligns with several studies on independent datasets, including the study by Zeller et al. [28], which was also evaluated here. At lower taxonomic levels, the KMB framework successfully confirmed the most critical microbial drivers of the observed dysbiosis, although it did not reproduce several other biomarkers predicted in the original studies. The level of agreement varied depending on the dataset; however, results were significantly more consistent when using the same taxonomic profiles as those employed in the original studies, compared to reclassification of raw sequence data with updated databases. For the two CRC datasets (where taxonomic profiles were used as input), the KMB framework confirmed most biomarkers found by the original authors. Importantly, the increased presence of Fusobacterium and its subspecies F. nucleatum in CRC samples was accurately detected. This genus and species are consistently associated with disease progression, although their functional mechanisms appear to be primarily strain- and clade-specific [129]. 
Additionally, we identified a few new biomarkers not reported in the original studies by Kostic et al. [46] and Zeller et al. For example, the genus Eubacterium (increased in controls) and the species Porphyromonas uenonis (increased in CRC patients) were identified with high significance, as also observed in several other studies.

On the 16S rRNA gene-based AD dataset from Ling et al. [44], for which we used the raw sequence data as a starting point, our KMB framework accurately predicted the previously observed shift from butyrate-producing bacteria in healthy control samples to lactate- and propionate-producing bacteria in the AD samples. Additionally, our framework identified an increased presence of acetate-producing bacteria in the AD samples, consistent with findings from other studies. At the genus level, increased levels of Akkermansia and Bifidobacterium, along with decreased levels of Faecalibacterium, were identified in AD patient samples. These three genera were the most significant hits in the original study. While Akkermansia and Bifidobacterium are generally considered health-promoting, their association with AD progression is linked to their predominant production of the SCFAs propionate and lactate. Conversely, the increased presence of Faecalibacterium in healthy controls might be related to its anti-inflammatory properties and high butyrate production. In addition, our framework predicted the increased presence of other butyrate producers in control samples, such as Fusicatenibacter and members of the family Lachnospiraceae. Although these were not included in the list of 24 biomarkers identified by Ling et al., they support the overall conclusion that healthy controls are characterized by a greater abundance of butyrate-producing bacteria.

The results from the PD dataset of Qian et al. [48] showed the least consistency with the original study. Only one association was reproducible. This low agreement is likely due to the different methodology: our study used Kraken2-Bracken and a more up-to-date and comprehensive database (RDP) for taxonomic classification of reads. Nonetheless, the increased presence of Gordonibacter pamelaeae in PD samples, found by both strategies, warrants particular attention as this species is generally considered a health-promoting microbe. Earlier studies have linked the (neuro)protective role of this bacterium to its capacity to produce urolithins, microbial metabolites derived from ellagic acid and ellagitannin [100, 101]. However, a recent study by Lubomski et al. [103] observed decreased levels of Gordonibacter spp. in PD patients treated with LCIG, suggesting a potential interaction between these bacteria and PD medication. This aligns with similar findings by Maini Rekdal et al. [104], who reported that another member of the family Eggerthellaceae (Eggerthella lenta A2) influenced medication efficacy through a molybdopterin-containing oxidoreductase. Notably, Gordonibacter pamelaeae and also Slackia isoflavoniconvertens, additional biomarkers identified by the KMB framework, encode a homologous protein with high sequence similarity. While most studies investigating the impact of PD medication on microbiota composition use stool samples for taxonomic analysis, it is worth noting that the small intestine is the primary site of drug absorption. For example, a recent rat study by van Kessel et al. [130] demonstrated a significant effect of PD medication on the microbiota composition and motility of the small intestine. 
Although in vivo studies are limited due to difficulty in accessing this region, in vitro models of the small intestine, such as the SHIME system [128] and SIFR technology [131], could be valuable for exploring the interactions between the microbiota, ingested medication, and other compounds such as probiotics [132].

Given the limited overlap between our findings and those of Qian et al., we compared the results of our approach with those reported in the review by Tan et al. [49], which encompassed eight studies reporting species-level biomarkers. This comparison confirmed two other species: Akkermansia muciniphila and Megasphaera elsdenii. In particular, the genus Akkermansia and the species A. muciniphila are strongly associated with PD. Moreover, Akkermansia spp. have also been shown to negatively impact other neurological disorders such as AD, as outlined above, suggesting a common mechanism of action. The increased presence of M. elsdenii in PD patients seems to be linked to the production of hexanoic acid, a characteristic shared by other Megasphaera species.

Other species-specific associations identified by the KMB framework could not be confirmed in the literature. However, this is unsurprising given the limited availability of shotgun metagenomics study data on PD patients. Notably, Tan et al.’s review included only two such studies, one of which was Qian et al.’s which we used as an evaluation dataset. Furthermore, in line with our findings, Tan et al. observed that in their review of 30 studies, only one-quarter of the reported biomarkers were reproducible across studies. Similarly, Chandra et al. [133] reviewed several studies linking the gut microbiome to AD and found limited consensus on the bacterial taxa alteration in AD patients. A major reason for this limited reproducibility, as noted by these authors, resides in relatively small patient cohorts used for investigations. Additional and larger shotgun metagenomics case-control studies are needed to detect specific microbial species and strains (including viruses and fungi) involved in health and disease. Such studies will support the understanding of the functional mechanisms underlying conditions. Our findings also suggest that bacteria generally considered health-promoting can be positively associated with disease states, indicating that beneficial properties observed in one condition do not necessarily apply universally, as mechanisms of action are often strain specific as underlined by Wallen et al. [134].

We here present a comprehensive consensus-based biomarker prediction framework, which reliably predicts biomarkers and is versatile in working with limited sample size and lower resolution (i.e., amplicon sequencing reads) datasets. From a methodological viewpoint, our consensus framework underscores the added value of combining DAA with ML algorithms. DAA provides an effective starting point for identification of significant hits, which can then be refined further through ML-based selection methods. Among the three ML methods evaluated, LEfSe makes the most substantial contribution to defining the final biomarker signature. LEfSe is currently the most widely used tool for microbiome marker discovery, as evidenced by over 14,000 citations (Google Scholar search, May 6, 2025). The tool combines LDA scores to estimate effect size of features (here: taxonomies) with p values to assess microbial features’ significance. Our analysis confirms the strength of the LDA-based method in predicting microbial biomarkers. For some datasets, however, LASSO and Boruta identified crucial biomarkers that LEfSe missed. This was especially evident in the two shotgun metagenomics datasets analyzed in this study. For example, in the CRC dataset from Zeller et al., LEfSe correctly identified only seven biomarkers, mostly with low LDA scores (< 3), while LASSO correctly identified 12 biomarkers, including all four F. nucleatum subspecies. Similarly, in Qian et al.’s PD dataset, LASSO uniquely detected the increased presence of Gordonibacter pamelaeae in PD patients (AU-ROC: 0.74, just below the threshold of 0.75). Meanwhile, Boruta showed particular utility in Kostic et al.’s dataset by accurately predicting two genera (Ruminococcus and Alistipes) that were overlooked by both LEfSe and LASSO. The RF model of Boruta overall outperforms the linear regression model of LASSO in terms of accuracy and sensitivity, as highlighted by the AU-ROC and AU-PRC plots. 
Boruta was able to create an accurate model (AU-ROC ≥ 0.75) for three of the four datasets analyzed, compared with two for LASSO. However, Boruta’s predictions were overall less consistent with other methods and the original studies. This limited predictive capacity may be due to overfitting of the model, which can be explained by two factors: small sample size and class imbalance. Imbalanced classification occurs when the distribution of classes in the training dataset is unequal, whereas ML algorithms generally assume equal class distributions. This class imbalance issue is relatively common in real-life classification problems, such as those involving microbiome datasets [135]. Although RF-based methods like Boruta excel at classifying microbiome and other omics datasets [37], linear regression methods like LASSO are in some instances less complex and less prone to overfitting, and may therefore be preferred for smaller datasets. Data augmentation of the training data using in silico methods (e.g., the TADA data augmentation software) could mitigate class imbalance effects and improve the RF model predictions [136]; however, care must be taken to avoid introducing biases.
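To make the AU-ROC criterion concrete, the following minimal Python sketch computes the AU-ROC from predicted scores using the pairwise (Mann-Whitney) identity and compares it with the 0.75 threshold mentioned above. The function names and the toy labels and scores are hypothetical and purely illustrative; they are not taken from the framework's code or from the evaluated datasets.

```python
def auroc(labels, scores):
    """AU-ROC via the Mann-Whitney identity: the fraction of (case, control)
    pairs in which the case receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


AUROC_THRESHOLD = 0.75  # accuracy threshold quoted in the text


def is_accurate(labels, scores, threshold=AUROC_THRESHOLD):
    """True if the model's AU-ROC reaches the threshold."""
    return auroc(labels, scores) >= threshold


# Made-up, imbalanced toy example: 3 cases (1) versus 7 controls (0).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1]

print(round(auroc(labels, scores), 3), is_accurate(labels, scores))
```

Because the AU-ROC is computed over case-control pairs, it does not depend directly on the class ratio. With few cases, however, each case contributes a large share of the pairs, so the estimate can be unstable in small, imbalanced cohorts.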

In conclusion, no single method consistently performed best across all datasets, indicating that each method’s assumptions and selection criteria are suited to specific data and feature characteristics. Given the advantages but also the limitations of statistical and ML methods, consensus-based approaches are increasingly recommended for predicting KMBs from metagenomic data. For instance, Nearing et al. [137] compared 14 differential abundance testing methods on 16S rRNA gene datasets, concluding that the variability in the output of individual methods highlights “an alarming reproducibility crisis” in microbiome research. They also found that a consensus-based approach substantially improved the robustness of the biological interpretations. Our results confirm these findings, demonstrating that our consensus-based KMB framework outperforms individual methods in predicting KMBs. Reanalyzing the datasets of the original studies with our KMB framework revealed biomarkers that were not identified originally, suggesting that revisiting earlier datasets may yield valuable novel biomarkers and insights. The proposed KMB framework is straightforward, using a simple yet effective decision tree. We demonstrated that KMB is a single-strategy framework that combines DAA and ML methods with statistical rigor and requires minimal information beyond the metadata of the dataset generated in a study.
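As an illustration of the general consensus idea, the Python sketch below counts how many methods select each taxon and keeps those that reach a minimum number of votes. This simple vote-counting rule is an assumption made for illustration only and is not the decision tree used in the KMB framework; the function name, the `min_votes` parameter, and the taxon labels are hypothetical.

```python
from collections import Counter


def consensus_biomarkers(selections, min_votes=2):
    """Return (taxon, votes) pairs for taxa selected by at least `min_votes`
    methods, ordered by descending vote count and then by name."""
    # set() ensures a method cannot vote twice for the same taxon.
    votes = Counter(taxon for taxa in selections.values() for taxon in set(taxa))
    kept = [(taxon, n) for taxon, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up selections per method; the taxon names are placeholders.
selections = {
    "DAA": ["taxon_A", "taxon_B", "taxon_C"],
    "LEfSe": ["taxon_A", "taxon_B"],
    "LASSO": ["taxon_A", "taxon_D"],
    "Boruta": ["taxon_B", "taxon_D", "taxon_E"],
}

result = consensus_biomarkers(selections)
print(result)
```

Raising `min_votes` trades sensitivity for robustness: fewer taxa are retained, but each one is supported by more independent methods.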

The adaptability and robustness of the KMB consensus framework promise a valuable contribution to the advancement of precision medicine by improving diagnostic tools and comprehensive statistical methods. Indeed, future-proof ensemble strategies are crucial as new statistical and ML methods are developed. Our KMB framework integrates diverse data types and adapts to new statistical and ML tools. The approach facilitates associations between clinical phenotypes, microbiome functions, and metabolites, ensuring a comprehensive and flexible method for microbiome-wide association studies. With the increasing availability of large-scale or multimodal data, unsupervised and deep learning algorithms may also be incorporated into the biomarker identification framework to capture complex patterns [138]. Indeed, preliminary evidence shows that combining various modalities can strengthen the identification of biomarker signatures [121] and support the stratification of individuals into disease types [139]. Caution is needed, however, not to overfit these methods to a specific dataset, as this may lead to “overoptimism” effects (see also the recent work on this by Ullman et al. [140]). The overfitting issue applies especially to datasets with small sample sizes, which is still the case for the majority of microbiome studies. Large and extensive training datasets are in any case essential to justify the inclusion of data-intensive neural network models.

Consensus-based biomarker prediction approaches will, however, not suffice to resolve all observed inconsistencies. Notably, the datasets used in microbiome-wide association studies result from independent (clinical) trials with different sample numbers and geographies, each with its own protocols for sample collection, library preparation, and sequencing. To further minimize inconsistencies, it is essential that cohort sizes increase and that the methods used to generate microbiome datasets (and profiles) become more standardized and robust. This latter aspect is elegantly illustrated by the perspective of Schloss [141], who showed how the lack of reproducible and robust approaches can easily lead to contrasting results. In summary, the use of different technical approaches (i.e., isolation protocols, sequencing technologies, or databases) and sampling from different geographical origins compromise the reproducibility of investigations, as recently outlined by Abdill et al. [142], who evaluated the geographic and technical effects on variation in the human gut microbiome using a large compendium (> 168,000 samples). Although these effects can be partly reduced by large-scale analyses, it remains crucial that microbiome studies include multiple complementary strategies to answer the same research question from different angles. In a field driven by rapid technological and methodological advances, the robustness of the experimental setup is thus crucial and should be considered for all parts of the analysis. Here, we see an important future challenge (and opportunity) for the microbiome research community, one that is essential for advancing our mechanistic understanding of disease progression.

Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

E.S.K., Y.G., and R.B. conceptualized the study; Y.G. and M.L.B. conducted the bioinformatics analysis; E.S.K. and W.P. supervised the study; W.P. drafted the manuscript; W.P. and R.B. revised the manuscript text for grammar and consistency, and C.A.-de-L. for biostatistical content. All authors approved the final version of the manuscript.

Funding

This research was supported by BaseClear B.V. and NWO Gravitation: BRAINSCAPES—a roadmap from neurogenetics to neurobiology (grant no. 024.004.012).

Acknowledgments

The authors would like to thank Dr. Solon Pissis for the supervision of the thesis work of Yashjit Gangopadhyay which contributed to the development of the key microbial marker framework and method comparisons.

Supporting Information

Additional supporting information can be found online in the Supporting Information section.

Data Availability Statement

The sequence data that support the findings of this study are openly available in the Sequence Read Archive at https://www.ncbi.nlm.nih.gov/sra, reference numbers SRP000383 and SRP262626, the European Nucleotide Archive at https://www.ebi.ac.uk/ena/browser/home, reference number ERP005534, and the NCBI BioProject database at https://www.ncbi.nlm.nih.gov/bioproject, reference number PRJNA433459; see also Table 1. No new sequence data was created or analyzed in this study.
