Volume 2025, Issue 1 9676659

Research Article

Open Access

Improved Key Microbial Biomarker Discovery Using Ensemble Statistical Methods

Walter Pirovano (Corresponding Author)
[email protected]
orcid.org/0000-0002-2717-4735
Product Development Department, BaseClear B.V., Leiden, the Netherlands
Department of Complex Trait Genetics, Centre for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit, Amsterdam, the Netherlands

Yashjit Gangopadhyay
Product Development Department, BaseClear B.V., Leiden, the Netherlands

Mirna Lilian Baak
Bioinformatics Department, BaseClear B.V., Leiden, the Netherlands

Christiaan Arie de Leeuw
orcid.org/0000-0003-1076-9828
Department of Complex Trait Genetics, Centre for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit, Amsterdam, the Netherlands

Radhika Bongoni
orcid.org/0000-0002-7120-1884
Product Development Department, BaseClear B.V., Leiden, the Netherlands

Eline Suzanne Klaassens
orcid.org/0000-0002-2190-5285
Product Development Department, BaseClear B.V., Leiden, the Netherlands

First published: 20 May 2025

https://doi.org/10.1155/agm3/9676659

Academic Editor: Jiong Yu

Abstract

In recent years, there has been a growing awareness of the importance of the microbiome in health and disease. Consequently, the number of large microbiome-related clinical trials has also significantly increased. However, advanced biostatistical analysis is required to properly combine microbiome taxonomic abundance data with phenotypical metadata and reliably predict disease states. While differential abundance analysis and machine-learning techniques are widely used to perform such analyses, selecting the best method is not trivial due to the complexity and specific characteristics of both the data and the algorithms. Here, we present a consensus-based key microbial biomarker (KMB) biostatistical analysis framework that links microbial abundance obtained from amplicon-based or shotgun metagenome sequencing with metadata. The framework integrates machine learning (ML) algorithms and statistical methods to determine the most relevant microbial biomarkers and signatures that explain variation in the microbial abundance counts and metadata classes based on predefined metrics. We evaluated the performance of our framework on publicly available case-control datasets of colorectal cancer, Alzheimer’s disease, and Parkinson’s disease and show that, compared to individually run methods, the combined approach is better able to detect KMB species and signatures associated with health and disease conditions. We conclude that our proposed KMB framework provides an innovative and robust strategy that can contribute to the further development of improved diagnostic tools for early disease detection, personalized medicine design, patient stratification, and a better general understanding of the mechanisms behind observed results in pre- and postclinical trials.

1. Introduction

The gut microbiome comprises a wealth of bacteria, archaea, viruses, and eukaryotes. Together, these microbiota play a crucial role in protecting the human host against diseases and in maintaining a ‘healthy’ gut microbiome [1]. In recent years, an increasing number of studies have indicated that numerous human diseases and disorders are associated with disruptions in the microbiome, leading to an imbalanced composition. These include colorectal cancer (CRC) [2, 3], inflammatory bowel disease (IBD) [4, 5], hypertension [6], diabetes and obesity [7, 8], Alzheimer’s disease (AD) [9, 10], and Parkinson’s disease (PD) [11, 12]. This phenomenon, known as dysbiosis [13], is typically correlated with individual lifestyle and clinical biomarkers such as diet [14], cholesterol [15], phosphorylated tau [16], and alpha-synuclein [17].

While a healthy gut is often associated with high microbial diversity [18], dysbiosis is characterized by reduced microbial diversity and compositional shifts. Detecting microbial biomarkers indicative of disease states is of crucial importance for the development of novel diagnostic tools [19] and personalized medicine, as some drugs are metabolized by key microbes before absorption [20]. Also, biomarker analysis deepens our understanding of the general mechanisms underlying host-microbiome interactions [21]. However, given the complexity and high dimensionality of microbiome data, advanced algorithms and statistical methods are essential to explore host-microbiome interactions and reliably predict disease states.

In this context, machine learning (ML) algorithms such as random forest (RF), logistic regression (LR), and linear discriminant analysis (LDA) have demonstrated significant value in associating taxonomic features of microbiomes with phenotypical observations [22, 23]. Methods like least absolute shrinkage and selection operator (LASSO) [24], Boruta [25], and linear discriminant analysis effect size (LEfSe) [26] are widely used to discover microbial biomarkers, though each has its strengths and limitations. Importantly, their predictive power depends on factors such as sample size, sequencing data quality, and the features provided. The variability in experimental and analysis setups across studies highlights the importance of employing multiple methods of biomarker analysis [23].

Here, we present a novel, consensus-based key microbial biomarker (KMB) prediction framework that integrates microbial taxonomic abundance tables (generated from either amplicon-based or shotgun metagenomics data) with clinical metadata. The approach combines state-of-the-art ML techniques and statistical models with complementary characteristics. In the first step, linear regression with LASSO (elastic net approach), feature selection with Boruta (combined with RF), LEfSe, and differential abundance analysis (DAA) are used to identify an initial set of biomarkers. In the second step, the identified markers are filtered and prioritized based on their significance using statistical measures including relative weight, Gini impurity, LDA score, p value, and q value.

Using four well-characterized, publicly available datasets (covering CRC, AD, and PD; see Table 1), we demonstrate that integrating multiple (and diverse) strategies into a single framework, combined with effective filtering and selection of candidate biomarkers, results in more consistent and reliable predictions. These predictions are less influenced by the specific dataset and features used. Consequently, the enhanced robustness of identified biomarkers improves our understanding of host-microbiome interactions in health and disease conditions. Furthermore, our findings highlight the clear advantages of shotgun metagenomics data over amplicon-based data for identifying biomarkers at higher-resolution taxonomic levels (i.e., species or strain level).

Table 1. Description of the datasets used for the evaluation of the KMB framework. Note that the taxonomic profiles of the Kostic et al. dataset were retrieved from the R package Phyloseq (https://bioconductor.org/packages/release/bioc/html/phyloseq.html, [55]). The taxonomic profiles of the Zeller et al. dataset were retrieved from the Supporting Information of the original publication [28].

Dataset	Disease	Cohort	Sequencing method	Data archive	KMB input
Kostic et al. [46]	Colorectal cancer (CRC)	N = 185 (95 CRC vs. 90 C)	16S rRNA (V3-V5 region); 454 GS FLX Titanium	NIH SRA archive (SRP000383)	Taxonomic profiles
Zeller et al. [28]	Colorectal cancer (CRC)	N = 141 (53 CRC vs. 88 C)	Metagenomics shotgun; Illumina HiSeq 2000/2500	ENA database (ERP005534)	Taxonomic profiles
Ling et al. [44]	Alzheimer’s disease (AD)	N = 171 (100 AD vs. 71 C)	16S rRNA (V3-V4 region); Illumina MiSeq	NIH SRA archive (SRP262626)	Illumina sequences
Qian et al. [48]	Parkinson’s disease (PD)	N = 80 (40 PD vs. 40 C)	Metagenomic shotgun; Illumina X Ten	NCBI BioProject (PRJNA433459)	Illumina sequences

Given that microbial species and strains often have very specific roles in disease progression and medication efficacy, high-resolution genomic data provide substantial insights into underlying functional mechanisms. This is crucial for the development of effective and safe microbiome-based therapies. Indeed, our investigation shows that the beneficial properties of bacteria observed in one condition may not be generalized to others, as their genomes may harbor functions that specifically influence a disease-related mechanism.

Finally, we discuss the effects and contributions of various ML methods evaluated in our study, and outline perspectives for future methodological development in microbiome analysis. We argue that the adoption of more reproducible and robust approaches is essential to advance our understanding of compositional and functional changes associated with health and disease.

2. Materials and Methods

2.1. KMB Analysis Framework

Our KMB analysis framework employs a multistep approach comprising data preprocessing (including filtering, normalization, and transformation), biomarker prediction (using DAA and varied ML methods), and finally the selection of microbial markers associated with specific conditions (typically health and disease). Depending on the input data, predictions can be made at the genus or at the species (or strain) level. In addition to identifying biomarker signatures, the framework also provides fold change values and statistical significance scores, including median relative weight, Gini index, LDA, p value and q value. An overview of the framework is shown in Figure 1.

Figure 1. Flowchart of our key microbial biomarker (KMB) analysis framework. After preprocessing of the taxonomic abundance counts and metadata, an ensemble of different methods (differential abundance analysis, LASSO (elastic net), Boruta (Random Forest), and LEfSe (LDA)) is used to predict biomarkers. Through a rigorous filtering and prioritization step, including different statistical measures, the final biomarker signature is defined.

2.1.1. Data Preprocessing

In the first step, the raw genus or species abundance counts obtained from 16S rRNA gene amplicon or shotgun metagenomics sequences are converted into relative abundance counts. The counts are then filtered to retain only those with a frequency above 0.001%. Subsequently, the data is transformed into relative abundances based on the total sum per sample. Concurrently, a factor is created from the metadata labels column (e.g., disease or control) using R, enabling the storage of metadata as levels.
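The preprocessing step can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the function names, the data layout (one dictionary of raw counts per sample), and the choice to keep a taxon when it exceeds the 0.001% threshold in at least one sample are assumptions made for illustration only.

```python
def to_relative_abundance(counts):
    """Convert raw counts of one sample into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: c / total for taxon, c in counts.items()}


def preprocess(samples, min_fraction=1e-5):
    """samples: {sample_id: {taxon: raw_count}}; 0.001% equals 1e-5.

    Assumption: a taxon is retained if its relative abundance exceeds
    min_fraction in at least one sample.
    """
    rel = {sid: to_relative_abundance(c) for sid, c in samples.items()}
    taxa = {t for profile in rel.values() for t in profile}
    kept = sorted(
        t for t in taxa
        if any(profile.get(t, 0.0) > min_fraction for profile in rel.values())
    )
    result = {}
    for sid, profile in rel.items():
        subset = {t: profile.get(t, 0.0) for t in kept}
        total = sum(subset.values())
        # Re-scale by the total sum per sample after filtering.
        result[sid] = {t: (v / total if total else 0.0) for t, v in subset.items()}
    return result


# Toy input; taxon "C" never exceeds 0.001% and is therefore dropped.
example = {
    "s1": {"A": 500000, "B": 499999, "C": 1},
    "s2": {"A": 300000, "B": 700000, "C": 0},
}
profiles = preprocess(example)
```

The metadata labels (e.g., disease or control) would be handled separately as a categorical variable, analogous to an R factor.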

2.1.2. DAA (Association Testing)

In the second step, the dependencies between the preprocessed 16S rRNA gene (or shotgun) sequence features and the metadata factors are analyzed through association testing. For this, we utilize the SIAMCAT (statistical inference of associations between microbial communities and host phenotypes) R-toolbox developed by Wirbel et al. [27], building on the earlier work by Zeller et al. [28].

SIAMCAT performs DAA across two classes using the nonparametric Wilcoxon test with multiple hypothesis correction to assess the significance of associations. This analysis yields model metrics such as the p value and the FDR-adjusted p value (q value), which can be used to evaluate the importance of selected taxa. Additionally, SIAMCAT calculates a nonparametric effect size measure for features using area under the receiver operating characteristic (AU-ROC) curve scores. For our analyses, we used the check.associations function of the SIAMCAT package with default parameters.
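To make the quantities reported by the association test concrete, the following Python sketch computes a two-sided Wilcoxon rank-sum p value, the corresponding AU-ROC effect size, and FDR-adjusted q values. It does not reproduce SIAMCAT: the normal approximation without continuity correction, the Benjamini-Hochberg procedure as the FDR method, and all function names are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_rank_sum(cases, controls):
    """Return (two-sided p value, AU-ROC) for one feature."""
    n1, n2 = len(cases), len(controls)
    pooled = list(cases) + list(controls)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2      # Mann-Whitney U for cases
    auroc = u1 / (n1 * n2)                        # P(case value > control value)
    n = n1 + n2
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0, auroc
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2)), auroc


def benjamini_hochberg(pvalues):
    """FDR-adjusted p values (q values), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


p_value, auroc = wilcoxon_rank_sum([5, 6, 7], [1, 2, 3])
q_values = benjamini_hochberg([0.01, 0.04, 0.03])
```

An AU-ROC of 0.5 indicates no separation between classes, while values near 0 or 1 indicate strong depletion or enrichment in the first group.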

While numerous differential abundance testing methods exist, we selected SIAMCAT due to its robust statistical approach and its extensive adoption in the field (see [29–31]).

2.1.3. ML Methods

In parallel to the DAA, we applied three state-of-the-art ML methods to predict biomarkers: LASSO, Boruta, and LEfSe. Each method adopts a different approach for feature selection, as detailed below.

2.1.3.1. LASSO

LASSO [24] is a regression analysis method used for feature selection and regularization. Within our KMB framework, we utilized the SIAMCAT implementation of LASSO, which integrates the original LASSO method with ridge regression to form an elastic net.

In brief, we normalized features using the log.clr normalization method, followed by cross-validation using the create.data.split function (with num.folds = 10 and num.resample = 5). In practice, each dataset was randomly split into 10 subsets; in each of 10 rounds, nine subsets were used for training and the remaining subset for testing. The whole process was repeated five times. The models were subsequently trained using the “enet” elastic net method (using the train.model function), followed by the extraction of the median relative feature (or operational taxonomic unit (OTU)) weights across the trained models, along with the percentage of models in which each feature was selected.
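The normalization and resampling scheme described above can be sketched in Python as follows. This is an illustration only and not the SIAMCAT code: the pseudocount value, the unstratified shuffling, and the function names are assumptions.

```python
import math
import random


def log_clr(rel_abund, pseudocount=1e-6):
    """Centered log-ratio transform of one sample (pseudocount is assumed)."""
    logs = [math.log(x + pseudocount) for x in rel_abund]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def repeated_kfold(n_samples, num_folds=10, num_resample=5, seed=0):
    """Return (train_indices, test_indices) pairs for repeated k-fold CV.

    Assumption: plain random shuffling; no stratification by class label.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(num_resample):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::num_folds] for f in range(num_folds)]
        for f in range(num_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            splits.append((train, test))
    return splits


# 10 folds repeated 5 times yields 50 train/test splits, hence 50 models.
splits = repeated_kfold(141)
clr_values = log_clr([0.5, 0.3, 0.2])
```

With this design every sample is used for testing exactly once per repetition, and the median feature weight is taken over all 50 trained models.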

The key advantage of the elastic net implementation lies in its ability to linearly combine L1 (LASSO) and L2 (ridge) penalties. This penalizes the coefficients of less important features, resulting in more accurate predictions. This is particularly valuable in cases where the number of features (p) exceeds the number of samples (n), that is, p > n. Specifically,

• LASSO selects at most n features as nonzero, thus effectively reducing the number of features in the model.
• Ridge regression, on the other hand, minimizes coefficients but does not reduce the feature set, as no coefficient is forced to zero.

Consequently, LASSO is ideal for scenarios where only a limited number of microbial genes or taxa (features) impact the dysbiosis, whereas ridge regression is better suited for cases where many features have a similar impact on the dysbiosis. By combining these approaches, elastic net implementation models are less dependent on the specific characteristics of the dataset studied, enhancing their robustness.
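The linear combination of the two penalties can be written compactly. The formulation below uses a commonly used parametrization with a mixing parameter; the symbols (loss function, penalty strength, and mixing parameter) are notation assumed for illustration and are not taken from the SIAMCAT documentation.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \mathcal{L}(\beta; X, y)
  \;+\; \lambda \left[ \alpha \,\lVert \beta \rVert_{1}
  \;+\; \frac{1-\alpha}{2}\, \lVert \beta \rVert_{2}^{2} \right],
  \qquad 0 \le \alpha \le 1
```

Setting α = 1 recovers the pure LASSO (L1) penalty, α = 0 recovers ridge regression (L2), and intermediate values give the elastic net.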

LASSO, either standalone or combined with ridge regression, has been widely and successfully used to measure associations between microbiome composition and metadata (see [29, 32]). Additionally, LASSO was recently combined with sparse canonical correlation analysis to identify host gene-microbiome associations that influence host pathophysiology [33].

2.1.3.2. Boruta

Boruta [25] is a feature selection algorithm that extends the RF approach developed by Breiman [34]. It was specifically built to address the limitations of RF when the number of feature variables significantly exceeds the number of samples, a common scenario in biological datasets, where a large number of genes or microbial taxa (features) must be associated with a smaller number of samples.

The primary advantage of Boruta over traditional RF feature selection algorithms is its ability to capture all features that may potentially be relevant to the phenotype. Boruta operates as a wrapper around RF, using a built-in randomization function that randomly shuffles the datasets in multiple iterations to generate “shadow features.” These shadow features serve as a baseline for comparison, allowing the algorithm to iteratively remove features proven to be less relevant.
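The shadow-feature idea can be illustrated with a toy Python sketch. Note the strong simplifications: the importance score below is a simple difference of class means used as a stand-in for RF importance, and features are only counted as "hits" rather than tested statistically. All names are hypothetical and this is not the Boruta package.

```python
import random


def make_shadow_features(X, rng):
    """Shuffle each column independently to break its link with the labels."""
    n_rows, n_cols = len(X), len(X[0])
    shuffled_cols = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        rng.shuffle(col)
        shuffled_cols.append(col)
    return [[shuffled_cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


def toy_importance(column, y):
    """Stand-in importance (NOT random forest): |mean(cases) - mean(controls)|."""
    cases = [v for v, label in zip(column, y) if label == 1]
    controls = [v for v, label in zip(column, y) if label == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def count_hits(X, y, n_iter=50, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = random.Random(seed)
    n_cols = len(X[0])
    real = [toy_importance([row[j] for row in X], y) for j in range(n_cols)]
    hits = [0] * n_cols
    for _ in range(n_iter):
        shadow = make_shadow_features(X, rng)
        best_shadow = max(
            toy_importance([row[j] for row in shadow], y) for j in range(n_cols)
        )
        for j in range(n_cols):
            if real[j] > best_shadow:
                hits[j] += 1
    return hits


# Feature 0 differs between classes; feature 1 has identical class means.
y = [1] * 10 + [0] * 10
X = [[(10 if label == 1 else 0) + i % 3, (i * 7) % 5] for i, label in enumerate(y)]
hits = count_hits(X, y)
```

Features that rarely or never outperform the best shadow feature are the candidates for removal in subsequent iterations.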

To train the RF model available within the Boruta R package and to evaluate its predictive performance, we followed the approach presented in Gong et al. [35]. This approach facilitates hyperparameter tuning while also selecting features for building the learning model. In summary, we created a series of random test and training partitions using the createDataPartition function in R. By setting parameter p to 70, 70% of the data (samples) was used for training. The remaining 30% was used for testing and model performance evaluation. Subsequently, a repeated 5-fold cross-validation was applied for feature selection.

While Boruta relies on Z-scores to quantify the accuracy loss (a measure of feature importance), RF uses the Gini index (also known as Gini impurity) as an indicator of feature importance. The Gini index measures the probability that a randomly selected sample would be incorrectly classified if it were labeled at random according to the class distribution of its group. Gini impurity reaches zero when all records in a group fall into a single category. This property is particularly useful in decision trees for determining the importance of a variable in classifying the label. Features with a higher mean decrease in Gini are considered more important [36].
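As a small worked example, Gini impurity and the decrease in impurity achieved by a single split can be computed as follows. This Python sketch uses hypothetical function names; in a random forest, the decrease would additionally be averaged over all splits and trees that use a given feature.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a pure group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


pure = gini_impurity(["CRC"] * 4)
mixed = gini_impurity(["CRC", "control"])
perfect_split = gini_decrease(
    ["CRC", "CRC", "control", "control"], ["CRC", "CRC"], ["control", "control"]
)
```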

One limitation of Boruta is that while it identifies feature importance, it does not directly indicate how features are associated with a specific state (e.g., health or disease). Despite this, Boruta is a robust and efficient feature selection method capable of handling high-dimensional datasets. It has been shown to outperform the standard RF approaches in various microbial biomarker classification studies [37, 38].

2.1.3.3. LEfSe

The ML algorithm LEfSe was developed by Segata et al. [26] and is specifically designed for biomarker discovery. It is currently one of the most frequently used methods in the field. In brief, LEfSe identifies microbial features that best explain differences among classes (phenotypes). First, statistically different features between classes are determined using the nonparametric factorial Kruskal–Wallis sum-rank test [39]. Subsequently, the (unpaired) Wilcoxon rank-sum test [40, 41] is applied to assess the biological consistency. Finally, LDA [42] is used to estimate the effect relevance. The method also offers an option to incorporate prior knowledge to constrain the (high) dimensionality of the data. As a result, the importance of the biomarkers found is ranked according to their log-LDA scores, derived from the effect size analysis, and the p values for microbial feature significance. LEfSe was previously validated on different metagenomic microbiome datasets (human, mouse, and environmental) highlighting the general applicability of the method (see [43–45]).

2.1.4. Biomarker Selection

The ensemble of the DAA and the ML approaches (LASSO, Boruta, and LEfSe) generates a list of biomarker features (OTUs). These features are subsequently ranked by their degree of importance (effect sizes) based on the following criteria: associated q value (DAA), median relative weight (LASSO), Gini index (Boruta), and LDA scores and p value (LEfSe). Additionally, the DAA method calculates the log fold-change in taxa. We applied significance thresholds that are conventional in the field. The p and q value cutoffs chosen for DAA and LEfSe are standard significance thresholds. In addition, for LEfSe, the LDA score of 2 is the default value of the package. The AU-ROC threshold of 0.75 used for LASSO and Boruta is a frequently used value that indicates an acceptable model. The value is also halfway between 0.5 and 1, which indicate a random and a perfect model, respectively. For models that did not meet the 0.75 threshold, we used the Top 10 hits to still allow for comparison with other methods. The final consensus of important biomarker features is then defined as follows:

● Select biomarkers found with:
  a. DAA (q value ≤ 0.05);
  b. LASSO (when AU-ROC ≥ 0.75, select features with percentage ≥ 0.9 and median relative weight ≥ 0.05; when AU-ROC is between 0.5 and 0.75, select Top 10 most important features and mark with (§) to indicate lower accuracy of the prediction model);
  c. Boruta (Top 10 contributing features based on the highest Gini index; when AU-ROC is between 0.5 and 0.75, mark features with (§) to indicate lower accuracy of the prediction model);
  d. LEfSe (LDA score ≥ 2 and p value ≤ 0.05 or ≤ 0.001, when number of features is < 50, mark features with LDA score < 3 as less important (#)).
● Keep biomarker signatures that are supported by DAA and at least one ML method.
● Prioritize biomarkers based on the consensus between multiple methods and the significance levels. Note that the final list of biomarkers (as also presented in Tables 2, 3, 4, and 5) is sorted by the fold change as this is an interpretable and universal metric.
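The consensus rule (support from DAA plus at least one ML method, followed by sorting on fold change) can be sketched in Python as shown below. The data structure and function names are assumptions for illustration. The three named genera and their fold changes are taken from Table 2, whereas "Taxon_X" and "Taxon_Y" are hypothetical entries added to show taxa that fail the rule.

```python
ML_METHODS = ("lasso", "boruta", "lefse")


def consensus_biomarkers(hits):
    """hits: {method: set of taxa that passed that method's own thresholds}.

    Keep taxa supported by DAA and by at least one ML method.
    """
    supported = {}
    for taxon in hits.get("daa", set()):
        ml_support = [m for m in ML_METHODS if taxon in hits.get(m, set())]
        if ml_support:
            supported[taxon] = ["daa"] + ml_support
    return supported


def sort_by_fold_change(supported, fold_change):
    """Order the final list by increasing fold change, as in Tables 2 to 5."""
    return sorted(supported, key=lambda taxon: fold_change[taxon])


hits = {
    "daa": {"Campylobacter", "Fusobacterium", "Bacteroides", "Taxon_X"},
    "lasso": {"Bacteroides", "Taxon_Y"},      # Taxon_Y lacks DAA support
    "boruta": set(),
    "lefse": {"Campylobacter", "Fusobacterium"},
}
fold_change = {
    "Campylobacter": -0.9186,
    "Fusobacterium": -0.55353,
    "Bacteroides": 0.28736,
}
selected = consensus_biomarkers(hits)
ranked = sort_by_fold_change(selected, fold_change)
```

Here "Taxon_X" is dropped because no ML method supports it, and "Taxon_Y" is dropped because it lacks DAA support.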

Table 2. Biomarker prediction results obtained on the colorectal cancer (CRC) dataset of Kostic et al. [46]. In bold are the genera that are found increased in CRC patients; in underline are the genera that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the genera that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Genera indicated with ^∗ were also identified in the original analysis by Kostic et al.

OTU (genus)	LASSO (mdn rel wt)	Boruta (Gini ind)	LEfSe (p value)	DAA (q value)	Fold change
Campylobacter	—	—	0.00048	0.00916	−0.9186
Fusobacterium ^∗	—	—	5.57e − 05	0.00248	−0.55353
Unclassified Clostridiales ^∗	0.01688 §	4.521	—	0.01112	0.25932
Bacteroides ^∗	−0.01397 §	—	—	0.00875	0.28736
Ruminococcus ^∗	—	5.45715	—	0.03190	0.30229
Unclassified Ruminococcaceae ^∗	—	4.5878	0.00026	0.00361	0.32102
Eubacterium	—	4.1848	3.48e − 06	0.01112	0.35281
Faecalibacterium ^∗	—	—	0.00075	0.00047	0.4359
Parabacteroides ^∗	—	—	0.00011	0.00686	0.53516
Alistipes ^∗	—	3.56422	—	0.03167	0.57941
Bilophila	—	3.41084	0.00039	0.04964	0.67633
Unclassified Rikenellaceae ^∗	−0.01402 §	—	0.00072	0.02788	0.71139
Collinsella ^∗	−0.01856 §	—	1.36e − 05	0.00091	1.10529

Table 3. Biomarker prediction results obtained on the colorectal cancer (CRC) dataset of Zeller et al. [28]. In bold are the species/strains that are found increased in CRC patients; in underline are the species/strains that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the taxa that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Taxa indicated with ^∗ were also identified in the original analysis by Zeller et al.

OTU (species/strain)	LASSO (mdn rel wt)	Boruta (Gini ind)	LEfSe (p value)	DAA (q value)	Fold change
Bacteroides fragilis [1090] ^∗	0.0069	—	—	0.03025	−0.89583
Fusobacterium nucleatum subsp. animalis [1481] ^∗	0.01915	2.71736	4.97e − 08 #	7.58e − 03	−0.87419
Peptostreptococcus stomatis [1530] ^∗	0.0182	2.74931	5.94e − 05	0.00301	−0.86725
Clostridium symbiosum [1600] ^∗	0.01231	—	0.00073 #	0.01576	−0.80188
Parvimonas micra [1505]	—	3.04311	—	0.02234	−0.65777
Fusobacterium nucleatum subsp. vincentii [1482] ^∗	0.03107	3.29898	—	1.58e − 04	−0.5973
Unclassified Parvimonas sp. [1507]	0.01161	—	0.00048 #	0.01219	−0.52338
Unclassified Parvimonas sp. [1506]	—	—	0.00071 #	0.01576	−0.51558
Fusobacterium nucleatum subsp. polymorphum [1480] ^∗	0.00964	—	—	0.00365	−0.47909
Porphyromonas asaccharolytica [1056] ^∗	0.01619	—	5.72e − 06 #	0.00044	−0.40618
Fusobacterium nucleatum subsp. nucleatum [1479] ^∗	0.01349	—	—	0.00044	−0.35996
Unclassified Ruminococcaceae bacterium [u:1580]	—	1.9081	—	0.02058	−0.33357
Pseudoflavonifractor capillosus [1579]	0.02118	2.39158	1.26e − 05 #	0.00076	−0.32359
Clostridium hylemonae [1607] ^∗	0.01719	—	—	0.02176	−0.31261
Prevotella nigrescens [1069]	—	—	0.00013 #	0.0049	−0.25569
Porphyromonas uenonis [1057]	0.00945	—	—	0.00845	−0.17249
Campylobacter rectus [1720]	0.00893	—	—	0.03124	−0.06126
Campylobacter gracilis [1724]	0.01175	—	—	0.03025	−0.06126
Leptotrichia hofstadii [1488]	0.01188	—	—	0.0381	0
Methanosphaera stadtmanae [90] ^∗	−0.01664	—	—	0.03869	0.30127
Unclassified Ruminococcus [1620] ^∗	−0.0101	—	—	0.02176	0.42972
Unclassified Ruminococcus [1621] ^∗	—	—	0.00047	0.01219	0.44361
Streptococcus salivarius [1377] ^∗	−0.01566	3.63078	—	0.02176	0.51372
Eubacterium hallii [1597]	—	—	0.00087	0.01653	0.62221
Eubacterium rectale [1630] ^∗	—	1.90125	0.00048	0.01219	0.66217
Eubacterium ventriosum [1629] ^∗	—	—	0.0008	0.01622	0.90023

Table 4. Biomarker prediction results obtained on the Alzheimer’s disease (AD) dataset of Ling et al. [44]. In bold are the genera that are found increased in AD patients; in underline are the genera that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the genera that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Genera indicated with ^∗ were also identified in the original analysis by Ling et al.

OTU (genus)	LASSO (mdn rel wt)	Boruta (Gini ind)	LEfSe (p value)	DAA (q value)	Fold change
Akkermansia ^∗	—	—	1.15e − 07	3.56e − 08	−1.4734
Enterococcus ^∗	0.00859	—	5.7e − 08	0.00011	−0.99303
Sellimonas	0.01926	—	3.01e − 05 #	0.00038	−0.91614
Unclassified Clostridiales Family XIII. Incertae Sedis	—	—	0.00011 #	0.00116	−0.78301
Lysobacter	0.01745	—	—	0.00878	−0.74891
Eggerthella ^∗	—	—	0.00018	0.00146	−0.70743
Bifidobacterium ^∗	—	8.27538 §	0.00028	0.00225	−0.60623
Pelagibacterium	0.01945	3.46263 §	—	0.01134	−0.60251
Erysipelatoclostridium	0.01306	3.18431 §	—	0.02804	−0.28292
Eubacterium ^∗	−0.02397	—	0.00013	0.00116	0.41253
Acetivibrio	—	4.83916 §	—	0.03658	0.41552
Anaerotaenia	—	—	0.00041 #	0.00321	0.59836
Murimonas	—	—	0.0005 #	0.00368	0.66313
Parasporobacterium	—	—	0.00058 #	0.0041	0.70144
Unclassified Clostridiales	−0.00849	—	—	0.02175	0.70757
Lutispora	—	—	0.00013 #	0.00116	0.72373
Denitrobacterium	−0.02897	—	0.00067 #	0.00459	0.73651
Intestinibacter	−0.00945	—	—	0.01244	0.75557
Anaerobium	−0.01814	—	0.00012 #	0.00116	0.80947
Coprococcus ^∗	—	—	8.83e − 09	0.00105	0.83777
Lachnoanaerobaculum	—	—	9.52e − 06#	0.00015	0.87453
Falcatimonas	—	—	1.79e − 06#	3.99e−09	0.89331
Haemophilus	—	—	9.02e − 06#	0.00015	0.8996
Pseudobutyrivibrio	−0.01143	—	2.16e − 05#	0.0003	0.91829
Dialister ^∗	−0.00994	—	—	0.01134	0.96135
Butyricicoccus ^∗	−0.01305	—	0.00011 #	0.00116	0.98736
Unclassified Lachnospiraceae	−0.00981	—	2.63e − 06	1.25e−08	1.00242
Roseburia ^∗	—	4.66922 §	1.09e − 06	9.78e−07	1.00381
Romboutsia ^∗	−0.01967	—	1.57e − 09	0.00023	1.05482
Fusicatenibacter	−0.00950	—	1.2e − 06	3.56e−08	1.07552
Butyrivibrio	−0.12439	—	2.81e − 08#	1.25e−08	1.08997
Lachnospira	−0.00903	—	6.26e − 07#	1.6e−09	1.09979
Faecalibacterium ^∗	−0.01020	3.94238 §	7.89e − 03	1.41e−07	1.13233

Table 5. Biomarker prediction results obtained on the Parkinson’s disease (PD) dataset of Qian et al. [48]. In bold are the species that are found increased in PD patients; in underline are the species that are found increased in healthy controls. Results are sorted on increasing fold change values. For Boruta, the species that most significantly contributed to the biomarker signature are provided (not related to a specific condition). Predictions that are based on a model with AU-ROC score < 0.75 are marked with §. Predictions obtained by LEfSe with an LDA score < 3 are marked with #. Biomarkers indicated with ↑ were found increased in other (independent) case-control studies included in the review by Tan et al. [49]. Only the biomarker Gordonibacter pamelaeae agreed with the original analysis by Qian et al., whereas Megasphaera elsdenii and Akkermansia muciniphila were also identified in one or more other (independent) case-control studies. Of note, none of the 12 species predicted by Qian and collaborators could be confirmed by any of the 30 studies reviewed by Tan et al.

OTU (species)	LASSO (mdn rel wt)	Boruta (Gini ind)	LEfSe (LDA)	LEfSe (p value)	DAA (q value)	Fold change	Qian [48]	Bedarf [50]	Baldini [51]	Cirstea [52]	Tan [49]
Megasphaera elsdenii	—	—	3.64077	2.96e − 08	0.00385	−0.99623	—	—	—	—	↑
Ligilactobacillus salivarius	0.01488 §	—	2.6761 #	0.00036 #	0.02659	−0.89106	—	—	—	—	—
Akkermansia muciniphila	—	—	3.40865	0.00034	0.02659	−0.80375	—	↑	↑	↑	↑
Gordonibacter pamelaeae	0.01329 §	—	—	—	0.02145	−0.72737	↑	—	—	—	—
Gordonibacter urolithinfaciens	0.0151 §	—	—	—	0.0177	−0.68152	—	—	—	—	—
Vescimonas coprocola	—	—	3.72433	0.0001	0.03147	−0.65045	—	—	—	—	—
Acidaminococcus massiliensis	—	—	3.06366	0.00027	0.026354	−0.64975	—	—	—	—	—
Megasphaera hexanoica	0.01164 §	—	2.24826 #	0.00016 #	0.0237	−0.62302	—	—	—	—	—
Streptococcus gordonii	0.01633 §	—	—	—	0.02145	−0.578	—	—	—	—	—
Intestinimonas butyriciproducens	—	—	2.98681 #	0.00041 #	0.02659	−0.57421	—	—	—	—	—
Slackia isoflavoniconvertens	—	—	2.16649 #	0.00039 #	0.02659	−0.5664	—	—	—	—	—
Bittarella massiliensis	0.01555 §	—	2.00191 #	0.00029 #	0.02635	−0.51399	—	—	—	—	—
Pusillimonas faecalis	—	—	2.53823 #	0.00065 #	0.02966	−0.48288	—	—	—	—	—
Fervidicella metallireducens	0.01514 §	—	—	—	0.02966	−0.47635	—	—	—	—	—
Angelakisella massiliensis	—	—	2.57424 #	0.00017 #	0.0237	−0.4655	—	—	—	—	—
Pseudoflavonifractor gallinarum	—	—	2.2313 #	0.00022 #	0.02371	−0.42042	—	—	—	—	—
Amycolatopsis thailandensis	0.01469 §	—	—	—	0.02371	−0.39584	—	—	—	—	—
Clostridium amylolyticum	0.01497 §	—	—	—	0.0237	−0.38074	—	—	—	—	—
Pseudoflavonifractor phocaeensis	—	—	2.16752 #	0.00081 #	0.02966	−0.37963	—	—	—	—	—
Pseudoflavonifractor capillosus	—	—	2.39011 #	0.00098 #	0.03147	−0.37271	—	—	—	—	—
Paenibacillus mucilaginosus	0.01663 §	—	—	—	0.02145	−0.3601	—	—	—	—	—
Desulfovibrio legallii	—	—	2.82445 #	0.00041 #	0.02659	−0.30997	—	—	—	—	—

2.2. Diversity Analysis

To determine alterations in microbiota diversity between healthy and disease subjects, we performed alpha and beta diversity analyses. Alpha diversity is a local measure that refers to the average species diversity in a specific habitat or area. We measured alpha diversity as the observed richness (number of taxa) or evenness (relative abundances of those taxa) of an average sample within a habitat type. In total, we included three alpha diversity metrics: observed OTUs, Shannon index [53], and Simpson index [54].
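For reference, the three alpha diversity metrics can be computed per sample as in the Python sketch below. The source does not state which variants were used, so the natural logarithm for the Shannon index and the 1 − D form of the Simpson index are assumptions, as are the function names.

```python
import math


def observed_otus(counts):
    """Richness: number of taxa with a nonzero count."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """H = -sum(p * ln p) over taxa with nonzero abundance (natural log assumed)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """1 - sum(p^2); higher values indicate a more even community (form assumed)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


sample_counts = [10, 10, 0]
richness = observed_otus(sample_counts)
shannon = shannon_index(sample_counts)
simpson = simpson_index(sample_counts)
```

For this toy sample with two equally abundant taxa, the richness is 2, the Shannon index equals ln 2, and the Simpson index equals 0.5.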

Furthermore, we quantified beta diversity, which is defined as the variability in community composition (i.e., the identity of taxa observed) among samples within a habitat. To assess beta diversity, we performed a redundancy analysis (also known as principal components analysis of instrumental variables), a statistical technique designed to relate two sets of variables, where one set is dependent on the other. The aim of redundancy analysis is to maximize the explained variance of the dependent variables through a linear combination of explanatory variables. The principal components of a collection of points in a real coordinate space are defined as a sequence of p unit vectors, wherein the i-th vector represents the direction of a line that best fits the data while being orthogonal to the first i − 1 vectors. In our analysis, the best-fitting line is the one that minimizes the average squared distance from the points to the line. These principal component directions form an orthonormal basis in which the individual dimensions of the data are linearly uncorrelated.
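The definition of the principal component directions given above can be stated formally. In the expression below, the notation is assumed for illustration: X is the column-centered data matrix with rows x_k (k = 1, ..., n) and w_i is the i-th unit vector.

```latex
w_i \;=\; \arg\max_{\substack{\lVert w \rVert = 1 \\ w \,\perp\, w_1, \dots, w_{i-1}}}
      \operatorname{Var}(Xw)
    \;=\; \arg\min_{\substack{\lVert w \rVert = 1 \\ w \,\perp\, w_1, \dots, w_{i-1}}}
      \frac{1}{n} \sum_{k=1}^{n} \bigl\lVert x_k - (x_k^{\top} w)\, w \bigr\rVert^{2}
```

The two forms are equivalent because, for a unit vector w, the squared distance of a point to the line equals its squared norm minus its squared projection onto w, so minimizing the average distance maximizes the projected variance.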

2.3. Method Evaluation

2.3.1. Datasets Used for Evaluation

To evaluate our KMB consensus approach and to compare its performance with individual methods, four datasets were used. An overview of these datasets, their sources, and the results from the original studies is provided in Table 1. Further details about each dataset are provided below.

2.3.1.1. CRC Datasets

The first dataset analyzed was obtained from a study by Kostic et al. [46], which investigated microbiota changes associated with the development of CRC. Using a combination of whole genome sequencing, 16S rRNA sequencing, and quantitative PCR data, the authors found that the genus Fusobacterium (most likely the species F. nucleatum) was enriched in tumor samples, whereas the relative abundance of the phyla Bacteroidetes and Firmicutes was reduced.

For the evaluation of our KMB framework, we reanalyzed the 16S rRNA-based taxonomic profiles obtained from the original study, which included N = 185 subjects with assigned diagnostic attributes (95 CRC patients and 90 controls). This dataset represents one of the largest 16S rRNA gene datasets available for CRC. Although the authors also performed shotgun metagenome sequencing, that analysis was limited to nine tumor/control pairs. We therefore decided not to include the shotgun data in our KMB evaluation; instead, we utilized the taxonomic abundance profiles generated by Zeller et al. [28], who performed shotgun metagenome sequencing on a substantially larger cohort of N = 141 subjects (53 CRC patients and 88 controls), all resident in France. The CRC group comprised patients with AJCC stage I-IV tumors, while the control group comprised 61 healthy individuals and 27 individuals with small adenomas (> 1 cm).

In line with the findings of Kostic et al., Zeller et al. also observed an increase in the phylum Fusobacteria in CRC patients, along with a depletion of the phylum Firmicutes. In contrast, Zeller et al. reported an enrichment of the phylum Bacteroidetes in tumor samples. At the species level, Zeller et al. identified 22 microbial markers collectively associated with CRC. Among these, two F. nucleatum subspecies (vincentii and animalis) were highlighted as the most important biomarkers promoting carcinogenesis. However, the inclusion of two additional species, Porphyromonas asaccharolytica and Peptostreptococcus stomatis, was necessary to create a signature capable of differentiating between health and disease. Neither Kostic et al. nor Zeller et al. found significant differences in microbial diversity between control and tumor samples. To evaluate the Kostic et al. dataset, we retrieved the taxonomic profiles from the R package Phyloseq (available at https://bioconductor.org/packages/release/bioc/html/phyloseq.html [55]). To analyze the Zeller et al. dataset, we retrieved the taxonomic profiles from the Supporting Information of the original publication [28].

2.3.1.2. AD Dataset

The third dataset used was obtained from a study by Ling et al. [44], which investigated the alterations in gut microbiota in a cohort of Chinese AD patients with N = 171 subjects (100 AD patients and 71 controls). To date, this remains one of the largest gut-microbiome cohorts available for AD. Using a targeted sequencing approach of the 16S rRNA gene V3-V4 region, the authors identified 24 bacterial genera that differed significantly between the two groups. In particular, the study reported a decreased presence of several butyrate-producing genera, such as Faecalibacterium, Roseburia, Coprococcus, and Butyricicoccus, in the feces of AD patients. Concurrently, a significant increase in lactic acid-producing bacteria, such as Bifidobacterium and Enterococcus, and propionate-producing bacteria, such as Akkermansia, was observed in the stool samples of AD patients. At the phylum level, Ling et al. observed that fecal samples from AD patients were enriched for the phyla Actinobacteria and Verrucomicrobia, whereas the phylum Firmicutes was depleted. In addition, the study noted a strong reduction in the overall bacterial diversity of AD-associated fecal microbiota.

To evaluate this dataset with our KMB consensus prediction framework, we retrieved the Illumina MiSeq sequence data of the study from the Sequence Read Archive (SRA accession SRP262626). The paired-end reads were subsampled to 10 Mbp per sample and merged into pseudoreads (based on overlapping nucleotides) using USEARCH Version 9.2 [56]. The resulting pseudoreads were clustered for uniqueness, filtered for chimera sequences, and taxonomically classified with SNAP (Version 1.0.23) through alignment against the RDP database (Version 11.5) [57]. Pseudoreads were clustered into OTUs at 97% similarity using the USEARCH “cluster_otus” function. Reads that could not be classified were discarded and excluded from further analysis.

2.3.1.3. PD Dataset

The fourth evaluated dataset was generated by Qian et al. [48], who investigated the association of gut microbiota with PD in a cohort of N = 80 (40 PD patients and 40 controls). Using a shotgun metagenomic sequencing approach, the authors compared the two groups from both taxonomic and functional perspectives. From the phylogenetic comparative analysis performed using MetaPhlAn [58], Qian et al. found significant differences at the phylum level for Synergistetes, Verrucomicrobia, and “viruses with no name”, as well as at the genus level for Alistipes, Holdemania, Streptococcus, Granulicatella, Gordonibacter, Lactobacillus, and Enterobacter (all of which were increased in PD). In addition, 13 species, 10 of which belonged to the phylum Firmicutes, were found to be enriched in PD patients. The authors further performed metagenomic species (MGS) identification following the method presented by Nielsen et al. [59] and clustered 174,964 genes that were significantly differentially abundant between the controls and PD patients. Of the identified 153 MGS, 40 could be assigned to a specific genus, particularly Bacteroides (14 MGS) and Alistipes (5 MGS). Interestingly, while most of the identified MGS enriched in the healthy controls were associated with Bacteroides (at the genus and species level), 5 of the 14 Bacteroides genus MGS were increased in PD patients. Moreover, of the 5 Bacteroides species MGS, 4 were increased in healthy controls, and 1 was elevated in PD patients. This observation aligns with the general consensus that taxonomic identification at the genus or higher level is often insufficient to explain microbiota dysbiosis, emphasizing the need for analysis at strain or, at least, species level resolution [60].

To further evaluate the findings from our KMB framework, we included the results of a comprehensive literature review by Tan et al. [49] of microbiome studies in PD. The authors screened a total of 32 case-control studies (including the study of Qian et al.), comprising 2 shotgun metagenomics sequencing, 28 16S rRNA gene sequencing, and 2 quantitative RT-PCR studies. Across all the studies, Tan et al. identified a total of 102 genera and 44 species potentially associated with PD dysbiosis. However, the overall consistency between these studies was low, with only about one-quarter of the results being replicable. At the genus level, contradictory results were also observed, with some studies reporting certain biomarkers as increased in PD while others found the same biomarkers enriched in healthy controls. Nevertheless, the genera Akkermansia and Bifidobacterium, as well as the species Akkermansia muciniphila, were most consistently predicted as potential biomarkers, showing an increased abundance in PD. In healthy subjects, multiple studies reported an increase of the genera Faecalibacterium and Blautia and the species Akkermansia muciniphila.

Concerning alpha diversity, Qian et al. observed a strong increase in diversity in the gut microbiome of PD patients compared to healthy controls. However, this observation was only confirmed by 9 of the 24 studies reviewed by Tan et al. The majority of the other studies (N = 14) did not report significant differences between groups, while one study reported a decrease in alpha diversity in PD patients. All studies reviewed by Tan et al., including the study by Qian et al., consistently observed significantly different beta diversity between PD patients and healthy controls.

For our evaluation, we downloaded the Illumina HiSeq X Ten data from the Data Repository for Human Gut Microbiota (Project PRJNA433459). The paired-end sequences were subsampled to 5 Gbp per sample and taxonomically classified using Kraken2 (Version 2.1.1 [61]) and Bracken (Version 2.6.0 [62]) software to estimate the species-level abundances. Our reference sequences consisted of bacterial, fungal, archaeal, and viral genomes, which were retrieved from the NCBI RefSeq database (release 209; downloaded February 22, 2022). The resulting taxonomic profiles were used as input for the KMB analysis framework.

2.3.2. Performance Assessment

The individual models and the consensus-based KMB model were evaluated based on their ability to accurately predict biomarkers (or biomarker signatures) identified in the original studies and corroborated by other literature. Additionally, we evaluated the statistical significance associated with the predictions using the metrics employed by the individual methods.

For the DAA with SIAMCAT, the significance of the log fold changes was assessed using p values and q values (or adjusted p values). These values were derived from two-tailed Z-tests, as described by Cardoso et al. [63] and Wirbel et al. [27]. A significance threshold of p ≤ 0.05 and q ≤ 0.05 was applied. For the LR method (which includes LASSO and ridge regression), we used the median relative feature weights as a performance metric. For the RF model (included in Boruta), we used Gini impurity scores to assess the feature performance. Note that there is no fixed threshold or acceptance criteria for Gini impurity scores; hence, features were ranked in decreasing order of Gini impurity scores. For the LEfSe method, we used LDA scores to assign the importance of taxa (feature) in discriminating between groups, and p values to assess the significance. Note that we used the default settings for the p value cut-off (≤ 0.05) and log LDA score (≥ 2.0).
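The combined p value and q value threshold described above can be sketched as follows. The text does not state which multiple-testing correction produced the q values, so the Benjamini-Hochberg procedure used here is an assumption, and the function names `benjamini_hochberg` and `significant` as well as the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def significant(p_values, alpha=0.05):
    """Flag features that pass both the p value and the q value threshold."""
    q_values = benjamini_hochberg(p_values)
    return [p <= alpha and q <= alpha for p, q in zip(p_values, q_values)]


# Hypothetical p values for four taxa.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
flags = significant(p)
```

In this example, three taxa pass the unadjusted p ≤ 0.05 cut-off, but only the first also passes q ≤ 0.05, which shows how requiring both thresholds narrows the candidate list.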

The overall performance characteristics of the LR (LASSO/elastic net) and RF (Boruta) models were further assessed using the AU-ROC and the area under the precision-recall curve (AU-PRC). ROC curves were obtained by plotting the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, equal to 1 − specificity). While ROC is typically used for plotting the diagnostic ability of a binary classifier, it can also be applied to a multiclass classification problem, as found in biological datasets, with the help of pairwise comparison analysis. As there is no generic optimal AU-ROC threshold value, we used 0.75 as a cut-off, which lies midway between 0.5 (random performance of the model) and 1 (perfect model performance). Precision-recall (PR) curves were obtained by plotting the precision (calculated by dividing the number of true positive predictions by the total number of positive predictions) against the recall (calculated by dividing the number of true positive predictions by the total number of actual positives). Similarly, for PR curves there is no generic optimal threshold value; however, the closer the curve is to the upper right corner, the better the model’s performance.
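To make the two summary metrics concrete, the following minimal sketch computes an AU-ROC and a step-wise AU-PRC (average precision) for a binary classifier without external libraries. It is an illustration only and not the implementation used by the evaluated tools; the function names, labels, and scores are hypothetical, and the average precision function assumes there are no tied scores.

```python
def au_roc(labels, scores):
    """AU-ROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    true_pos = 0
    area = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            area += true_pos / k      # precision at each new recall level
    return area / n_pos


# Hypothetical predictions: 1 = disease, 0 = control.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc = au_roc(labels, scores)
prc = average_precision(labels, scores)
passes_cutoff = roc >= 0.75           # the 0.75 cut-off described in the text
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AU-ROC is 8/9 and the hypothetical model would pass the 0.75 cut-off.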

2.3.3. Confounding Effects

All datasets included in this study were analyzed for potential confounding factors (e.g., patient and other metadata) by the authors of the original studies. To summarize, Kostic et al. [46] observed a minor correlation between patient geographic location and microbial diversity, likely due to the difference in sample collection methods. An association between higher tumor grades and increased microbial diversity was also noted. Zeller et al. [28] indicated that patient age differed significantly between control and CRC cases. Ling et al. [44] analyzed different clinical indicators and found that the Mini-Mental State Examination (MMSE), Wechsler Adult Intelligence Scale (WAIS), and Barthel scores were significantly lower in AD patients compared to controls. Finally, Qian et al. [48] found that only PD status influenced the PD Index in their cohort.

Since the potential effect of confounding factors was thoroughly examined by the original authors, no additional confounder analysis was performed in this study.

3. Results and Discussion

For our evaluations, we used published 16S rRNA gene and metagenomics shotgun datasets from four case-control studies investigating associations between microbiome and CRC [28, 46], AD [44], and PD [48], as described in the Materials and Methods section.

For all datasets, we first assessed the relative microbial abundances in the healthy and disease samples and subsequently compared the microbial richness between the two groups using different alpha diversity metrics (observed OTUs, Shannon index, and Simpson index). Then, we assessed the performance of our consensus-based KMB approach and the individual methods (LR-LASSO/elastic net, RF-Boruta, LEfSe, and DAA) by comparing their biomarker predictions with the original study results and other literature sources. The evaluation parameters include significance thresholds (p value and the adjusted p value or q value), feature importance scores (median relative weight, Gini impurity, and LDA scores), and model accuracy metrics (AU-ROC and AU-PRC). Evaluations of the 16S rRNA gene datasets were performed at genus level, whereas evaluations of the shotgun datasets were performed at species and strain level.

3.1. Evaluations of the CRC 16S rRNA Gene Dataset

3.1.1. Abundance and Diversity Analysis

In line with the findings of Kostic et al. [46], the phyla Bacteroidetes and Firmicutes were significantly more abundant in healthy control samples, whereas the relative abundance of the phylum Fusobacteria was enriched in tumor samples (Figure 2a). Next, we assessed microbial richness in healthy controls and CRC patients using three alpha diversity indices: observed OTUs, Shannon, and Simpson. While Kostic et al. did not find significant differences using cumulative OTU counts and the Chao1 diversity index, our evaluation consistently indicated a significant reduction (p value ≤ 0.05) in microbial diversity in CRC patient samples across all three indices (Figure 2b). Importantly, two other major studies that investigated the association between CRC and gut microbiome [64, 65] also reported decreased bacterial diversity in tumor samples, using the Shannon diversity and evenness indices.

3.1.2. Performance of the KMB Framework

We further performed biomarker predictions on the Kostic et al. dataset using our KMB framework. In total, 13 genera were detected by KMB, 10 of which were also identified by Kostic et al. using LEfSe for their analysis (LDA significance threshold of 4.2) (Table 2). Notably, our KMB framework confirmed the enriched presence of Fusobacterium in CRC-associated metagenomes, a finding validated by quantitative PCR and metagenomic shotgun analyses and corroborated by subsequent studies (particularly involving Fusobacterium nucleatum). This genus ranked among the top three biomarkers in our analysis (LEfSe p = 5.57e-05; DAA q = 0.00248). Other genera identified by both KMB and Kostic et al. and frequently associated with CRC in the literature include Faecalibacterium, Bacteroides, and Alistipes. Our analysis also predicted a significant enrichment of Eubacterium and Campylobacter, which were not reported by Kostic et al. Both genera had significant predictions via LEfSe (p < 0.01) and DAA (q ≤ 0.05) and at least one additional method (LASSO or Boruta). The role of Eubacterium in the proliferation of CRC has drawn considerable attention due to its production of butyrate, a short-chain fatty acid (SCFA) shown to slow tumor progression and to have novel therapeutic effects [66]. In line with our results, Zhang et al. [67] reported decreased levels of Eubacterium (p = 0.009) in gut microbiota of healthy subjects. Specific species such as E. hallii [65] and E. rectale (p ≤ 0.05) [68] have been identified as potential CRC “driver” bacteria. Recent in vitro and in vivo studies by Ryu et al. [69] demonstrated anti-CRC effects of E. callanderi KGMB02377, which also contains pathway genes for γ-aminobutyric acid (GABA) synthesis, thus potentially explaining the inhibition of CRC progression. Similarly, studies have linked an increased abundance of Campylobacter spp. to CRC proliferation, particularly the oral pathogenic strain C. jejuni in a mouse study [70]. Importantly, C. jejuni produces cytolethal distending toxin (CDT), which has been shown to promote intestinal inflammation.

3.1.3. Performance of the Individual Biomarker Prediction Methods

At the method level, DAA and LEfSe demonstrated a strong and consistent performance, agreeing on the top nine biomarkers, albeit with varying order of significance. By contrast, LASSO and Boruta predicted fewer biomarkers (four and six, respectively). Notably, the LASSO model fell below the AU-ROC threshold (0.68 < 0.75), leading to potentially incorrect predictions. For instance, LASSO detected an increased abundance of an unclassified Clostridiales genus in tumor samples, whereas both LEfSe and DAA detected its enrichment in healthy controls. Boruta achieved higher accuracy (AU-ROC = 0.89 ≥ 0.75), but its top biomarkers (based on Gini index values) were less significant according to LEfSe and DAA (based on p and q values, respectively).

3.2. Evaluations of the CRC Shotgun Metagenomics Dataset

3.2.1. Abundance and Diversity Analysis

In line with the original study by Zeller et al. [28], our KMB analysis confirmed an enrichment of the phyla Fusobacteriota (synonym Fusobacteria), Pseudomonadota (synonym Proteobacteria), and Bacteroidota (synonym Bacteroidetes), alongside a depletion of the phyla Actinomycetota (synonym Actinobacteria) and Bacillota (synonym Firmicutes) in CRC samples (Supporting Information 1: Figure S1A). These results partially align with Kostic et al., who likewise reported an enrichment of Fusobacterium and a depletion of Firmicutes in CRC samples but observed a depletion of the phylum Bacteroidetes, which contrasts with Zeller et al. and our own findings. Alpha diversity analysis with the Shannon and Simpson indices (Supporting Information 1: Figure S1B) revealed a slight but nonsignificant decrease (p > 0.05) in community diversity in tumor samples, which corresponds with our previous analysis of 16S rRNA taxonomic profiles. However, the observed OTUs index showed a small but significant (p ≤ 0.05) increase in richness in tumor samples.

3.2.2. Performance of the KMB Framework

Next, we evaluated the performance of the KMB framework and its individual methods against the original biomarker predictions by Zeller et al. Our approach predicted a total of 26 bacterial species as differentially abundant between healthy and tumor groups (see Table 3). Of these, 15 species were also part of the consensus signature proposed by Zeller et al., which consisted of 22 marker species collectively associated with CRC. Notably, the KMB framework accurately predicted the enriched presence of Fusobacterium nucleatum in tumor samples. The role of this periodontal pathogen in CRC progression has been extensively studied in recent years, leading to improved insights into its functional mechanisms. For instance, Zhu et al. [71] recently performed experiments in mice and human subjects, demonstrating the increased ability of F. nucleatum to invade tumor cells and to bind to a specific tumor-expressed protein, DHX15. These findings further support the hypothesis of a distinct interaction between host genotype and gut microbiome in CRC. Furthermore, several bacterial species consistently associated with CRC tumors were confirmed by our KMB framework and are supported by epidemiological evidence as well as mouse model experiments (see [72] for an overview). Amongst these species were Porphyromonas asaccharolytica [73], Peptostreptococcus stomatis [74], and Bacteroides fragilis [75], all of which were predicted by KMB and were included in Zeller et al.’s consensus signature. Conversely, Parvimonas micra, known to promote CRC tumorigenesis by upregulating cell-associated cytokines [76], was predicted by KMB among the biomarkers but was not part of Zeller et al.’s consensus signature. However, Zeller et al. did report a significantly increased abundance of P. micra in CRC patients compared to controls (p = 2.06e-02).

3.2.3. Contributions of the Individual Biomarker Prediction Methods to the KMB Model

Regarding the contribution of individual biomarker prediction methods to the consensus model, all 26 predictions were supported by the DAA method. The LASSO method, however, demonstrated the highest overlap with Zeller et al.’s findings, with 12 out of 18 predictions aligning with their results. While this consistency with Zeller et al. is partly expected, given that they used LASSO as well (albeit with a different procedure), it is worth noting that the LASSO model in the KMB framework exceeded the set AU-ROC threshold value (AU-ROC = 0.81 ≥ 0.75). The model’s high performance is also reflected by the high AU-PRC score (Figure 3). These results highlight LASSO’s significant contribution to our consensus approach on the shotgun-based taxonomic profiles. Notably, LASSO and DAA were the only methods to predict the enriched presence of F. nucleatum and B. fragilis in CRC samples. Additionally, both methods identified significant shifts in four other species (Porphyromonas uenonis, Campylobacter rectus, Campylobacter gracilis, and Leptotrichia hofstadii) that were neither reported by the Boruta and LEfSe methods nor part of the consensus signature of Zeller et al. However, Zeller et al. reported significant differences in abundance (p < 0.01) of L. hofstadii and C. rectus between CRC and healthy individuals.

Interestingly, P. uenonis was found to be significantly decreased in CRC in the recent study by Zhang et al. [77], which analyzed 705 fecal samples across six metagenomic sequencing cohorts with diverse geographical and ethnic backgrounds. Additionally, a meta-transcriptome study by Warren et al. [78] reported coenrichment of Fusobacterium, Leptotrichia, and Campylobacter spp. in CRC tissues. At the species level, the largest number of uniquely mapping sequences corresponded to F. nucleatum, L. hofstadii, and C. showae (phenotypically similar to C. rectus). The authors did not find unique sequence matches to C. jejuni, which was also absent in KMB’s predictions.

The LEfSe method demonstrated lower consistency with DAA on shotgun-based profiles compared to the 16S rRNA gene-based profiles of the Kostic et al. study. LEfSe detected only 12 biomarkers, 6 of which were found in Zeller et al.’s study. However, 5 of LEfSe’s top 6 predicted biomarkers (LDA value > 3) corresponded to Zeller et al.’s biomarkers. Similarly, the Boruta method contributed significantly, with 8 out of its Top 10 OTUs included in the final KMB selection, 5 of which were also found in Zeller et al.’s study. Unlike LASSO, Boruta accurately predicted the depletion of E. rectale and the enrichment of P. micra in CRC samples. Importantly, the accuracy of the Boruta model exceeded the set AU-ROC threshold value (AU-ROC = 0.91 ≥ 0.75), and the high AU-PRC indicates a strong model performance (Figure 3b). These findings highlight the robustness of the method ensemble implemented in the KMB framework to accurately predict microbial signatures associated with disease.

3.3. Evaluations of the AD 16S rRNA Gene Dataset

3.3.1. Abundance and Diversity Analysis

At the phylum level, our taxonomic classifications (abundances) mirrored those reported by Ling et al. [44]. Of the 10 phyla classified by the authors (using a minimum sequence occurrence threshold of > 0.005%), 8 were consistent with the Top 10 most abundant phyla classified in our analysis (Supporting Information 2: Figure S2A). However, direct comparison of exact percentages was not possible due to the absence of numerical data in the Ling et al. study. The most abundant phyla were Firmicutes, Proteobacteria, Actinobacteria, Bacteroidetes, and Verrucomicrobia, which aligns with earlier findings by Zhuang et al. [79].

Our analysis additionally detected a significant presence of the phylum Euryarchaeota, which was not found in the Ling et al. study due to the removal of sequences identified as archaea. Similarly, the phylum Streptophyta was absent from the Top 10 list of the Ling et al. study, although our analysis reported its presence at very low abundance. Other phyla, such as Candidatus Saccharibacteria and Lentisphaerae, were classified as the 11th and 12th most abundant phyla in our study, whereas they were ranked 8th and 10th in the Ling et al. study. At the genus level, our classifications also largely corresponded with those reported by Ling et al., confirming the high inter-individual variation observed by the authors (Figure 4a). While the relative proportions of bacterial genera could not be directly compared (see above), 9 out of the Top 10 most abundant genera in our analysis were included in the Top 19 most abundant genera reported by Ling et al. Additionally, our analysis found a high relative abundance of the genus Gemmiger. This is likely attributable to the difference in the reference databases used: Ling et al. relied on the Greengenes database [80], while our classifications relied on the RDP database. The Greengenes database has not been updated since 2013, and recent research by Campos et al. [81] on chicken cecal microbiota composition highlighted that the Greengenes classification of Faecalibacterium may be outdated, specifically for F. prausnitzii, which was reclassified as distinct from the closely related Gemmiger/Subdoligranulum cluster [82]. Consequently, Campos et al. reported that sequences classified as Faecalibacterium using Greengenes were instead classified as Gemmiger with the RDP database. Interestingly, Ling et al. classified Subdoligranulum amongst their Top 19 abundant genera but also identified the genus Gemmiger as a biomarker using LEfSe. This finding is unsurprising, though, as LEfSe relies on the RDP taxonomy. Overall, these results underscore the significant impact a database can have on analysis outcomes.

We also assessed species richness using three alpha diversity indices (observed OTUs, Shannon, and Simpson), which were also employed by Ling et al. In our analysis, all three indices indicated a reduction in bacterial diversity among AD patients (Supporting Information 2: Figure S2B), corroborating the results of Ling et al., who also reported higher diversity in control samples. However, none of the indices in our analysis was statistically significant (p > 0.05), whereas Ling et al. reported significant differences for all three indices (p ≤ 0.05). This discrepancy aligns with a recent review by Zhu et al. [83], which evaluated 14 AD case-control studies and found that microbial diversity changes were often not significant. To illustrate, eight studies using the Shannon index reported no significant differences between groups, while four studies reported a significant decrease in diversity among AD patients. The beta diversity analysis, both in our study and in that of Ling et al., identified a significant variability in community composition between controls and AD patients (Figure 4b). This finding aligns with the broader consensus; 12 out of 14 studies in the Zhu et al. review also reported significant differences in beta diversity between these groups.

3.3.2. Performance of the KMB Framework: Enriched Microbial Genera in Healthy Controls

Subsequently, we compared the biomarker predictions of the KMB analysis framework against those of the original study. Consistent with the findings of Ling et al., a substantial number of biomarkers were associated with microbiota dysbiosis in AD (Table 4). At the genus level, KMB predicted 33 biomarkers. Of these, 11 were supported by LEfSe with LDA scores > 3, of which 9 overlapped with the predictions of Ling et al., who applied the same LDA threshold in their LEfSe analysis. Among the overlapping genera enriched in healthy controls were several butyrate-producing bacteria, such as Faecalibacterium, Roseburia, and Coprococcus, all members of the family Lachnospiraceae. Butyrate is a SCFA produced during the fermentation of fibers and resistant starch, and it has garnered increasing attention for its role in gut-brain axis communication [84, 85]. Specifically, SCFAs influence two major signaling pathways: binding to G protein-coupled receptors on enteroendocrine cells and inhibition of histone deacetylases. These pathways can directly or indirectly affect central nervous system functioning. Numerous animal and human studies have demonstrated a reduced abundance of SCFA-producing bacteria in AD patients [79, 86]. Butyrate in particular has been shown to promote gastrointestinal health and to reduce inflammation by inhibiting proinflammatory cytokines [87]. Ling et al. observed a negative correlation between these three butyrate-producing genera and the levels of the proinflammatory cytokine TNF-α and the chemokine IP-10. The KMB framework identified Fusicatenibacter and an unclassified Lachnospiraceae as biomarkers, supported by LEfSe (LDA scores > 3), LASSO, and DAA, though these were not detected in the original study. Notably, Fusicatenibacter and other members of Lachnospiraceae are also butyrate producers, with evidence of their depleted levels in AD patients reported across multiple studies. For instance, a meta-analysis by Hung et al. [88] encompassing 11 studies (N = 805; with 427 AD patients and 378 controls from United States and Chinese cohorts) and a study by Yildirim et al. [89] in a Turkish cohort (N = 127; with 47 AD patients, 27 mild cognitive impairment patients, and 51 nondemented controls) both observed a negative association between Fusicatenibacter and AD. A study of a Thai population by Wanapaisan et al. [90] reported enriched presence of Lachnospiraceae (p = 0.001) and Fusicatenibacter (p = 0.0007) in healthy controls compared to AD and MCI patients, though the cohort size was relatively small (N = 52).

Furthermore, KMB identified additional butyrate-producing bacteria enriched in healthy controls, supported by LEfSe, albeit with lower LDA scores between 2 and 3. These include Lachnospira, Butyrivibrio, Pseudobutyrivibrio, and Anaerobium (all Lachnospiraceae) and Butyricicoccus (family Oscillospiraceae). While these genera were identified as significant biomarkers by DAA and LASSO, only Butyricicoccus was reported in the Ling et al. study. KMB also confirmed a significant depletion of Eubacterium in AD patients (supported by DAA, LASSO, and LEfSe with LDA > 3). Notably, a study by Cattaneo et al. [91] reported a lower abundance of the anti-inflammatory species Eubacterium rectale (p < 0.001) and a reduction of the anti-inflammatory cytokine IL-10 in patients with cognitive impairment and brain amyloidosis. A more recent study by Haran et al. [92], with shotgun metagenomics analysis on a cohort of N = 76 (25 AD patients and 51 controls), similarly confirmed the decrease of key butyrate-producing bacteria (two members of Butyrivibrio and three members of Eubacterium, including E. rectale). Interestingly, in patients without dementia, the study observed an increase in genes encoding butyrate-producing enzymes, correlating with the induction of the anti-inflammatory P-glycoprotein pathway. However, the positive correlation involving the pro-inflammatory taxon Escherichia/Shigella reported by Cattaneo et al. could not be confirmed by the KMB analysis, nor by the studies of Ling et al. and Haran et al.

3.3.3. Performance of the KMB Framework: Enriched Microbial Genera in AD Patients

As regards the identification of enriched microbial genera in AD patients, the KMB framework confirmed four genera that were identified as such by Ling et al. Two of these genera, Akkermansia and Bifidobacterium, are typically regarded as health-promoting microbes and are known butyrate-producers. The authors, however, associated the increased abundance of these genera in AD patients with their capacity to produce lactate and the SCFA propionate. The literature reports conflicting results on the association of the genus Bifidobacterium with the AD spectrum. For instance, Vogt et al. [10], in a study of N = 119 subjects (94 controls and 25 AD patients), identified Bifidobacterium as a driver of gut dysbiosis and found it significantly depleted in AD patients from the United States. Conversely, the meta-analysis by Hung et al., encompassing a larger combined cohort, found significantly increased levels of Bifidobacterium in AD patients (p < 0.001). As Ling et al. suggested, it is plausible that different Bifidobacterium species are involved, with different effects on AD pathology. Supporting this, Haran et al., in a shotgun metagenomics sequencing study, observed a decrease in two Bifidobacterium species (B. bifidum and B. longum) and an increase in Akkermansia muciniphila in AD patients. While these findings corroborate Ling et al.’s hypothesis, a substantially larger number of shotgun metagenomics samples would be required for definitive conclusions.

The KMB framework also confirmed the two other genera identified by Ling et al.: Enterococcus and Eggerthella. Both have been implicated by several studies as potential biomarkers for AD diagnosis. For example, Underly et al. [93] demonstrated in an in vitro experiment with rat cell cultures that Enterococcus faecalis could generate early neurofibrillary epitopes, leading to abnormal tau phosphorylation. Tau pathology and β-amyloid deposition are the two major hallmarks of AD. After infecting rat cortical neuron cell cultures with E. faecalis, the authors observed a strong increase in the reactivity of monoclonal antibodies Alz-50 and CP13, which specifically target tau phosphorylation. These findings align with earlier research linking oral health and AD, and with the potential role of E. faecalis in the development of chronic periodontal disease [94] and its capacity to translocate to the brain, potentially causing abscess [95]. On the contrary, a recent study by Hou et al. [96] involving a cohort of N = 77 (30 AD patients and 47 healthy controls) indicated a higher abundance of Enterococcus in controls (p ≤ 0.05), attributing this to the genus’s ability to produce SCFAs. While most Enterococcus species produce acetate as the main SCFA, butyrate production is species- and strain-specific. For instance, strain Enterococcus durans M4-5 can produce butyrate [97], whereas E. faecalis cannot. However, some E. faecalis strains can produce propionate in addition to acetate. Haran et al. did not report significant variations in Enterococcus species, but reported three SCFA-producing bacteria (Odoribacter splanchnicus, Eubacterium eligens, and Eubacterium rectale) as the most discriminative markers for AD. Importantly, O. splanchnicus primarily produces acetate and propionate, whereas the two Eubacterium species are primarily acetate producers and generally lack the ability to produce butyrate.
The KMB framework also confirmed increased levels of Eggerthella lenta in AD patients, a finding supported by Balakrishnan et al. [47], who demonstrated that gavaging mice with E. lenta led to reduced fecal butyrate levels. These observations support Ling et al.’s hypothesis that gut dysbiosis in AD patients is characterized by a shift from butyrate-producing to lactate- and propionate-producing bacteria, potentially extending to acetate-producing bacteria. However, since the studies of Ling et al. and Hou et al., as well as most other AD gut microbiome studies, were based on 16S rRNA gene V3-V4 sequencing data, species-level resolution remained unattainable. Finally, the KMB analysis also identified a significant increase in Erysipelatoclostridium in AD patients, a genus not detected by Ling et al. This genus has recently been proposed as a biomarker for AD by Xi et al. [98], though with contrasting findings (i.e., a significant increase in healthy controls). The genus is hypothesized to influence levels of the cytokine IFN-γ, potentially impacting AD pathogenesis. While there is limited evidence linking Erysipelatoclostridium dysbiosis to AD, some strains of the species E. ramosum (previously Clostridium ramosum) are known to cause invasive infections, particularly in immunocompromised patients [99].

3.3.4. Summarizing Results on the AD Dataset

In summary, the KMB framework confirmed several biomarkers identified by Ling et al., demonstrating an analogous shift from butyrate producers to lactate, propionate as well as acetate producers in AD samples. The KMB framework also identified some interesting novel genera as potential biomarkers, many of which have been reported in other studies, albeit with contrasting findings. These discrepancies are likely to stem from species- and strain-specific characteristics in metabolite-production, underscoring the need to further investigate using shotgun metagenomics sequencing data and larger cohorts.

3.4. Evaluations of the PD Shotgun Metagenomics Dataset

3.4.1. Abundance and Diversity Analysis

At the phylum level, our abundance analysis confirmed the findings of Qian et al. [48], who identified Bacteroidetes, Firmicutes, Proteobacteria, and Actinobacteria as the most abundant bacterial phyla in both the control and disease groups (Supporting Information 3: Figure S3A). Moreover, healthy controls were characterized by a higher abundance of the phylum Bacteroidetes and lower abundances of the phyla Firmicutes, Proteobacteria, and Actinobacteria compared to the PD group, in line with the findings of Qian et al. Furthermore, we observed a high abundance of the viral phylum Uroviricota, with an increased level in the PD group. The original study by Qian et al. also identified an enrichment of viruses in PD patients, although the authors could not assign a more specific taxonomy and only reported an unclassified/no name phylum for viruses. This could be related to the fact that the authors used MetaPhlAn [58] to assign taxonomies based on a specific set of marker genes identified from whole genomes. Until recently, these genomes included only a limited number of viruses (approximately 3500 viral genomes for MetaPhlAn 2.0 and MetaPhlAn 3.0). In contrast, our analysis used Kraken2-Bracken software for taxonomic classification, combined with a more recent export of whole genome sequences from the NCBI RefSeq database including 11,562 viral genomes.

We then measured alpha diversity using the observed OTU richness, Shannon, and Simpson indices. All three indices indicated increased species diversity in PD samples, with significant p values for the observed OTUs and Shannon indices (p = 0.00586 and p = 0.00447, respectively, both ≤ 0.05; Supporting Information 3: Figure S3B). Additionally, Qian et al. found a significantly increased Shannon index in PD patients (p = 0.0084) based on a profile of 1,118,355 gut microbial genes. These findings are consistent with the current consensus in the field. In a recent review by Tan et al. [49], which compared the results of 30 independent case-control studies, 14 studies reported no significant difference, while nine studies observed an increase in the diversity in PD patients. Six other studies did not report on diversity measures, and one study indicated a decreased diversity in PD patients.
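As an illustration of the three alpha diversity measures named above, the following minimal Python sketch computes them for a single sample from a vector of taxon counts. It assumes the natural logarithm for the Shannon index and the Gini-Simpson form (1 − Σp²) for the Simpson index; the text does not state which variants were used, and the function names are illustrative rather than part of the KMB framework.

```python
import math
from typing import Sequence


def observed_richness(counts: Sequence[float]) -> int:
    """Number of taxa (OTUs/species) with a non-zero count in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with p_i > 0.

    Assumption: natural logarithm; other tools may use log base 2.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts: Sequence[float]) -> float:
    """Simpson index in its Gini-Simpson form, 1 - sum(p_i^2).

    Assumption: this form is used; some tools report sum(p_i^2) or its inverse.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Toy sample (not study data): four equally abundant taxa and one absent taxon.
toy_counts = [10, 10, 10, 10, 0]
richness = observed_richness(toy_counts)
shannon = shannon_index(toy_counts)
simpson = simpson_index(toy_counts)
```

Group-level comparisons such as those reported above would then apply a statistical test to the per-sample index values of the control and PD groups; the choice of test is not covered by this sketch.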

3.4.2. Performance of the KMB Framework: Comparison With the Original Study

Next, we predicted microbial markers using our KMB framework and compared the results to the original findings by Qian et al. In total, 22 species were identified by KMB, all of which were enriched in PD patients (Figure 4). Of these, 10 were supported by LASSO, although the model quality was just below the set AU-ROC threshold value (AU-ROC = 0.74 < 0.75). In contrast, the accuracy of the Boruta model was above the threshold value (AU-ROC = 0.89 ≥ 0.75), yet none of the 10 most contributing species predicted by Boruta was found to be significant by another method. As such, the Boruta markers were not included in the KMB list. Furthermore, 15 biomarkers were supported by significant LEfSe predictions, of which 4 had an LDA score > 3. Surprisingly, only one species identified by KMB (Gordonibacter pamelaeae) matched with the 12 biomarkers identified by Qian et al. Additionally, KMB identified a second species from the same genus, G. urolithinfaciens, as a significant biomarker. Both species were supported by DAA and LASSO, although the features of the latter model were just below the significance level. While the specific role of Gordonibacter in the development of PD is not well understood, some negative correlations with PD have been suggested based on the capacity of G. pamelaeae (strain DSM 19378^T) and G. urolithinfaciens (strain DSM 27213^T) to produce intermediary urolithins. Urolithins are anti-inflammatory molecules produced by specific gut microbes upon the intake of dietary polyphenols [100] and may possibly enhance gut-barrier integrity. Consequently, urolithin-producing metabotypes are generally considered health-promoting bacteria. Moreover, there is some evidence for a neuroprotective role of urolithins, particularly Urolithin A, though the mechanisms of action have mostly been shown in rodent-based research [101]. 
Nonetheless, Romo-Vaquero et al. [102], who recently investigated the association between urolithin metabotypes, gut dysbiosis, and disease severity in PD patients (N = 169, 52 PD and 117 HC), did not find a significant difference in Gordonibacter abundance between HC and PD. The genus was found to be decreased in Severe PD compared to Mild PD, although at low significance (Gordonibacter had the lowest LDA score of the genera enriched in PD). In contrast, Lubomski et al. [103] found decreased levels of Gordonibacter in PD patients (N = 128, 74 PD and 74 HC) 12 months after initiation of levodopa–carbidopa intestinal gel (LCIG) therapy. Although the number of LCIG patients was small (N = 10), this finding aligns with earlier studies that detected associations between gut dysbiosis in PD and the use of levodopa–carbidopa, the primary medication to treat PD. Importantly, Maini Rekdal et al. [104] and Kessel et al. [105] demonstrated that bacteria harboring tyrosine decarboxylase genes, particularly Lactobacillus and Enterococcus faecalis, decreased the efficacy of levodopa–carbidopa. It may, therefore, be hypothesized that species of the genus Gordonibacter could also interact with levodopa metabolism. In support of this hypothesis, Maini Rekdal and collaborators found that a specific strain from the family Eggerthellaceae, Eggerthella lenta A2, is involved in the reduced efficacy of levodopa–carbidopa. The authors showed that E. lenta A2 harbors a molybdenum-dependent dehydroxylase (Dadh) which catalyzes the conversion of dopamine to m-tyramine. Notably, Gordonibacter is part of the same family, and, from a BLASTP search of close family relatives, Maini Rekdal et al. found a homologous dopamine dehydroxylase in two unclassified Gordonibacter strains, An232A and An230, with amino acid identities of 94% and 93%, respectively. However, the same study indicated that two other strains (Gordonibacter sp.
28C and Gordonibacter pamelaeae 3A) were unable to convert dopamine to m-tyramine. Clearly, deeper investigations are required to understand the association between decreased levels of Gordonibacter, PD, and levodopa–carbidopa administration. Nonetheless, current evidence seems to favor a possible role of molybdenum-dependent dehydroxylases, rather than a beneficial effect due to the capacity of Gordonibacter sp. to produce urolithin. Additionally, the KMB framework identified another member of the family Eggerthellaceae, Slackia isoflavoniconvertens, as significantly enriched in PD patients. This species encodes several molybdopterin-oxidoreductases (UniProt accessions A0A3N0IGL9, A0A3N0IK48, and A0A3N0I7E1), which belong to the same family as Dadh (molybdopterin-containing oxidoreductases). Finally, Qian et al. found a negative correlation between Streptococcus salivarius and levodopa equivalent dose, along with an overall decreased presence in PD patients. This association could not, however, be reproduced by the KMB framework, and this species was not mentioned in the studies of Maini Rekdal et al. [104] and Kessel et al. [105].

3.4.3. Performance of the KMB Framework: Comparison With Other Shotgun Metagenomics Studies

Next, we extended our comparison to eight other studies included in the review by Tan et al. [49], which investigated dysbiosis at the species level. Of these, only the study by Bedarf et al. [50] used shotgun metagenomics sequencing, while the other seven studies [51, 52, 106–110] based their analysis on 16S rRNA gene data, albeit at a higher resolution. As shown in Table 5, the species Akkermansia muciniphila and Megasphaera elsdenii were confirmed by one or more other studies reviewed by Tan et al. [49]. Notably, these species exhibit low fold changes and are supported by both DAA and LEfSe within the KMB framework (with LDA scores > 3). The detected enrichment of Akkermansia muciniphila in PD is consistent with findings from four other biomarker studies [50–52, 110]. At the genus level, Akkermansia is also largely associated with PD, as illustrated by Tan et al. [49], who noted that 14 studies reported the same result. Although Qian et al. did not identify Akkermansia spp. (nor A. muciniphila) as a significant microbial biomarker based on MetaPhlAn taxonomic analysis, they did find two MGS clusters annotated to the genus Akkermansia and one specifically to the strain A. muciniphila ATCC BAA-835. In summary, Akkermansia spp. and particularly A. muciniphila are considered among the most important biomarkers for PD diagnosis. However, as previously noted, this finding is quite remarkable given that Akkermansia is generally considered a health-promoting bacterium. In particular, A. muciniphila promotes the integrity of the gut barrier (mucin layer) and is involved in immune response modulation [111, 112]. The beneficial effect of A. muciniphila has also been demonstrated in metabolic disorders such as obesity [113] and type 2 diabetes [114]. At the same time, increased abundance of Akkermansia spp. has been reported in several neurological diseases, including multiple sclerosis [115] and AD (see above).
These findings suggest the potential existence of a common mechanism underlying the progression of various neurological diseases. To support this hypothesis, Duvallet et al. [13] showed that many genera are typically associated with multiple diseases, indicating that involvement of bacteria is often not disease specific. Although several studies link the increased presence of Akkermansia to a modified immune response (Zhai et al. [116]) and constipation (Vandeputte et al. [117]), more investigations are required to understand the exact role of the different Akkermansia (sub)species in the progression of PD and other neurological disorders. Moreover, caution must be taken when interpreting these results, as confounding factors such as drug treatment and stool frequency can introduce specific microbial signatures [118].

Megasphaera elsdenii, which was identified by KMB and Tan et al. [110] as a potential biomarker, is also a commensal human gut microbe, although typically found in low abundance. This species is known to ferment lactate into SCFAs, including butyrate (Shetty et al. [119]), and owing to this capability, it has been used as a probiotic to prevent ruminal acidosis [120]. Another important characteristic of M. elsdenii is its ability to produce high levels of hexanoic acid [121]. Notably, a recent study by Abdik and Çakır [122] predicted hexanoic acid as a candidate metabolite biomarker for PD. Specifically, the authors aligned 13 postmortem PD transcriptome datasets from the substantia nigra (the brain region most affected by PD) against the Human-GEM metabolism model using the popular TIMBR (transcriptionally inferred metabolic biomarker response) algorithm, originally developed by Blais et al. [123]. Based on a cohort of N = 263 (141 PD and 112 HC), the authors found increased production of hexanoic acid (hexanoylcarnitine) in more than 75% of the comparisons performed. In addition to M. elsdenii, the genus Megasphaera also includes other hexanoic acid-producing species, such as M. hexanoica, which was also predicted as a biomarker by our KMB framework. The stronger association of M. elsdenii with PD compared to M. hexanoica, as indicated by the lower p and q values in LEfSe and DAA analyses, may be explained by the fact that M. elsdenii produces higher amounts of hexanoic acid and can utilize a broader variety of carbon sources (e.g., sucrose, glucose, maltose, and fructose) than M. hexanoica, which primarily ferments fructose. In contrast, a recent study by Ren et al. [108], included in the review by Tan et al. [49], identified increased levels of the species Megasphaera micronuciformis in healthy controls. However, the LDA score with LEfSe was notably low (close to 2). Moreover, M.
micronuciformis is incapable of producing acids from carbon sources, nor can it ferment the nonconventional carbon source gluconate with the production of gas [124], a characteristic typical of other Megasphaera species.

Of the 17 other species that were identified by the KMB framework, six belonged to the family Oscillospiraceae (Vescimonas coprocola, Bittarella massiliensis, Angelakisella massiliensis, and three Pseudoflavonifractor species: P. gallinarum, P. phocaeensis, and P. capillosus). All of these biomarkers were supported by significant DAA and LEfSe predictions. Notably, a meta-analysis conducted by Romano et al. [125] reanalyzed 10 available 16S rRNA gene-based gut microbiome datasets and found a significant increase of various Oscillospiraceae genera and species in PD patients. While relatively little is known about the role of these species in the human gut microbiome, partly due to the difficulty of cultivating these bacteria, substantial evidence links them to low BMI and constipation [126], both key indicators in PD. The biomarker identified with the highest LEfSe LDA score was Vescimonas coprocola (LDA = 3.742), a species recently isolated from the human gut [127]. Unsurprisingly, the literature provides limited information about its possible associations with PD, although one possible link might be an increased plasmid abundance in PD patients [128]. The Pseudoflavonifractor genus was found to be elevated in PD patients undergoing LCIG therapy [125] based on a small cohort of N = 31 (21 PD and 10 HC). Despite the small sample size, this finding suggests a possible link between Pseudoflavonifractor spp. and levodopa metabolism, as previously observed for members of the family Eggerthellaceae. For example, the genome of P. capillosus ATCC 29799 encodes a molybdopterin oxidoreductase enzyme (UniProt accession A6NTH0), which shows strong sequence similarity to Dadh and is classified as part of the molybdopterin-containing oxidoreductase family. Regarding the other species identified by the KMB framework, while the studies reviewed by Tan et al.
[49] did not report specific evidence of their involvement in PD-related gut microbial dysbiosis, the families they belong to (Streptococcaceae, Clostridiaceae, Lactobacillaceae and Desulfovibrionaceae) were found to be significantly increased in PD patients. An exception was Pusillimonas faecalis, for which an opposite trend was observed by Vascellari et al. [109], who reported depleted levels of the family Alcaligenaceae in PD patients. Of note, P. faecalis was very recently isolated from human feces [127]. For the remaining families, Pseudonocardiaceae, Paenibacillaceae, Acidaminococcaceae, Lactobacillaceae, and Eubacteriales incertae sedis, Tan et al. [49] found no literature evidence supporting decreased or increased abundance in PD patients compared to controls. However, as remarked earlier, gut dysbiosis is often characterized by changes in very specific species and/or strains. To better understand the contribution of individual species and strains and the functional mechanisms underlying disease progression, analyses of datasets with higher taxonomic resolution (e.g., from shotgun metagenomics) and of larger cohorts for statistical rigor are, therefore, recommended.

4. Conclusions

With this study, we introduced and assessed the performance of our KMB prediction framework. The prediction framework combines various ML and statistical methods into a unified strategy. Using four published microbiome datasets from case-control studies of CRC, AD and PD patients, we demonstrated that, in general, our findings were consistent with those of the original studies. These include microbial abundances, diversity, and global biomarker signatures associated with health and disease conditions. In instances where our results deviated from the original studies or indicated different significance levels, our findings are typically more in line with the broader consensus in the field. For example, while Kostic et al. did not observe significant differences in richness between healthy and CRC tumor groups, our analysis observed a reduced diversity in the CRC tumor group. This observation aligns with several studies on independent datasets, including the study by Zeller et al. [28], which was also evaluated here. At lower taxonomic levels, the KMB framework successfully confirmed the most critical microbial drivers of the observed dysbiosis, although it did not reproduce several other biomarkers predicted in the original studies. The level of agreement varied depending on the dataset; however, results were significantly more consistent when using the same taxonomic profiles as those employed in the original studies, compared to reclassification of raw sequence data with updated databases. For the two CRC datasets (where taxonomic profiles were used as input), the KMB framework confirmed most biomarkers found by the original authors. Importantly, the increased presence of Fusobacterium and its subspecies F. nucleatum in CRC samples was accurately detected. This genus and species are consistently associated with disease progression, although their functional mechanisms appear to be primarily strain- and clade-specific [129]. 
Additionally, we identified a few new biomarkers not reported in the original studies by Kostic et al. [46] and Zeller et al. For example, the genus Eubacterium (increased in controls) and the species Porphyromonas uenonis (increased in CRC patients) were identified with high significance, as also observed in several other studies.

On the 16S rRNA gene-based AD dataset from Ling et al. [44], for which we used the raw sequence data as a starting point, our KMB framework accurately predicted the previously observed shift from butyrate-producing bacteria in healthy control samples to lactate- and propionate-producing bacteria in the AD samples. Additionally, our framework identified an increased presence of acetate-producing bacteria in the AD samples, consistent with findings from other studies. At the genus level, increased levels of Akkermansia and Bifidobacterium, along with decreased levels of Faecalibacterium, were identified in AD patient samples. These three genera were the most significant hits in the original study. While Akkermansia and Bifidobacterium are generally considered health-promoting, their association with AD progression is linked to their predominant production of the SCFAs propionate and lactate. Conversely, the increased presence of Faecalibacterium in healthy controls might be related to its anti-inflammatory properties and high butyrate production. In addition, our framework predicted the increased presence of other butyrate producers in control samples, such as Fusicatenibacter and members of the family Lachnospiraceae. Although these were not included in the list of 24 biomarkers identified by Ling et al., they support the overall conclusion that healthy controls are characterized by a greater abundance of butyrate-producing bacteria.

The results from the PD dataset of Qian et al. [48] showed the least consistency with the original study. Only one association was reproducible. This low agreement is likely due to the different methodology: our study used Kraken2-Bracken and a more up-to-date and comprehensive database (NCBI RefSeq) for taxonomic classification of reads. Nonetheless, the increased presence of Gordonibacter pamelaeae in PD samples, found by both strategies, warrants particular attention as this species is generally considered a health-promoting microbe. Earlier studies have linked the (neuro)protective role of this bacterium to its capacity to produce urolithins, microbial metabolites derived from ellagic acid and ellagitannin [100, 101]. However, a recent study by Lubomski et al. [103] observed decreased levels of Gordonibacter spp. in PD patients treated with LCIG, suggesting a potential interaction between these bacteria and PD medication. This aligns with similar findings by Maini Rekdal et al. [104], who reported that another member of the family Eggerthellaceae (Eggerthella lenta A2) influenced medication efficacy through a molybdopterin-containing oxidoreductase. Notably, Gordonibacter pamelaeae and Slackia isoflavoniconvertens, an additional biomarker identified by the KMB framework, encode a homologous protein with high sequence similarity. While most studies investigating the impact of PD medication on microbiota composition use stool samples for taxonomic analysis, it is worth noting that the small intestine is the primary site of drug absorption. For example, a recent rat study by van Kessel et al. [130] demonstrated a significant effect of PD medication on the microbiota composition and motility of the small intestine.
Although in vivo studies are limited due to difficulty in accessing this region, in vitro models of the small intestine, such as the SHIME system [128] and SIFR technology [131], could be valuable for exploring the interactions between the microbiota, ingested medication, and other compounds such as probiotics [132].

Given the limited overlap between our findings and those of Qian et al., we compared the results of our approach with those reported in the review by Tan et al. [49], which encompassed eight studies reporting species-level biomarkers. This comparison confirmed two other species: Akkermansia muciniphila and Megasphaera elsdenii. In particular, the genus Akkermansia and the species A. muciniphila are strongly associated with PD. Moreover, Akkermansia spp. have also been shown to negatively impact other neurological disorders such as AD, as outlined above, suggesting a common mechanism of action. The increased presence of M. elsdenii in PD patients seems to be linked to the production of hexanoic acid, a characteristic shared with other Megasphaera species.

Other species-specific associations identified by the KMB framework could not be confirmed in the literature. However, this is unsurprising given the limited availability of shotgun metagenomics study data on PD patients. Notably, Tan et al.’s review included only two such studies, one of which was that of Qian et al., which we used as an evaluation dataset. Furthermore, in line with our findings, Tan et al. observed that in their review of 30 studies, only one-quarter of the reported biomarkers were reproducible across studies. Similarly, Chandra et al. [133] reviewed several studies linking the gut microbiome to AD and found limited consensus on the alterations of bacterial taxa in AD patients. A major reason for this limited reproducibility, as noted by these authors, lies in the relatively small patient cohorts used for investigations. Additional and larger shotgun metagenomics case-control studies are needed to detect specific microbial species and strains (including viruses and fungi) involved in health and disease. Such studies will support the understanding of the functional mechanisms underlying these conditions. Our findings also suggest that bacteria generally considered health-promoting can be positively associated with disease states, indicating that beneficial properties observed in one condition do not necessarily apply universally, as mechanisms of action are often strain-specific, as underlined by Wallen et al. [134].

We here present a comprehensive consensus-based biomarker prediction framework, which reliably predicts biomarkers and is versatile in working with limited sample sizes and lower resolution (i.e., amplicon sequencing reads) datasets. From a methodological viewpoint, our consensus framework underscores the added value of combining DAA with ML algorithms. DAA provides an effective starting point for identification of significant hits, which can then be refined further through ML-based selection methods. Among the three ML methods evaluated, LEfSe makes the most substantial contribution to defining the final biomarker signature. LEfSe is currently the most widely used tool for microbiome marker discovery, as evidenced by over 14,000 citations (Google Scholar search, May 6, 2025). The tool combines LDA scores to estimate the effect size of features (here: taxonomies) with p values to assess the significance of microbial features. Our analysis confirms the strength of the LDA-based method in predicting microbial biomarkers. For some datasets, however, LASSO and Boruta identified crucial biomarkers that LEfSe missed. This was especially evident in the two shotgun metagenomics datasets analyzed in this study. For example, in the CRC dataset from Zeller et al., LEfSe correctly identified only seven biomarkers and mostly with low LDA scores (< 3), while LASSO correctly identified 12 biomarkers, including all four F. nucleatum subspecies. Similarly, in Qian et al.’s PD dataset, LASSO uniquely detected the increased presence of Gordonibacter pamelaeae in PD patients (AU-ROC: 0.74, just below the threshold of 0.75). Meanwhile, Boruta showed particular utility in Kostic et al.’s dataset by accurately predicting two genera (Ruminococcus and Alistipes) that were overlooked by both LEfSe and LASSO. The RF model of Boruta overall outperforms the linear regression model of LASSO in terms of accuracy and sensitivity, as highlighted by the AU-ROC and AU-PRC plots.
Boruta was able to create an accurate model (AU-ROC ≥ 0.75) for three of the four datasets analyzed, compared to two for LASSO. However, Boruta’s predictions were overall less consistent with other methods and the original studies. The limited predictive capacity of Boruta may be due to overfitting of the model, explained by two factors: small sample size and class imbalance. Imbalanced classification occurs when the distribution of classes in the training dataset is unequal, whereas ML algorithms generally assume equal class distributions. This class imbalance issue is relatively common in real-life classification, such as in microbiome datasets [135]. Although RF-based methods like Boruta excel at classifying microbiome and other omics datasets [37], in some instances, linear regression methods like LASSO are less prone to overfitting and less complex, and may therefore be preferred for smaller datasets. Data augmentation of the training data using in silico methods (e.g., TADA data augmentation software) could mitigate class imbalance effects and improve the RF model predictions [136]; however, care must be taken to avoid introducing biases.
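To make the model-quality criterion discussed above concrete, the following Python sketch computes the AU-ROC from class labels and model scores using the rank-based (Mann-Whitney) formulation and checks it against the 0.75 threshold mentioned in the text. The function names and the toy labels and scores are hypothetical and are not taken from the study; the authors' actual implementation is not described in this section.

```python
from typing import Sequence

# Threshold for an acceptable model, as stated in the text (AU-ROC >= 0.75).
AUROC_THRESHOLD = 0.75


def au_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC as the probability that a random case outscores a random control.

    labels: 1 for cases (e.g., patients), 0 for controls.
    scores: model scores, higher meaning more case-like.
    Ties between a case and a control count as half a win.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("Both classes must be present to compute AU-ROC.")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def model_passes(auroc: float, threshold: float = AUROC_THRESHOLD) -> bool:
    """True when the model meets the quality threshold."""
    return auroc >= threshold


# Toy example (hypothetical scores): two cases and two controls.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = au_roc(toy_labels, toy_scores)
```

Because AU-ROC is computed over case-control pairs, it is less sensitive to class imbalance than plain accuracy, although it does not by itself protect against the overfitting on small cohorts described above.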

In conclusion, no single method consistently performed best across all datasets, indicating that each method’s assumptions and selection criteria are suited to specific data and feature characteristics. Given the advantages but also the limitations of statistical and ML methods, consensus-based approaches are increasingly recommended to predict KMBs from metagenomic data. For instance, Nearing et al. [137] compared 14 differential abundance testing methods on 16S rRNA gene datasets, concluding that the variability in the output of individual methods highlights “an alarming reproducibility crisis” in microbiome research. They also found that a consensus-based approach substantially improved the robustness of the biological interpretations. Our results confirm these findings, demonstrating that our consensus-based KMB framework outperforms individual methods in effectively predicting KMBs. Reanalyzing the datasets of the original studies with our KMB framework revealed novel biomarkers that were not identified originally, suggesting that revisiting earlier datasets may yield valuable new biomarkers and insights. The proposed KMB framework is straightforward, using a simple yet effective decision tree. We demonstrated that KMB is a single-strategy framework, developed by combining DAA and ML methods, with important characteristics and statistical rigor, and that it requires minimal information other than the metadata of the dataset generated in a study.
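
The general idea of a consensus rule can be sketched in a few lines. The example below is a hypothetical illustration and not the decision tree of the KMB framework: it assumes that a taxon is retained when it is a significant DAA hit and is also selected by at least `min_ml_votes` ML methods, and that a method with a reported model AU-ROC below the threshold is ignored. The function name, parameters, and taxon labels are all invented for illustration.

```python
def consensus_biomarkers(daa_hits, ml_selections, model_auroc=None,
                         auroc_threshold=0.75, min_ml_votes=1):
    """Return {taxon: sorted list of supporting ML methods}.

    daa_hits:      set of taxa found significant by differential abundance analysis.
    ml_selections: dict mapping ML method name -> set of taxa it selected.
    model_auroc:   optional dict mapping method name -> AU-ROC of its model;
                   methods without an entry are not gated (assumption).
    """
    model_auroc = model_auroc or {}
    votes = {}
    for method, selected in ml_selections.items():
        # Skip methods whose model did not reach the accuracy threshold.
        if model_auroc.get(method, auroc_threshold) < auroc_threshold:
            continue
        for taxon in selected:
            if taxon in daa_hits:
                votes.setdefault(taxon, set()).add(method)
    return {taxon: sorted(methods) for taxon, methods in votes.items()
            if len(methods) >= min_ml_votes}


# Toy input with placeholder taxa (not results from the study).
daa_hits = {"taxon_A", "taxon_B", "taxon_C"}
ml_selections = {
    "LEfSe": {"taxon_A", "taxon_D"},
    "LASSO": {"taxon_A", "taxon_B"},
    "Boruta": {"taxon_C"},
}
model_auroc = {"LASSO": 0.74, "Boruta": 0.80}

print(consensus_biomarkers(daa_hits, ml_selections, model_auroc))
```

In this toy input, the LASSO selections are dropped because its model falls just below the threshold, and `taxon_D` is dropped because it is not a DAA hit. Raising `min_ml_votes` makes the rule stricter at the cost of sensitivity.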

The adaptability and robustness of the KMB consensus framework promise a valuable contribution to the advancement of precision medicine by improving diagnostic tools and comprehensive statistical methods. Indeed, future-proof ensemble strategies are crucial as new statistical and ML methods are developed. Our KMB framework integrates diverse data types and adapts to new statistical and ML tools. The approach facilitates associations between clinical phenotypes, microbiome functions, and metabolites, ensuring a comprehensive and flexible method for microbiome-wide association studies. With the increasing availability of large-scale or multimodal data, unsupervised and deep learning algorithms may also be incorporated into the biomarker identification framework to capture complex patterns [138]. Indeed, preliminary evidence shows that combining various modalities can strengthen the identification of biomarker signatures [121] and support the stratification of individuals into disease types [139]. Caution is needed, however, not to overfit these methods to a specific dataset, as this may lead to “overoptimism” effects (see also the recent work by Ullman et al. [140]). The overfitting issue applies especially to datasets with small sample sizes, which is still the case for the majority of microbiome studies. In any case, large and extensive training datasets are essential to justify the inclusion of data-intensive neural network models.

Consensus-based biomarker prediction approaches will, however, not suffice to resolve all inconsistencies observed. Notably, the datasets used in microbiome-wide association studies result from independent (clinical) trials with different sample numbers and geographies, each with its own protocols for sample collection, library preparation, and sequencing. To further minimize the inconsistencies, it is essential that cohort sizes increase and that the methods used to generate microbiome datasets (and profiles) become more standardized and robust. This latter aspect is elegantly illustrated by the perspective of Schloss [141], who showed how the lack of reproducible and robust approaches can easily lead to contrasting results. In summary, the use of different technical approaches (i.e., isolation protocols, sequencing technologies, or databases) and sampling from different geographical origins compromise the reproducibility of investigations, as recently outlined by Abdill et al. [142], who evaluated the geographic and technical effects on variation in the human gut microbiome using a large compendium (> 168,000 samples). Although these effects can be partly reduced by large-scale analyses, it remains crucial that microbiome studies include multiple, complementary strategies to answer the same research question from different angles. In a field driven by rapid technological and methodological advances, the robustness of the experimental setup is thus crucial and should be considered for all parts of the analysis. Here, we see an important future challenge (and opportunity) for the microbiome research community, one that is essential to advance our understanding of the mechanisms underlying disease progression.

Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

E.S.K., Y.G., and R.B. conceptualized the study; Y.G. and M.L.B. conducted the bioinformatics analysis; E.S.K. and W.P. supervised the study; W.P. drafted the manuscript; W.P. and R.B. revised the manuscript text for grammar and consistency, and C.A.-de-L. for biostatistical content. All authors approved the final version of the manuscript.

Funding

This research was supported by BaseClear B.V. and NWO Gravitation: BRAINSCAPES—a roadmap from neurogenetics to neurobiology (grant no. 024.004.012).

Acknowledgments

The authors would like to thank Dr. Solon Pissis for supervising the thesis work of Yashjit Gangopadhyay, which contributed to the development of the key microbial marker framework and the method comparisons.

Supporting Information

Additional supporting information can be found online in the Supporting Information section.

Open Research

Data Availability Statement

The sequence data that support the findings of this study are openly available in the Sequence Read Archive at https://www.ncbi.nlm.nih.gov/sra, reference numbers SRP000383 and SRP262626, the European Nucleotide Archive at https://www.ebi.ac.uk/ena/browser/home, reference number ERP005534, and the NCBI BioProject database at https://www.ncbi.nlm.nih.gov/bioproject, reference number PRJNA433459; see also Table 1. No new sequence data was created or analyzed in this study.

Supporting Information

Filename	Description
agm39676659-sup-0001-f1.docx (Word 2007 document, 4.7 MB)	Supporting Information 1. Figure S1: This figure shows the results of the abundance and diversity analysis on the CRC dataset of Zeller et al. [28]. Figure S1A: The abundance of the five phyla which were found most significantly different between control and disease samples in the study by Zeller et al. (and the most present in the data). Figure S1B: The alpha diversity measure (using three different indices).
agm39676659-sup-0002-f2.docx (Word 2007 document, 4.8 MB)	Supporting Information 2. Figure S2: This figure shows the results of the abundance and diversity analysis on the Alzheimer’s disease (AD) dataset of Ling et al. [44]. Figure S2A: The abundance of the 10 phyla which were found most present in the control and disease samples. Figure S2B: The alpha diversity measure (using three different indices).
agm39676659-sup-0003-f3.docx (Word 2007 document, 4.9 MB)	Supporting Information 3. Figure S3: This figure shows the results of the abundance and diversity analysis on the Parkinson’s disease (PD) dataset of Qian et al. [48]. Figure S3A: The abundance of the 20 phyla which were found most present in the control and disease samples. Figure S3B: The alpha diversity measure (using three different indices).


References

1 Fan Y. and Pedersen O., Gut Microbiota in Human Metabolic Health and Disease, Nature Reviews. Microbiology. (2021) 19, no. 1, 55–71, https://doi.org/10.1038/s41579-020-0433-9, 32887946.
2 Chen W., Liu F., Ling Z., Tong X., and Xiang C., Human Intestinal Lumen and Mucosa-Associated Microbiota in Patients With Colorectal Cancer, PLoS One. (2012) 7, no. 6, e39743, https://doi.org/10.1371/journal.pone.0039743, 2-s2.0-84862997111, 22761885.
3 Clay S. L., Fonseca-Pereira D., and Garrett W. S., Colorectal Cancer: The Facts in the Case of the Microbiota, The Journal of Clinical Investigation. (2022) 132, no. 4, e155101, https://doi.org/10.1172/JCI155101, 35166235.
4 Glassner K. L., Abraham B. P., and Quigley E. M. M., The Microbiome and Inflammatory Bowel Disease, The Journal of Allergy and Clinical Immunology. (2020) 145, no. 1, 16–27, https://doi.org/10.1016/j.jaci.2019.11.003.
5 Morgan X. C., Tickle T. L., Sokol H., Gevers D., Devaney K. L., Ward D. V., Reyes J. A., Shah S. A., LeLeiko N., Snapper S. B., Bousvaros A., Korzenik J., Sands B. E., Xavier R. J., and Huttenhower C., Dysfunction of the Intestinal Microbiome in Inflammatory Bowel Disease and Treatment, Genome Biology. (2012) 13, no. 9, https://doi.org/10.1186/gb-2012-13-9-r79, 23013615.
6 Yang T., Santisteban M. M., Rodriguez V., Li E., Ahmari N., Carvajal J. M., Zadeh M., Gong M., Qi Y., Zubcevic J., Sahay B., Pepine C. J., Raizada M. K., and Mohamadzadeh M., Gut Dysbiosis Is Linked to Hypertension, Hypertension. (2015) 65, no. 6, 1331–1340, https://doi.org/10.1161/HYPERTENSIONAHA.115.05315, 2-s2.0-84937549863, 25870193.
7 Marchesi J. R., Adams D. H., Fava F., Hermes G. D. A., Hirschfield G. M., Hold G., Quraishi M. N., Kinross J., Smidt H., Tuohy K. M., Thomas L. V., Zoetendal E. G., and Hart A., The Gut Microbiota and Host Health: A New Clinical Frontier, Gut. (2016) 65, no. 2, 330–339, https://doi.org/10.1136/gutjnl-2015-309990, 2-s2.0-84958841613, 26338727.
8 Singer-Englar T., Barlow G., and Mathur R., Obesity, Diabetes, and the Gut Microbiome: An Updated Review, Expert Review of Gastroenterology & Hepatology. (2019) 13, no. 1, 3–15, https://doi.org/10.1080/17474124.2019.1543023, 2-s2.0-85058620855.
9 Bairamian D., Sha S., Rolhion N., Sokol H., Dorothée G., Lemere C. A., and Krantic S., Microbiota in Neuroinflammation and Synaptic Dysfunction: A Focus on Alzheimer’s Disease, Molecular Neurodegeneration. (2022) 17, no. 1, https://doi.org/10.1186/s13024-022-00522-2, 35248147.
10 Vogt N. M., Kerby R. L., Dill-McFarland K. A., Harding S. J., Merluzzi A. P., Johnson S. C., Carlsson C. M., Asthana S., Zetterberg H., Blennow K., Bendlin B. B., and Rey F. E., Gut Microbiome Alterations in Alzheimer’s Disease, Scientific Reports. (2017) 7, no. 1, https://doi.org/10.1038/s41598-017-13601-y, 2-s2.0-85041538555, 29051531.
11 Aho V. T. E., Houser M. C., Pereira P. A. B., Chang J., Rudi K., Paulin L., Hertzberg V., Auvinen P., Tansey M. G., and Scheperjans F., Relationships of Gut Microbiota, Short-Chain Fatty Acids, Inflammation, and the Gut Barrier in Parkinson’s Disease, Molecular Neurodegeneration. (2021) 16, no. 1, https://doi.org/10.1186/s13024-021-00427-6, 33557896.
12 Forsyth C. B., Shannon K. M., Kordower J. H., Voigt R. M., Shaikh M., Jaglin J. A., Estes J. D., Dodiya H. B., and Keshavarzian A., Increased Intestinal Permeability Correlates With Sigmoid Mucosa alpha-Synuclein Staining and Endotoxin Exposure Markers in Early Parkinson’s Disease, PLoS One. (2011) 6, no. 12, e28032, https://doi.org/10.1371/journal.pone.0028032, 2-s2.0-82455206282, 22145021.
13 Duvallet C., Gibbons S. M., Gurry T., Irizarry R. A., and Alm E. J., Meta-Analysis of Gut Microbiome Studies Identifies Disease-Specific and Shared Responses, Nature Communications. (2017) 8, no. 1, https://doi.org/10.1038/s41467-017-01973-8, 2-s2.0-85037088224, 29209090.
14 Manor O., Dai C. L., Kornilov S. A., Smith B., Price N. D., Lovejoy J. C., Gibbons S. M., and Magis A. T., Health and Disease Markers Correlate With Gut Microbiome Composition Across Thousands of People, Nature Communications. (2020) 11, no. 1, https://doi.org/10.1038/s41467-020-18871-1, 33060586.
15 Fu J., Bonder M. J., Cenit M. C., Tigchelaar E. F., Maatman A., Dekens J. A. M., Brandsma E., Marczynska J., Imhann F., Weersma R. K., Franke L., Poon T. W., Xavier R. J., Gevers D., Hofker M. H., Wijmenga C., and Zhernakova A., The Gut Microbiome Contributes to a Substantial Proportion of the Variation in Blood Lipids, Circulation Research. (2015) 117, no. 9, 817–824, https://doi.org/10.1161/CIRCRESAHA.115.306807, 2-s2.0-84942870086, 26358192.
16 Alcolea D., Beeri M. S., Rojas J. C., Gardner R. C., and Lleó A., Blood Biomarkers in Neurodegenerative Diseases: Implications for the Clinical Neurologist, Neurology. (2023) 101, no. 4, 172–180, https://doi.org/10.1212/WNL.0000000000207193, 36878698.
17 Kluge A., Bunk J., Schaeffer E., Drobny A., Xiang W., Knacke H., Bub S., Lückstädt W., Arnold P., Lucius R., Berg D., and Zunke F., Detection of Neuron-Derived Pathological α-Synuclein in Blood, Brain. (2022) 145, no. 9, 3058–3071, https://doi.org/10.1093/brain/awac115, 35722765.
18 Rinninella E., Raoul P., Cintoni M., Franceschi F., Miggiano G. A. D., Gasbarrini A., and Mele M. C., What is the Healthy Gut Microbiota Composition? A Changing Ecosystem Across Age, Environment, Diet, and Diseases, Microorganisms. (2019) 7, no. 1, https://doi.org/10.3390/microorganisms7010014, 2-s2.0-85071342398, 30634578.
19 Xia Y., Correlation and Association Analyses in Microbiome Study Integrating Multiomics in Health and Disease, 2020, 171, Progress in Molecular Biology and Translational Science, https://doi.org/10.1016/bs.pmbts.2020.04.003.
20 Behrouzi A., Nafari A. H., and Siadat S. D., The Significance of Microbiome in Personalized Medicine, Clinical and Translational Medicine. (2019) 8, no. 1, e16, https://doi.org/10.1186/s40169-019-0232-y, 31081530.
21 Lopera-Maya E. A., Kurilshikov A., Van Der Graaf A., Hu S., Andreu-Sánchez S., Chen L., Vila A. V., Gacesa R., Sinha T., Collij V., Klaassen M. A. Y., Bolte L. A., Gois M. F. B., Neerincx P. B. T., Swertz M. A., LifeLines Cohort Study, Aguirre-Gamboa R., Deelen P., Franke L., Kuivenhoven J. A., Lopera-Maya E. A., Nolte I. M., Sanna S., Snieder H., Swertz M. A., Vonk J. M., Wijmenga C., Harmsen H. J. M., Wijmenga C., Fu J., Weersma R. K., Zhernakova A., and Sanna S., Effect of Host Genetics on the Gut Microbiome in 7,738 Participants of the Dutch Microbiome Project, Nature Genetics. (2022) 54, no. 2, 143–151, https://doi.org/10.1038/s41588-021-00992-y.
22 Zhou Y.-H. and Gallins P., A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction, Frontiers in Genetics. (2019) 10, https://doi.org/10.3389/fgene.2019.00579, 2-s2.0-85068979894, 31293616.
23 Marcos-Zambrano L. J., Karaduzovic-Hadziabdic K., Loncar Turukalo T., Przymus P., Trajkovik V., Aasmets O., Berland M., Gruca A., Hasic J., Hron K., Klammsteiner T., Kolev M., Lahti L., Lopes M. B., Moreno V., Naskinova I., Org E., Paciência I., Papoutsoglou G., Shigdel R., Stres B., Vilne B., Yousef M., Zdravevski E., Tsamardinos I., Pau E. C. D. S., Claesson M. J., Moreno-Indias I., and Truu J., Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment, Frontiers in Microbiology. (2021) 12, 634511, https://doi.org/10.3389/fmicb.2021.634511, 33737920.
24 Tibshirani R., The Lasso Method for Variable Selection in the Cox Model, Statistics in Medicine. (1997) 16, no. 4, 385–395, https://doi.org/10.1002/(SICI)1097-0258(19970228)16:4%3C385::AID-SIM380%3E3.0.CO;2-3.
25 Kursa M. B. and Rudnicki W. R., Feature Selection With the Boruta Package, Journal of Statistical Software. (2010) 36, no. 11, https://doi.org/10.18637/jss.v036.i11, 2-s2.0-77958158373.
26 Segata N., Izard J., Waldron L., Gevers D., Miropolsky L., Garrett W. S., and Huttenhower C., Metagenomic Biomarker Discovery and Explanation, Genome Biology. (2011) 12, no. 6, https://doi.org/10.1186/gb-2011-12-6-r60, 2-s2.0-79959383523, 21702898.
27 Wirbel J., Zych K., Essex M., Karcher N., Kartal E., Salazar G., Bork P., Sunagawa S., and Zeller G., Microbiome Meta-Analysis and Cross-Disease Comparison Enabled by the SIAMCAT Machine Learning Toolbox, Genome Biology. (2021) 22, no. 1, https://doi.org/10.1186/s13059-021-02306-1.
28 Zeller G., Tap J., Voigt A. Y., Sunagawa S., Kultima J. R., Costea P. I., Amiot A., Böhm J., Brunetti F., Habermann N., Hercog R., Koch M., Luciani A., Mende D. R., Schneider M. A., Schrotz-King P., Tournigand C., Tran Van Nhieu J., Yamada T., Zimmermann J., Benes V., Kloor M., Ulrich C. M., Von Knebel Doeberitz M., Sobhani I., and Bork P., Potential of Fecal Microbiota for Early-Stage Detection of Colorectal Cancer, Molecular Systems Biology. (2014) 10, no. 11, https://doi.org/10.15252/msb.20145645, 2-s2.0-84924026956, 25432777.
29 Tipton L., Cuenco K. T., Huang L., Greenblatt R. M., Kleerup E., Sciurba F., Duncan S. R., Donahoe M. P., Morris A., and Ghedin E., Measuring Associations Between the Microbiota and Repeated Measures of Continuous Clinical Variables Using a Lasso-Penalized Generalized Linear Mixed Model, BioData Mining. (2018) 11, no. 1, https://doi.org/10.1186/s13040-018-0173-9, 2-s2.0-85048654016, 29983746.
30 Wirbel J., Pyl P. T., Kartal E., Zych K., Kashani A., Milanese A., Fleck J. S., Voigt A. Y., Palleja A., Ponnudurai R., Sunagawa S., Coelho L. P., Schrotz-King P., Vogtmann E., Habermann N., Niméus E., Thomas A. M., Manghi P., Gandini S., Serrano D., Mizutani S., Shiroma H., Shiba S., Shibata T., Yachida S., Yamada T., Waldron L., Naccarati A., Segata N., Sinha R., Ulrich C. M., Brenner H., Arumugam M., Bork P., and Zeller G., Meta-Analysis of Fecal Metagenomes Reveals Global Microbial Signatures That Are Specific for Colorectal Cancer, Nature Medicine. (2019) 25, no. 4, 679–689, https://doi.org/10.1038/s41591-019-0406-6, 2-s2.0-85063786613, 30936547.
31 Schmidt T. S., Hayward M. R., Coelho L. P., Li S. S., Costea P. I., Voigt A. Y., Wirbel J., Maistrenko O. M., Alves R. J., Bergsten E., De Beaufort C., Sobhani I., Heintz-Buschart A., Sunagawa S., Zeller G., Wilmes P., and Bork P., Extensive Transmission of Microbes Along the Gastrointestinal Tract, eLife. (2019) 8, e42693, https://doi.org/10.7554/eLife.42693, 2-s2.0-85063255229, 30747106.
32 Zhao C. Y., Hao Y., Wang Y., Varga J. J., Stecenko A. A., Goldberg J. B., and Brown S. P., Microbiome Data Enhances Predictive Models of Lung Function in People With Cystic Fibrosis, The Journal of Infectious Diseases. (2021) 223, no. Supplement_3, S246–S256, https://doi.org/10.1093/infdis/jiaa655, 33330902.
33 Priya S., Burns M. B., Ward T., Mars R. A. T., Adamowicz B., Lock E. F., Kashyap P. C., Knights D., and Blekhman R., Identification of Shared and Disease-Specific Host Gene–Microbiome Associations Across Human Diseases Using Multi-Omic Integration, Nature Microbiology. (2022) 7, no. 6, 780–795, https://doi.org/10.1038/s41564-022-01121-z, 35577971.
34 Breiman L., Random Forests, Machine Learning. (2001) 45, 5–32, https://doi.org/10.1023/A:1010933404324, 2-s2.0-0035478854.
35 Gong T.-T., He X.-H., Gao S., and Wu Q.-J., Application of Machine Learning in Prediction of Chemotherapy Resistant of Ovarian Cancer Based on Gut Microbiota, Journal of Cancer. (2021) 12, no. 10, 2877–2885, https://doi.org/10.7150/jca.46621, 33854588.
36 Adams K., Tutorial the Gini Impurity Index and What it Means and How to Calculate It, 2018, https://doi.org/10.13140/RG.2.2.23592.21769.
37 Degenhardt F., Seifert S., and Szymczak S., Evaluation of Variable Selection Methods for Random Forests and Omics Data Sets, Briefings in Bioinformatics. (2019) 20, no. 2, 492–503, https://doi.org/10.1093/bib/bbx124, 2-s2.0-85063712587, 29045534.
38 Acharjee A., Larkman J., Xu Y., Cardoso V. R., and Gkoutos G. V., A Random Forest Based Biomarker Discovery and Power Analysis Framework for Diagnostics Research, BMC Medical Genomics. (2020) 13, no. 1, https://doi.org/10.1186/s12920-020-00826-6, 33228632.
39 Kruskal W. H. and Wallis W. A., Use of Ranks in One-Criterion Variance Analysis, Journal of the American Statistical Association. (1952) 47, no. 260, 583–621, https://doi.org/10.1080/01621459.1952.10483441, 2-s2.0-84943709252.
40 Wilcoxon F., Individual Comparisons by Ranking Methods, Breakthroughs in Statistics: Methodology and Distribution, (eds. Kotz S. and Johnson N. L.), 1992, Springer, 196–202, https://doi.org/10.1007/978-1-4612-4380-9_16.
41 Mann H. B. and Whitney D. R., On a Test of Whether one of Two Random Variables is Stochastically Larger Than the Other, Annals of Mathematical Statistics. (1947) 18, no. 1, 50–60, https://doi.org/10.1214/aoms/1177730491.
42 Fisher R. A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics. (1936) 7, no. 2, 179–188, https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
43 Gandy K. A. O., Zhang J., Nagarkatti P., and Nagarkatti M., The Role of Gut Microbiota in Shaping the Relapse-Remitting and Chronic-Progressive Forms of Multiple Sclerosis in Mouse Models, Scientific Reports. (2019) 9, no. 1, https://doi.org/10.1038/s41598-019-43356-7, 2-s2.0-85065304962, 31061496.
44 Ling Z., Zhu M., Liu X., Shao L., Cheng Y., Yan X., Jiang R., and Wu S., Fecal Fungal Dysbiosis in Chinese Patients With Alzheimer’s Disease, Frontiers in Cell and Developmental Biology. (2021) 8, 631460, https://doi.org/10.3389/fcell.2020.631460, 33585471.
45 Xu C., Chen Y., Zang Q., Li Y., Zhao J., Lu X., Jiang M., Zhuang H., and Huang L., The Effects of Cultivation Patterns and Nitrogen Levels on Fertility and Bacterial Community Characteristics of Surface and Subsurface Soil, Frontiers in Microbiology. (2023) 14, 1072228, https://doi.org/10.3389/fmicb.2023.1072228, 36876089.
46 Kostic A. D., Gevers D., Pedamallu C. S., Michaud M., Duke F., Earl A. M., Ojesina A. I., Jung J., Bass A. J., Tabernero J., Baselga J., Liu C., Shivdasani R. A., Ogino S., Birren B. W., Huttenhower C., Garrett W. S., and Meyerson M., Genomic Analysis Identifies Association of Fusobacterium With Colorectal Carcinoma, Genome Research. (2012) 22, no. 2, 292–298, https://doi.org/10.1101/gr.126573.111, 2-s2.0-84863022950, 22009990.
47 Balakrishnan B., Luckey D., Wright K., Davis J. M., Chen J., and Taneja V., Eggerthella lenta Augments Preclinical Autoantibody Production and Metabolic Shift Mimicking Senescence in Arthritis, Science Advances. (2023) 9, no. 35, https://doi.org/10.1126/sciadv.adg1129, 37656793.
48 Qian Y., Yang X., Xu S., Huang P., Li B., Du J., He Y., Su B., Xu L.-M., Wang L., Huang R., Chen S., and Xiao Q., Gut Metagenomics-Derived Genes as Potential Biomarkers of Parkinson’s Disease, Brain. (2020) 143, no. 8, 2474–2489, https://doi.org/10.1093/brain/awaa201, 32844199.
49 Tan A. H., Lim S. Y., and Lang A. E., The Microbiome–Gut–Brain Axis in Parkinson Disease — From Basic Research to the Clinic, Nature Reviews. Neurology. (2022) 18, no. 8, 476–495, https://doi.org/10.1038/s41582-022-00681-2, 35750883.
50 Bedarf J. R., Hildebrand F., Coelho L. P., Sunagawa S., Bahram M., Goeser F., Bork P., and Wüllner U., Functional Implications of Microbial and Viral Gut Metagenome Changes in Early Stage L-DOPA-Naïve Parkinson’s Disease Patients, Genome Medicine. (2017) 9, no. 1, https://doi.org/10.1186/s13073-017-0428-y, 2-s2.0-85018307343, 28449715.
51 Baldini F., Hertel J., Sandt E., Thinnes C. C., Neuberger-Castillo L., Pavelka L., Betsou F., Krüger R., Thiele I., on behalf of the NCER-PD Consortium, Aguayo G., Allen D., Ammerlann W., Aurich M., Balling R., Banda P., Beaumont K., Becker R., Berg D., Binck S., Bisdorff A., Bobbili D., Brockmann K., Calmes J., Castillo L., Diederich N., Dondelinger R., Esteves D., Ferrand J.-Y., Fleming R., Gantenbein M., Gasser T., Gawron P., Geffers L., Giarmana V., Glaab E., Gomes C. P. C., Goncharenko N., Graas J., Graziano M., Groues V., Grünewald A., Gu W., Hammot G., Hanff A.-M., Hansen L., Hansen M., Haraldsdöttir H., Heirendt L., Herbrink S., Herzinger S., Heymann M., Hiller K., Hipp G., Hu M., Huiart L., Hundt A., Jacoby N., Jarosław J., Jaroz Y., Kolber P., Kutzera J., Landoulsi Z., Larue C., Lentz R., Liepelt I., Liszka R., Longhino L., Lorentz V., Mackay C., Maetzler W., Marcus K., Marques G., Martens J., Mathay C., Matyjaszczyk P., May P., Meisch F., Menster M., Minelli M., Mittelbronn M., Mollenhauer B., Mommaerts K., Moreno C., Mühlschlegel F., Nati R., Nehrbass U., Nickels S., Nicolai B., Nicolay J.-P., Noronha A., Oertel W., Ostaszewski M., Pachchek S., Pauly C., Perquin M., Reiter D., Rosety I., Rump K., Satagopam V., Schlesser M., Schmitz S., Schmitz S., Schneider R., Schwamborn J., Schweicher A., Simons J., Stute L., Trefois C., Trezzi J.-P., Vaillant M., Vasco D., Vyas M., Wade-Martins R., and Wilmes P., Parkinson’s Disease-Associated Alterations of the Gut Microbiome Predict Disease-Relevant Changes in Metabolic Functions, BMC Biology. (2020) 18, https://doi.org/10.1186/s12915-020-00775-7.
52 Cirstea M. S., Yu A. C., Golz E., Sundvick K., Kliger D., Radisavljevic N., Foulger L. H., Mackenzie M., Huan T., Finlay B. B., and Appel-Cresswell S., Microbiota Composition and Metabolism Are Associated With Gut Function in Parkinson’s Disease, Movement Disorders. (2020) 35, no. 7, 1208–1217, https://doi.org/10.1002/mds.28052.
53 Shannon C. E., A Mathematical Theory of Communication, Bell System Technical Journal. (1948) 27, no. 3, 379–423, https://doi.org/10.1002/j.1538-7305.1948.tb01338.x, 2-s2.0-84940644968.
54 Simpson E. H., Measurement of Diversity, Nature. (1949) 163, no. 4148, 688–688, https://doi.org/10.1038/163688a0, 2-s2.0-33344464667.
55 McMurdie P. J. and Holmes S., Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data, PLoS One. (2013) 8, no. 4, e61217, https://doi.org/10.1371/journal.pone.0061217, 2-s2.0-84876427223, 23630581.
56 Edgar R. C., Search and Clustering Orders of Magnitude Faster Than BLAST, Bioinformatics. (2010) 26, no. 19, 2460–2461, https://doi.org/10.1093/bioinformatics/btq461, 2-s2.0-77957244650, 20709691.
57 Cole J. R., Wang Q., Fish J. A., Chai B., McGarrell D. M., Sun Y., Brown C. T., Porras-Alfaro A., Kuske C. R., and Tiedje J. M., Ribosomal Database Project: Data and Tools for High Throughput rRNA Analysis, Nucleic Acids Research. (2014) 42, no. D1, D633–D642, https://doi.org/10.1093/nar/gkt1244, 2-s2.0-84891787231, 24288368.
58 Truong D. T., Tett A., Pasolli E., Huttenhower C., and Segata N., Microbial Strain-Level Population Structure and Genetic Diversity From Metagenomes, Genome Research. (2017) 27, no. 4, 626–638, https://doi.org/10.1101/gr.216242.116, 2-s2.0-85017527948.
59 MetaHIT Consortium, Nielsen H. B., Almeida M., Juncker A. S., Rasmussen S., Li J., Sunagawa S., Plichta D. R., Gautier L., Pedersen A. G., Le Chatelier E., Pelletier E., Bonde I., Nielsen T., Manichanh C., Arumugam M., Batto J.-M., Quintanilha Dos Santos M. B., Blom N., Borruel N., Burgdorf K. S., Boumezbeur F., Casellas F., Doré J., Dworzynski P., Guarner F., Hansen T., Hildebrand F., Kaas R. S., Kennedy S., Kristiansen K., Kultima J. R., Léonard P., Levenez F., Lund O., Moumen B., Le Paslier D., Pons N., Pedersen O., Prifti E., Qin J., Raes J., Sørensen S., Tap J., Tims S., Ussery D. W., Yamada T., Renault P., Sicheritz-Ponten T., Bork P., Wang J., Brunak S., and Ehrlich S. D., Identification and Assembly of Genomes and Genetic Elements in Complex Metagenomic Samples Without Using Reference Genomes, Nature Biotechnology. (2014) 32, no. 8, 822–828, https://doi.org/10.1038/nbt.2939, 2-s2.0-84905730761.
60 Brüssow H., Problems With the Concept of Gut Microbiota Dysbiosis, Microbial Biotechnology. (2020) 13, no. 2, 423–434, https://doi.org/10.1111/1751-7915.13479, 2-s2.0-85071142853, 31448542.
61 Wood D. E., Lu J., and Langmead B., Improved Metagenomic Analysis With Kraken 2, Genome Biology. (2019) 20, no. 1, https://doi.org/10.1186/s13059-019-1891-0, 31779668.
62 Lu J., Breitwieser F. P., Thielen P., and Salzberg S. L., Bracken: Estimating Species Abundance in Metagenomics Data, PeerJ Computer Science. (2017) 3, e104, https://doi.org/10.7717/peerj-cs.104, 2-s2.0-85026281530.
63 Cardoso T. F., Cánovas A., Canela-Xandri O., González-Prendes R., Amills M., and Quintanilla R., RNA-seq Based Detection of Differentially Expressed Genes in the Skeletal Muscle of Duroc Pigs With Distinct Lipid Profiles, Scientific Reports. (2017) 7, no. 1, 40005, https://doi.org/10.1038/srep40005, 2-s2.0-85012240010, 28195222.
64 Ahn J., Sinha R., Pei Z., Dominianni C., Wu J., Shi J., Goedert J. J., Hayes R. B., and Yang L., Human Gut Microbiome and Risk For Colorectal Cancer, JNCI Journal of the National Cancer Institute. (2013) 105, no. 24, 1907–1911, https://doi.org/10.1093/jnci/djt300, 2-s2.0-84891473723, 24316595.
65 Ai D., Pan H., Li X., Gao Y., Liu G., and Xia L. C., Identifying Gut Microbiota Associated With Colorectal Cancer Using a Zero-Inflated Lognormal Model, Frontiers in Microbiology. (2019) 10, https://doi.org/10.3389/fmicb.2019.00826, 2-s2.0-85066139662, 31068913.
66 Luo Q., Zhou P., Chang S., Huang Z., and Zeng X., Characterization of Butyrate-Metabolism in Colorectal Cancer to Guide Clinical Treatment, Scientific Reports. (2023) 13, no. 1, https://doi.org/10.1038/s41598-023-32457-z, 36991138.
67 Zhang H., Chang Y., Zheng Q., Zhang R., Hu C., and Jia W., Altered Intestinal Microbiota Associated With Colorectal Cancer, Frontiers in Medicine. (2019) 13, no. 4, 461–470, https://doi.org/10.1007/s11684-019-0695-7, 2-s2.0-85068227667.
68 Wang Y., Wan X., Wu X., Zhang C., Liu J., and Hou S., Eubacterium rectale Contributes to Colorectal Cancer Initiation via Promoting Colitis, Gut Pathogens. (2021) 13, no. 1, https://doi.org/10.1186/s13099-020-00396-z, 33436075.
69 Ryu S. W., Kim J.-S., Oh B. S., Choi W. J., Yu S. Y., Bak J. E., Park S.-H., Kang S. W., Lee J., Jung W. Y., Lee J.-S., and Lee J. H., Gut Microbiota Eubacterium callanderi Exerts Anti-Colorectal Cancer Activity, Microbiology Spectrum. (2022) 10, no. 6, e02531, https://doi.org/10.1128/spectrum.02531-22, 36448791.
70 He Z., Gharaibeh R. Z., Newsome R. C., Pope J. L., Dougherty M. W., Tomkovich S., Pons B., Mirey G., Vignard J., Hendrixson D. R., and Jobin C., Campylobacter jejuni Promotes Colorectal Tumorigenesis Through the Action of Cytolethal Distending Toxin, Gut. (2019) 68, no. 2, 289–300, https://doi.org/10.1136/gutjnl-2018-317200, 2-s2.0-85056091623, 30377189.
71 Zhu H., Li M., Bi D., Yang H., Gao Y., Song F., Zheng J., Xie R., Zhang Y., Liu H., Yan X., Kong C., Zhu Y., Xu Q., Wei Q., and Qin H., Fusobacterium nucleatum Promotes Tumor Progression in KRAS p.G12D-Mutant Colorectal Cancer by Binding to DHX15, Nature Communications. (2024) 15, no. 1, 1688, https://doi.org/10.1038/s41467-024-45572-w, 38402201.
72 White M. T. and Sears C. L., The Microbial Landscape of Colorectal Cancer, Nature Reviews. Microbiology. (2024) 22, no. 4, 240–254, https://doi.org/10.1038/s41579-023-00973-4, 37794172.
73 Okumura S., Konishi Y., Narukawa M., Sugiura Y., Yoshimoto S., Arai Y., Sato S., Yoshida Y., Tsuji S., Uemura K., Wakita M., Matsudaira T., Matsumoto T., Kawamoto S., Takahashi A., Itatani Y., Miki H., Takamatsu M., Obama K., Takeuchi K., Suematsu M., Ohtani N., Fukunaga Y., Ueno M., Sakai Y., Nagayama S., and Hara E., Gut Bacteria Identified in Colorectal Cancer Patients Promote Tumourigenesis via Butyrate Secretion, Nature Communications. (2021) 12, no. 1, https://doi.org/10.1038/s41467-021-25965-x, 34584098.
74 Yachida S., Mizutani S., Shiroma H., Shiba S., Nakajima T., Sakamoto T., Watanabe H., Masuda K., Nishimoto Y., Kubo M., Hosoda F., Rokutan H., Matsumoto M., Takamaru H., Yamada M., Matsuda T., Iwasaki M., Yamaji T., Yachida T., Soga T., Kurokawa K., Toyoda A., Ogura Y., Hayashi T., Hatakeyama M., Nakagama H., Saito Y., Fukuda S., Shibata T., and Yamada T., Metagenomic and Metabolomic Analyses Reveal Distinct Stage-Specific Phenotypes of the Gut Microbiota in Colorectal Cancer, Nature Medicine. (2019) 25, no. 6, 968–976, https://doi.org/10.1038/s41591-019-0458-7, 2-s2.0-85067007334, 31171880.
75 Purcell R. V., Pearson J., Aitchison A., Dixon L., Frizelle F. A., and Keenan J. I., Colonization With Enterotoxigenic Bacteroides fragilis Is Associated With Early-Stage Colorectal Neoplasia, PLoS One. (2017) 12, no. 2, e0171602, https://doi.org/10.1371/journal.pone.0171602, 2-s2.0-85011319812, 28151975.
76 Zhao L., Zhang X., Zhou Y., Fu K., Lau H. C.-H., Chun T. W.-Y., Cheung A. H.-K., Coker O. O., Wei H., Wu W. K.-K., Wong S. H., Sung J. J.-Y., To K. F., and Yu J., Parvimonas micra Promotes Colorectal Tumorigenesis and Is Associated With Prognosis of Colorectal Cancer Patients, Oncogene. (2022) 41, no. 36, 4200–4210, https://doi.org/10.1038/s41388-022-02395-7, 35882981.
77 Zhang H., Wu J., Ji D., Liu Y., Lu S., Lin Z., Chen T., and Ao L., Microbiome Analysis Reveals Universal Diagnostic Biomarkers for Colorectal Cancer Across Populations and Technologies, Frontiers in Microbiology. (2022) 13, https://doi.org/10.3389/fmicb.2022.1005201, 36406447.
78 Warren R. L., Freeman D. J., Pleasance S., Watson P., Moore R. A., Cochrane K., Allen-Vercoe E., and Holt R. A., Co-Occurrence of Anaerobic Bacteria in Colorectal Carcinomas, Microbiome. (2013) 1, no. 1, https://doi.org/10.1186/2049-2618-1-16, 2-s2.0-84896066661, 24450771.
79 Zhuang Z.-Q., Shen L.-L., Li W.-W., Fu X., Zeng F., Gui L., Lü Y., Cai M., Zhu C., Tan Y.-L., Zheng P., Li H.-Y., Zhu J., Zhou H.-D., Bu X.-L., and Wang Y.-J., Gut Microbiota Is Altered in Patients With Alzheimer’s Disease, Journal of Alzheimer’s Disease. (2018) 63, no. 4, 1337–1346, https://doi.org/10.3233/JAD-180176, 2-s2.0-85048627786.
80 DeSantis T. Z., Hugenholtz P., Larsen N., Rojas M., Brodie E. L., Keller K., Huber T., Dalevi D., Hu P., and Andersen G. L., Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible With ARB, Applied and Environmental Microbiology. (2006) 72, no. 7, 5069–5072, https://doi.org/10.1128/AEM.03006-05, 2-s2.0-33746061683, 16820507.
81 Campos P. M., Darwish N., Shao J., and Proszkowiec-Weglarz M., Research Note: Choice of Microbiota Database Affects Data Analysis and Interpretation in Chicken Cecal Microbiota, Poultry Science. (2022) 101, no. 8, 101971, https://doi.org/10.1016/j.psj.2022.101971, 35759996.
82 Fitzgerald C. B., Shkoporov A. N., Sutton T. D. S., Chaplin A. V., Velayudhan V., Ross R. P., and Hill C., Comparative Analysis of Faecalibacterium prausnitzii Genomes Shows a High Level of Genome Plasticity and Warrants Separation Into New Species-Level Taxa, BMC Genomics. (2018) 19, no. 1, https://doi.org/10.1186/s12864-018-5313-6, 2-s2.0-85058596678, 30547746.
83 Zhu G., Zhao J., Zhang H., Wang G., and Chen W., Gut Microbiota and Its Metabolites: Bridge of Dietary Nutrients and Alzheimer’s Disease, Advances in Nutrition. (2023) 14, no. 4, 819–839, https://doi.org/10.1016/j.advnut.2023.04.005, 37075947.
84 Dalile B., Van Oudenhove L., Vervliet B., and Verbeke K., The Role of Short-Chain Fatty Acids in Microbiota–Gut–Brain Communication, Nature Reviews. Gastroenterology & Hepatology. (2019) 16, no. 8, 461–478, https://doi.org/10.1038/s41575-019-0157-3, 2-s2.0-85066887841.
85 Silva Y. P., Bernardi A., and Frozza R. L., The Role of Short-Chain Fatty Acids From Gut Microbiota in Gut-Brain Communication, Frontiers in Endocrinology. (2020) 11, https://doi.org/10.3389/fendo.2020.00025, 32082260.
86 Liu P., Wu L., Peng G., Han Y., Tang R., Ge J., Zhang L., Jia L., Yue S., Zhou K., Li L., Luo B., and Wang B., Altered Microbiomes Distinguish Alzheimer’s Disease From Amnestic Mild Cognitive Impairment and Health in a Chinese Cohort, Brain, Behavior, and Immunity. (2019) 80, 633–643, https://doi.org/10.1016/j.bbi.2019.05.008, 2-s2.0-85065546840.
87 Hodgkinson K., El Abbar F., Dobranowski P., Manoogian J., Butcher J., Figeys D., Mack D., and Stintzi A., Butyrate’s Role in Human Health and the Current Progress Towards Its Clinical Application to Treat Gastrointestinal Disease, Clinical Nutrition. (2023) 42, no. 2, 61–75, https://doi.org/10.1016/j.clnu.2022.10.024, 36502573.
88 Hung C.-C., Chang C.-C., Huang C.-W., Nouchi R., and Cheng C.-H., Gut Microbiota in Patients With Alzheimer’s Disease Spectrum: A Systematic Review and Meta-Analysis, Aging. (2022) 14, no. 1, 477–496, https://doi.org/10.18632/aging.203826, 35027502.
89 Yıldırım S., Nalbantoğlu Ö. U., Bayraktar A., Ercan F. B., Gündoğdu A., Velioğlu H. A., Göl M. F., Soylu A. E., Koç F., Gülpınar E. A., Kadak K. S., Arıkan M., Mardinoğlu A., Koçak M., Köseoğlu E., and Hanoğlu L., Stratification of the Gut Microbiota Composition Landscape across the Alzheimer’s Disease Continuum in a Turkish Cohort, Msystems. (2022) 7, no. 1, e0000422, https://doi.org/10.1128/msystems.00004-22, 35133187.
90 Wanapaisan P., Chuansangeam M., Nopnipa S., Mathuranyanon R., Nonthabenjawan N., Ngamsombat C., Thientunyakit T., and Muangpaisan W., Association Between Gut Microbiota with Mild Cognitive Impairment and Alzheimer’s Disease in a Thai Population, Neurodegenerative Diseases. (2023) 22, no. 2, 43–54, https://doi.org/10.1159/000526947.
91 Cattaneo A., Cattane N., Galluzzi S., Provasi S., Lopizzo N., Festari C., Ferrari C., Guerra U. P., Paghera B., Muscio C., Bianchetti A., Volta G. D., Turla M., Cotelli M. S., Gennuso M., Prelle A., Zanetti O., Lussignoli G., Mirabile D., Bellandi D., Gentile S., Belotti G., Villani D., Harach T., Bolmont T., Padovani A., Boccardi M., Frisoni G. B., and INDIA-FBP Group, Association of Brain Amyloidosis With Pro-Inflammatory Gut Bacterial Taxa and Peripheral Inflammation Markers in Cognitively Impaired Elderly, Neurobiology of Aging. (2017) 49, 60–68, https://doi.org/10.1016/j.neurobiolaging.2016.08.019, 2-s2.0-84993945111, 27776263.
92 Haran J. P., Bhattarai S. K., Foley S. E., Dutta P., Ward D. V., Bucci V., and McCormick B. A., Alzheimer’s Disease Microbiome Is Associated With Dysregulation of the Anti-Inflammatory P-Glycoprotein Pathway, mBio. (2019) 10, e00632, https://doi.org/10.1128/mBio.00632-19, 2-s2.0-85065785061.
93 Underly R., Song M.-S., Dunbar G. L., and Weaver C. L., Expression of Alzheimer-Type Neurofibrillary Epitopes in Primary Rat Cortical Neurons Following Infection with Enterococcus faecalis, Frontiers in Aging Neuroscience. (2016) 7, https://doi.org/10.3389/fnagi.2015.00259, 2-s2.0-84960121738, 26834627.
94 Sun J., Sundsfjord A., and Song X., Enterococcus faecalis From Patients With Chronic Periodontitis: Virulence and Antimicrobial Resistance Traits and Determinants, European Journal of Clinical Microbiology & Infectious Diseases. (2012) 31, no. 3, 267–272, https://doi.org/10.1007/s10096-011-1305-z, 2-s2.0-84857052296, 21660501.
95 Mylona E., Vadala C., Papastamopoulos V., and Skoutelis A., Brain Abscess Caused by Enterococcus faecalis Following a Dental Procedure in a Patient with Hereditary Hemorrhagic Telangiectasia, Journal of Clinical Microbiology. (2012) 50, no. 5, 1807–1809, https://doi.org/10.1128/JCM.06658-11, 2-s2.0-84860003364, 22337991.
96 Hou M., Xu G., Ran M., Luo W., and Wang H., APOE-ε4 Carrier Status and Gut Microbiota Dysbiosis in Patients With Alzheimer Disease, Frontiers in Neuroscience. (2021) 15, 619051, https://doi.org/10.3389/fnins.2021.619051, 33732104.
97 Avram-Hananel L., Stock J., Parlesak A., Bode C., and Schwartz B., E Durans Strain M4–5 Isolated From Human Colonic Flora Attenuates Intestinal Inflammation, Diseases of the Colon and Rectum. (2010) 53, no. 12, 1676–1686, https://doi.org/10.1007/DCR.0b013e3181f4b148, 2-s2.0-78651487211, 21178864.
98 Xi J., Ding D., Zhu H., Wang R., Su F., Wu W., Xiao Z., Liang X., Zhao Q., Hong Z., Fu H., and Xiao Q., Disturbed Microbial Ecology in Alzheimer’s Disease: Evidence From the Gut Microbiota and Fecal Metabolome, BMC Microbiology. (2021) 21, no. 1, https://doi.org/10.1186/s12866-021-02286-z, 34384375.
99 Milosavljevic M. N., Kostic M., Milovanovic J., Zaric R. Z., Stojadinovic M., Jankovic S. M., and Stefanovic S. M., Antimicrobial Treatment of Erysipelatoclostridium ramosum Invasive Infections: A Systematic Review, Revista do Instituto de Medicina Tropical de Sao Paulo. (2021) 63, e30, https://doi.org/10.1590/s1678-9946202163030.
100 Selma M. V., Beltrán D., García-Villalba R., Espín J. C., and Tomás-Barberán F. A., Description of Urolithin Production Capacity From Ellagic Acid of Two Human Intestinal Gordonibacter species, Food & Function. (2014) 5, no. 8, 1779–1784, https://doi.org/10.1039/C4FO00092G, 2-s2.0-84905049318, 24909569.
101 Kujawska M. and Jodynis-Liebert J., Polyphenols in Parkinson’s Disease: A Systematic Review of In Vivo Studies, Nutrients. (2018) 10, no. 5, https://doi.org/10.3390/nu10050642, 2-s2.0-85047299567.
102 Romo-Vaquero M., Fernández-Villalba E., Gil-Martinez A.-L., Cuenca-Bermejo L., Espín J. C., Herrero M. T., and Selma M. V., Urolithins: Potential Biomarkers of Gut Dysbiosis and Disease Stage in Parkinson’s Patients, Food & Function. (2022) 13, no. 11, 6306–6316, https://doi.org/10.1039/D2FO00552B, 35611932.
103 Lubomski M., Xu X., Holmes A. J., Muller S., Yang J. Y. H., Davis R. L., and Sue C. M., The Gut Microbiome in Parkinson’s Disease: A Longitudinal Study of the Impacts on Disease Progression and the Use of Device-Assisted Therapies, Frontiers in Aging Neuroscience. (2022) 14, 875261, https://doi.org/10.3389/fnagi.2022.875261.
104 Maini Rekdal V., Bess E. N., Bisanz J. E., Turnbaugh P. J., and Balskus E. P., Discovery and Inhibition of an Interspecies Gut Bacterial Pathway for Levodopa Metabolism, Science. (2019) 364, no. 6445, https://doi.org/10.1126/science.aau6323, 2-s2.0-85067609952, 31196984.
105 Van Kessel S. P., Frye A. K., El-Gendy A. O., Castejon M., Keshavarzian A., Van Dijk G., and El Aidy S., Gut Bacterial Tyrosine Decarboxylases Restrict Levels of Levodopa in the Treatment of Parkinson’s Disease, Nature Communications. (2019) 10, no. 1, https://doi.org/10.1038/s41467-019-08294-y, 2-s2.0-85060132817, 30659181.
106 Li C., Cui L., Yang Y., Miao J., Zhao X., Zhang J., Cui G., and Zhang Y., Gut Microbiota Differs Between Parkinson’s Disease Patients and Healthy Controls in Northeast China, Frontiers in Molecular Neuroscience. (2019) 12, https://doi.org/10.3389/fnmol.2019.00171, 2-s2.0-85070731209, 31354427.
107 Cosma-Grigorov A., Meixner H., Mrochen A., Wirtz S., Winkler J., and Marxreiter F., Changes in Gastrointestinal Microbiome Composition in PD: A Pivotal Role of Covariates, Frontiers in Neurology. (2020) 11, https://doi.org/10.3389/fneur.2020.01041, 33071933.
108 Ren T., Gao Y., Qiu Y., Jiang S., Zhang Q., Zhang J., Wang L., Zhang Y., Wang L., and Nie K., Gut Microbiota Altered in Mild Cognitive Impairment Compared With Normal Cognition in Sporadic Parkinson’s Disease, Frontiers in Neurology. (2020) 11, https://doi.org/10.3389/fneur.2020.00137, 32161568.
109 Vascellari S., Palmas V., Melis M., Pisanu S., Cusano R., Uva P., Perra D., Madau V., Sarchioto M., Oppo V., Simola N., Morelli M., Santoru M. L., Atzori L., Melis M., Cossu G., and Manzin A., Gut Microbiota and Metabolome Alterations Associated With Parkinson’s Disease, Msystems. (2020) 5, no. 5, e00561, https://doi.org/10.1128/mSystems.00561-20, 32934117.
110 Tan A. H., Chong C. W., Lim S., Yap I. K. S., Teh C. S. J., Loke M. F., Song S., Tan J. Y., Ang B. H., Tan Y. Q., Kho M. T., Bowman J., Mahadeva S., Yong H. S., and Lang A. E., Gut Microbial Ecosystem in Parkinson Disease: New Clinicobiological Insights From Multi-Omics, Annals of Neurology. (2021) 89, no. 3, 546–559, https://doi.org/10.1002/ana.25982, 33274480.
111 Derrien M., Van Baarlen P., Hooiveld G., Norin E., Müller M., and De Vos W. M., Modulation of Mucosal Immune Response, Tolerance, and Proliferation in Mice Colonized by the Mucin-Degrader Akkermansia muciniphila, Frontiers in Microbiology. (2011) 2, https://doi.org/10.3389/fmicb.2011.00166, 2-s2.0-83655210633, 21904534.
112 Reunanen J., Kainulainen V., Huuskonen L., Ottman N., Belzer C., Huhtinen H., De Vos W. M., and Satokari R., Akkermansia muciniphila Adheres to Enterocytes and Strengthens the Integrity of the Epithelial Cell Layer, Applied and Environmental Microbiology. (2015) 81, no. 11, 3655–3662, https://doi.org/10.1128/AEM.04050-14, 2-s2.0-84930038638, 25795669.
113 Xu Y., Wang N., Tan H.-Y., Li S., Zhang C., and Feng Y., Function of Akkermansia muciniphila in Obesity: Interactions With Lipid Metabolism, Immune Response and Gut Systems, Frontiers in Microbiology. (2020) 11, https://doi.org/10.3389/fmicb.2020.00219, 32153527.
114 Li J., Yang G., Zhang Q., Liu Z., Jiang X., and Xin Y., Function of Akkermansia muciniphila in Type 2 Diabetes and Related Diseases, Frontiers in Microbiology. (2023) 14, 1172400, https://doi.org/10.3389/fmicb.2023.1172400, 37396381.
115 Freedman S. N., Shahi S. K., and Mangalam A. K., The “Gut Feeling”: Breaking Down the Role of Gut Microbiome in Multiple Sclerosis, Neurotherapeutics. (2018) 15, no. 1, 109–125, https://doi.org/10.1007/s13311-017-0588-x, 2-s2.0-85036533168, 29204955.
116 Zhai R., Xue X., Zhang L., Yang X., Zhao L., and Zhang C., Strain-Specific Anti-inflammatory Properties of Two Akkermansia muciniphila Strains on Chronic Colitis in Mice, Frontiers in Cellular and Infection Microbiology. (2019) 9, https://doi.org/10.3389/fcimb.2019.00239, 2-s2.0-85072726673, 31334133.
117 Vandeputte D., Falony G., Vieira-Silva S., Tito R. Y., Joossens M., and Raes J., Stool Consistency Is Strongly Associated With Gut Microbiota Richness and Composition, Enterotypes and Bacterial Growth Rates, Gut. (2016) 65, no. 1, 57–62, https://doi.org/10.1136/gutjnl-2015-309618, 2-s2.0-84952942226, 26069274.
118 Cani P. D., Human Gut Microbiome: Hopes, Threats And Promises, Gut. (2018) 67, no. 9, 1716–1725, https://doi.org/10.1136/gutjnl-2018-316723, 2-s2.0-85049137140, 29934437.
119 Shetty S. A., Marathe N. P., Lanjekar V., Ranade D., and Shouche Y. S., Comparative Genome Analysis of Megasphaera sp. Reveals Niche Specialization and Its Potential Role in the Human Gut, PLoS One. (2013) 8, no. 11, e79353, https://doi.org/10.1371/journal.pone.0079353, 2-s2.0-84894133363, 24260205.
120 Cabral L. D. S. and Weimer P. J., Megasphaera elsdenii: Its Role in Ruminant Nutrition and Its Potential Industrial Application for Organic Acid Biosynthesis, Microorganisms. (2024) 12, no. 1, https://doi.org/10.3390/microorganisms12010219, 38276203.
121 Lee N.-R., Lee C. H., Lee D.-Y., and Park J.-B., Genome-Scale Metabolic Network Reconstruction and In Silico Analysis of Hexanoic Acid Producing Megasphaera elsdenii, Microorganisms. (2020) 8, no. 4, https://doi.org/10.3390/microorganisms8040539, 32283671.
122 Abdik E. and Çakır T., Transcriptome-Based Biomarker Prediction for Parkinson’s Disease Using Genome-Scale Metabolic Modeling, Scientific Reports. (2024) 14, no. 1, https://doi.org/10.1038/s41598-023-51034-y, 38182712.
123 Blais E. M., Rawls K. D., Dougherty B. V., Li Z. I., Kolling G. L., Ye P., Wallqvist A., and Papin J. A., Reconciled Rat and Human Metabolic Networks for Comparative Toxicogenomics and Biomarker Predictions, Nature Communications. (2017) 8, no. 1, https://doi.org/10.1038/ncomms14250, 2-s2.0-85012009478, 28176778.
124 Marchandin H., Jumas-Bilak E., Gay B., Teyssier C., Jean-Pierre H., Siméon De Buochberg M., and Carrière C., Phylogenetic Analysis of Some Sporomusa sub-Branch Members Isolated From Human Clinical Specimens: Description of Megasphaera micronuciformis sp. nov, International Journal of Systematic and Evolutionary Microbiology. (2003) 53, no. 2, 547–553, https://doi.org/10.1099/ijs.0.02378-0, 2-s2.0-0037357804.
125 Lubomski M., Xu X., Holmes A. J., Yang J. Y. H., Sue C. M., and Davis R. L., The Impact of Device-Assisted Therapies on the Gut Microbiome in Parkinson’s Disease, Journal of Neurology. (2022) 269, no. 2, 780–795, https://doi.org/10.1007/s00415-021-10657-9, 34128115.
126 Chen Y., Zheng H., Zhang G., Chen F., Chen L., and Yang Z., High Oscillospira Abundance Indicates Constipation and Low BMI in the Guangdong Gut Microbiome Project, Scientific Reports. (2020) 10, no. 1, https://doi.org/10.1038/s41598-020-66369-z, 32518316.
127 Kitahara M., Shigeno Y., Shime M., Matsumoto Y., Nakamura S., Motooka D., Fukuoka S., Nishikawa H., and Benno Y., Vescimonas gen. nov., Vescimonas coprocola sp. nov., Vescimonas fastidiosa sp. nov., Pusillimonas gen. nov. and Pusillimonas faecalis sp. nov. isolated from human faeces, International Journal of Systematic and Evolutionary Microbiology. (2021) 71, no. 11, https://doi.org/10.1099/ijsem.0.005066.
128 Deyaert S., Moens F., Pirovano W., Van Den Bogert B., Klaassens E. S., Marzorati M., Van De Wiele T., Kleerebezem M., and Van Den Abbeele P., Development of a Reproducible Small Intestinal Microbiota Model and Its Integration Into the SHIME®-System, a Dynamic In Vitro Gut Model, Frontiers in Microbiology. (2023) 13, 1054061, https://doi.org/10.3389/fmicb.2022.1054061.
129 Zepeda-Rivera M., Minot S. S., Bouzek H., Wu H., Blanco-Míguez A., Manghi P., Jones D. S., LaCourse K. D., Wu Y., McMahon E. F., Park S.-N., Lim Y. K., Kempchinsky A. G., Willis A. D., Cotton S. L., Yost S. C., Sicinska E., Kook J.-K., Dewhirst F. E., Segata N., Bullman S., and Johnston C. D., A Distinct Fusobacterium nucleatum Clade Dominates the Colorectal Cancer Niche, Nature. (2024) 628, no. 8007, 424–432, https://doi.org/10.1038/s41586-024-07182-w.
130 Van Kessel S. P., Bullock A., Van Dijk G., and El Aidy S., Parkinson’s Disease Medication Alters Small Intestinal Motility and Microbiota Composition in Healthy Rats, Msystems. (2022) 7, no. 1, e0119121, https://doi.org/10.1128/msystems.01191-21, 35076270.
131 Van Den Abbeele P., Deyaert S., Thabuis C., Perreau C., Bajic D., Wintergerst E., Joossens M., Firrman J., Walsh D., and Baudot A., Bridging Preclinical and Clinical Gut Microbiota Research Using the Ex Vivo SIFR Technology, Frontiers in Microbiology. (2023) 14, 1131662, https://doi.org/10.3389/fmicb.2023.1131662, 37187538.
132 Jansma J., Thome N. U., Schwalbe M., Chatziioannou A. C., Elsayed S. S., Van Wezel G. P., Van Den Abbeele P., Van Hemert S., and El Aidy S., Dynamic Effects of Probiotic Formula Ecologic®825 on Human Small Intestinal Ileostoma Microbiota: A Network Theory Approach, Gut Microbes. (2023) 15, no. 1, 2232506, https://doi.org/10.1080/19490976.2023.2232506, 37417553.
133 Chandra S., Sisodia S. S., and Vassar R. J., The Gut Microbiome in Alzheimer’s Disease: What We Know and What Remains to Be Explored, Molecular Neurodegeneration. (2023) 18, no. 1, https://doi.org/10.1186/s13024-023-00595-7, 36721148.
134 Wallen Z. D., Demirkan A., Twa G., Cohen G., Dean M. N., Standaert D. G., Sampson T. R., and Payami H., Metagenomics of Parkinson’s Disease Implicates the Gut Microbiome in Multiple Disease Mechanisms, Nature Communications. (2022) 13, no. 1, https://doi.org/10.1038/s41467-022-34667-x, 36376318.
135 Díez López C., Montiel González D., Vidaki A., and Kayser M., Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning, Frontiers in Microbiology. (2022) 13, 886201, https://doi.org/10.3389/fmicb.2022.886201, 35928158.
136 Sayyari E., Kawas B., and Mirarab S., TADA: Phylogenetic Augmentation of Microbiome Samples Enhances Phenotype Classification, Bioinformatics. (2019) 35, no. 14, i31–i40, https://doi.org/10.1093/bioinformatics/btz394, 2-s2.0-85068931173, 31510701.
137 Nearing J. T., Douglas G. M., Hayes M. G., MacDonald J., Desai D. K., Allward N., Jones C. M. A., Wright R. J., Dhanani A. S., Comeau A. M., and Langille M. G. I., Microbiome Differential Abundance Methods Produce Different Results Across 38 Datasets, Nature Communications. (2022) 13, no. 1, https://doi.org/10.1038/s41467-022-28034-z, 35039521.
138 Hernández Medina R., Kutuzova S., Nielsen K. N., Johansen J., Hansen L. H., Nielsen M., and Rasmussen S., Machine Learning and Deep Learning Applications in Microbiome Research, ISME Communications. (2022) 2, no. 1, https://doi.org/10.1038/s43705-022-00182-9, 37938690.
139 Shomorony I., Cirulli E. T., Huang L., Napier L. A., Heister R. R., Hicks M., Cohen I. V., Yu H.-C., Swisher C. L., Schenker-Ahmed N. M., Li W., Nelson K. E., Brar P., Kahn A. M., Spector T. D., Caskey C. T., Venter J. C., Karow D. S., Kirkness E. F., and Shah N., An Unsupervised Learning Approach to Identify Novel Signatures of Health and Disease From Multimodal Data, Genome Medicine. (2020) 12, no. 1, https://doi.org/10.1186/s13073-019-0705-z.
140 Ullmann T., Peschel S., Finger P., Müller C. L., and Boulesteix A.-L., Over-Optimism in Unsupervised Microbiome Analysis: Insights From Network Learning and Clustering, PLoS Computational Biology. (2023) 19, no. 1, e1010820, https://doi.org/10.1371/journal.pcbi.1010820, 36608142.
141 Schloss P. D., Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research, mBio. (2018) 9, no. 3, e00525, https://doi.org/10.1128/mBio.00525-18, 2-s2.0-85048275080, 29871915.
142 Abdill R. J., Graham S. P., Rubinetti V., Ahmadian M., Hicks P., Chetty A., McDonald D., Ferretti P., Gibbons E., Rossi M., Krishnan A., Albert F. W., Greene C. S., Davis S., and Blekhman R., Integration of 168,000 Samples Reveals Global Patterns of the Human Gut Microbiome, Cell. (2025) 188, no. 4, 1100–1118.e17, https://doi.org/10.1016/j.cell.2024.12.017, 39848248.