With the growing availability of multi-omics datasets, there is increasing evidence that integrated omics profiles lead to more efficient discovery of clinically actionable biomarkers, which in turn enable better disease outcome prediction and patient stratification. Several methods exist to perform host phenotype prediction from cross-sectional, single-omics data modalities, but decentralized frameworks that jointly analyze multiple time-dependent omics data to highlight the integrative and dynamic impact of repeatedly measured biomarkers are currently limited. In this article, we propose a novel Bayesian ensemble method that consolidates prediction by combining information across several longitudinal and cross-sectional omics data layers. Unlike existing frequentist paradigms, our approach enables uncertainty quantification in prediction as well as interval estimation for a variety of quantities of interest based on posterior summaries. We apply our method to four published multi-omics datasets and demonstrate that it recapitulates known biology, provides novel insights, and outperforms existing methods in estimation, prediction, and uncertainty quantification. Our open-source software is publicly available at https://github.com/himelmallick/IntegratedLearner.
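To make the idea of combining layers and reporting interval estimates from posterior summaries concrete, the following is a minimal sketch and not the IntegratedLearner implementation. It assumes that each omics layer has already produced posterior predictive draws for one sample from some Bayesian base learner, and that non-negative layer weights are available from a separate stacking step. The function names `combine_layers` and `credible_interval`, the layer names, the simulated draws, and the weights are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def combine_layers(draws_by_layer, weights):
    """Combine per-layer posterior draws into ensemble draws.

    draws_by_layer: dict mapping layer name -> list of posterior draws
    weights: dict mapping layer name -> non-negative weight (assumed given)
    Weights are normalized to sum to one (an assumption of this sketch).
    """
    if set(draws_by_layer) != set(weights):
        raise ValueError("layers and weights must have the same keys")
    lengths = {len(d) for d in draws_by_layer.values()}
    if len(lengths) != 1:
        raise ValueError("every layer needs the same number of draws")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_draws = lengths.pop()
    # Weighted sum computed draw by draw, so uncertainty is propagated.
    return [
        sum(weights[k] / total * draws_by_layer[k][i] for k in draws_by_layer)
        for i in range(n_draws)
    ]


def credible_interval(draws, level=0.95):
    """Equal-tailed interval from the empirical quantiles of the draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lo = ordered[int(round(tail * last))]
    hi = ordered[int(round((1.0 - tail) * last))]
    return lo, hi


# Hypothetical draws for one sample from two layers (simulated, not real data).
rng = random.Random(0)
draws = {
    "microbiome": [rng.gauss(0.60, 0.10) for _ in range(2000)],
    "metabolome": [rng.gauss(0.80, 0.05) for _ in range(2000)],
}
weights = {"microbiome": 0.3, "metabolome": 0.7}  # assumed stacking weights

combined = combine_layers(draws, weights)
point = statistics.mean(combined)       # posterior mean prediction
lo, hi = credible_interval(combined)    # 95% equal-tailed interval
print(f"prediction {point:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

Combining draw by draw, rather than averaging point predictions, is what lets the interval reflect the uncertainty contributed by every layer.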

Open Research

DATA AVAILABILITY STATEMENT

IntegratedLearner is publicly available as an R/Bioconductor package, together with its source code, documentation, and a tutorial, at https://github.com/himelmallick/IntegratedLearner. Analysis scripts for the synthetic benchmarking and real data analyses are available from the first author upon request. Previously published data used in this study are cited in the main text and in the References section. A detailed data summary is provided in Table 1. All processed data for the four case studies are available at https://github.com/himelmallick/IntegratedLearner.

REFERENCES

1. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017; 18(1): 1-15. doi:10.1186/s13059-017-1215-1
2. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insight. 2020; 14: 1-24. doi:10.1177/1177932219899051
3. Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019; 35(17): 3055-3062. doi:10.1093/bioinformatics/bty1054
4. Ghaemi MS, DiGiulio DB, Contrepois K, et al. Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy. Bioinformatics. 2019; 35(1): 95-103. doi:10.1093/bioinformatics/bty537
5. Stelzer IA, Ghaemi MS, Han X, et al. Integrated trajectories of the maternal metabolome, proteome, and immunome predict labor onset. Sci Transl Med. 2021; 13(592): eabd9898. doi:10.1126/scitranslmed.abd9898
6. Franzosa EA, Sirota-Madi A, Avila-Pacheco J, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019; 4(2): 293-305. doi:10.1038/s41564-018-0306-4
7. Zhang L, Lv C, Jin Y, et al. Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front Genet. 2018; 9: 477. doi:10.3389/fgene.2018.00477
8. Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated multi-omics analyses in oncology: a review of machine learning methods and tools. Front Oncol. 2020; 10: 1030. doi:10.3389/fonc.2020.01030
9. Culos A, Tsai AS, Stanley N, et al. Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions. Nature Mach Intell. 2020; 2(10): 619-628. doi:10.1038/s42256-020-00232-8
10. Kaufman S, Rosset S, Perlich C, Stitelman O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans Knowl Discov Data. 2012; 6(4): 1-21. doi:10.1145/2382577.2382579
11. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010; 4(1): 266-298. doi:10.1214/09-AOAS285
12. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007; 6(1): 25.
13. Tan YV, Roy J. Bayesian additive regression trees and the general BART model. Stat Med. 2019; 38(25): 5048-5069. doi:10.1002/sim.8347
14. Horrocks J, van den Heuvel MJ. Prediction of pregnancy: a joint model for longitudinal and binary data. Bayesian Anal. 2009; 4(3): 523-538. doi:10.1214/09-BA419
15. Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol. 2018; 33(5): 459-464. doi:10.1007/s10654-018-0390-z
16. Lloyd-Price J, Arze C, Ananthakrishnan AN, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019; 569(7758): 655-662. doi:10.1038/s41586-019-1237-9
17. Ma S, Shungin D, Mallick H, et al. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol. 2022; 23(1): 1-31. doi:10.1186/s13059-022-02753-4
18. Mallick H, Rahnavard A, McIver LJ, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol. 2021; 17(11): e1009442. doi:10.1371/journal.pcbi.1009442
19. Krassowski M, Das V, Sahu SK, Misra BB. State of the field in multi-omics research: from computational needs to data mining and sharing. Front Genet. 2020; 11: 610798. doi:10.3389/fgene.2020.610798
20. Chalise P, Raghavan R, Fridley BL. InterSIM: simulation tool for multiple integrative ‘omic datasets’. Comput Methods Programs Biomed. 2016; 128: 69-74. doi:10.1016/j.cmpb.2016.02.011
21. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991; 19(1): 1-67. doi:10.1214/aos/1176347963
22. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017; 18(1): 174. doi:10.1186/s13059-017-1305-0
23. Mallick H, Chatterjee S, Chowdhury S, Chatterjee S, Rahnavard A, Hicks SC. Differential expression of single-cell RNA-seq data using Tweedie models. Stat Med. 2022; 41(18): 3492-3510. doi:10.1002/sim.9430
24. Kapelner A, Bleich J. bartMachine: machine learning with Bayesian additive regression trees. arXiv preprint arXiv:1312.2171. 2013.
25. Ren B, Patil P, Dominici F, Parmigiani G, Trippa L. Cross-study learning for generalist and specialist predictions. arXiv preprint arXiv:2007.12807. 2020.
26. Kapelner A, Bleich J. Prediction with missing data via Bayesian additive regression trees. Can J Stat. 2015; 43(2): 224-239. doi:10.1002/cjs.11248
27. Gaines BR, Kim J, Zhou H. Algorithms for fitting the constrained lasso. J Comput Graph Stat. 2018; 27(4): 861-871. doi:10.1080/10618600.2018.1473777
28. Canales-Rodríguez EJ, Pizzolato M, Yu T, et al. Revisiting the T2 spectrum imaging inverse problem: Bayesian regularized non-negative least squares. Neuroimage. 2021; 244: 118582. doi:10.1016/j.neuroimage.2021.118582
29. Linero AR, Yang Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. J R Stat Soc Series B Stat Methodology. 2018; 80(5): 1087-1110. doi:10.1111/rssb.12293
30. Bleich J, Kapelner A, George EI, Jensen ST. Variable selection for BART: an application to gene regulation. Ann Appl Stat. 2014; 8: 1750-1781. doi:10.1214/14-AOAS755
31. Liu Y, Ročková V. Variable selection via Thompson sampling. J Am Stat Assoc. 2023; 118(541): 287-304. doi:10.1080/01621459.2021.1928514
32. Zhang S, Shih YCT, Müller P, et al. A spatially-adjusted Bayesian additive regression tree model to merge two datasets. Bayesian Anal. 2007; 2(3): 611-633. doi:10.1214/07-BA224
33. Shafer G, Vovk V. A tutorial on conformal prediction. J Mach Learn Res. 2008; 9(12): 371-421.
34. Xu Y, Liaw A, Sheridan RP, Svetnik V. Development and evaluation of conformal prediction methods for QSAR. arXiv preprint arXiv:2304.00970. 2023.
35. Rynazal R, Fujisawa K, Shiroma H, et al. Leveraging explainable AI for gut microbiome-based colorectal cancer classification. Genome Biol. 2023; 24(1): 1-13. doi:10.1186/s13059-023-02858-4
36. Spanbauer C, Sparapani R. Nonparametric machine learning for precision medicine with longitudinal clinical trials and Bayesian additive regression trees with mixed models. Stat Med. 2021; 40(11): 2665-2691. doi:10.1002/sim.8924
37. Hajjem A, Bellavance F, Larocque D. Mixed-effects random forest for clustered data. J Stat Comput Simul. 2014; 84(6): 1313-1328. doi:10.1080/00949655.2012.741599
38. Murray JS. Log-linear Bayesian additive regression trees for multinomial logistic and count regression models. J Am Stat Assoc. 2021; 116(534): 756-769. doi:10.1080/01621459.2020.1813587
39. Sparapani RA, Logan BR, McCulloch RE, Laud PW. Nonparametric survival analysis using Bayesian additive regression trees (BART). Stat Med. 2016; 35(16): 2741-2753. doi:10.1002/sim.6893
40. Kindo B, Wang H, Peña E. MBACT–multiclass Bayesian additive classification trees. arXiv preprint arXiv:1309.7821. 2013.
41. Ding DY, Li S, Narasimhan B, Tibshirani R. Cooperative learning for multiview analysis. Proc Natl Acad Sci. 2022; 119(38): e2202113119. doi:10.1073/pnas.2202113119
42. Ferrari F, Dunson DB. Bayesian factor analysis for inference on interactions. J Am Stat Assoc. 2021; 116(535): 1521-1532. doi:10.1080/01621459.2020.1745813
43. Argelaguet R, Velten B, Arnol D, et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018; 14(6): e8124. doi:10.15252/msb.20178124
44. Argelaguet R, Arnol D, Bredikhin D, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020; 21(1): 1-17. doi:10.1186/s13059-020-02015-1
45. Bing X, Lovelace T, Bunea F, et al. Essential regression: a generalizable framework for inferring causal latent factors from multi-omic datasets. Patterns. 2022; 3(5): 100473. doi:10.1016/j.patter.2022.100473
46. Capitaine L, Genuer R, Thiébaut R. Random forests for high-dimensional longitudinal data. Stat Methods Med Res. 2021; 30(1): 166-184. doi:10.1177/0962280220946080
47. Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nature Comput Sci. 2021; 1(6): 395-402. doi:10.1038/s43588-021-00086-z
48. Lucarelli N, Yun D, Han D, et al. Discovery of novel digital biomarkers for type 2 diabetic nephropathy classification via integration of urinary proteomics and pathology. medRxiv. 2023; 2023:2004.
49. Miao Z, Humphreys BD, McMahon AP, Kim J. Multi-omics integration in the age of million single-cell data. Nat Rev Nephrol. 2021; 17(11): 710-724. doi:10.1038/s41581-021-00463-x
50. Zhang M, Liu S, Miao Z, Han F, Gottardo R, Sun W. IDEAS: individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. 2022; 23(1): 1-17. doi:10.1186/s13059-022-02605-1
51. Burgess DJ. Spatial transcriptomics coming of age. Nat Rev Genet. 2019; 20(6): 317. doi:10.1038/s41576-019-0129-z
52. Barsoum I, Tawedrous E, Faragalla H, Yousef GM. Histo-genomics: digital pathology at the forefront of precision medicine. Diagnosis. 2019; 6(3): 203-212. doi:10.1515/dx-2018-0064
53. Palla G, Spitzer H, Klein M, et al. Squidpy: a scalable framework for spatial omics analysis. Nat Methods. 2022; 19(2): 171-178. doi:10.1038/s41592-021-01358-2
54. Hu J, Schroeder A, Coleman K, Chen C, Auerbach BJ, Li M. Statistical and machine learning methods for spatially resolved transcriptomics with histology. Comput Struct Biotechnol J. 2021; 19: 3829-3841. doi:10.1016/j.csbj.2021.06.052
55. Xu Y, McCord RP. Diagonal integration of multimodal single-cell data: potential pitfalls and paths forward. Nat Commun. 2022; 13(1): 3505. doi:10.1038/s41467-022-31104-x

Volume 43, Issue 5

28 February 2024

Pages 983-1002

An integrated Bayesian framework for multi-omics prediction and classification
