An integrated Bayesian framework for multi-omics prediction and classification
Corresponding Author
Himel Mallick
Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, 10065 New York, USA
Department of Statistics and Data Science, Cornell University, Ithaca, New York, USA
Correspondence
Himel Mallick, Weill Cornell Medicine, Cornell University, New York, NY, USA.
Email: [email protected]
Erina Paul, Merck & Co., Inc., Rahway, NJ, USA.
Email: [email protected]
Anupreet Porwal
Department of Statistics, University of Washington, Seattle, Washington, USA
Satabdi Saha
Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
Piyali Basak
Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA
Vladimir Svetnik
Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA
Corresponding Author
Erina Paul
Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA
Abstract
With the growing prevalence of multi-omics datasets, there is now increasing evidence that integrated omics profiles lead to more efficient discovery of clinically actionable biomarkers that enable better disease outcome prediction and patient stratification. Several methods exist to perform host phenotype prediction from cross-sectional, single-omics data modalities, but decentralized frameworks that jointly analyze multiple time-dependent omics data to highlight the integrative and dynamic impact of repeatedly measured biomarkers are currently limited. In this article, we propose a novel Bayesian ensemble method to consolidate prediction by combining information across several longitudinal and cross-sectional omics data layers. Unlike existing frequentist paradigms, our approach enables uncertainty quantification in prediction as well as interval estimation for a variety of quantities of interest based on posterior summaries. We apply our method to four published multi-omics datasets and demonstrate that it recapitulates known biology in addition to providing novel insights while also outperforming existing methods in estimation, prediction, and uncertainty quantification. Our open-source software is publicly available at https://github.com/himelmallick/IntegratedLearner.
Open Research
DATA AVAILABILITY STATEMENT
The implementation of IntegratedLearner is publicly available with source code, documentation, tutorial, and as an R/Bioconductor package at https://github.com/himelmallick/IntegratedLearner. Analysis scripts for synthetic benchmarking and real data analyses are available from the first author upon request. Previously published data used in this study are appropriately cited in the main text as well as in the References section. The detailed data summary is provided in Table 1. All processed data for the four case studies are available at https://github.com/himelmallick/IntegratedLearner.
Supporting Information
Filename | Description |
---|---|
sim9953-sup-0001-OnlineAppendix.zip (Zip archive, 682.2 KB) | Appendix S1. Supporting information |
REFERENCES
- 1. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017; 18(1): 1-15.
- 2. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insight. 2020; 14: 1-24.
- 3. Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019; 35(17): 3055-3062.
- 4. Ghaemi MS, DiGiulio DB, Contrepois K, et al. Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy. Bioinformatics. 2019; 35(1): 95-103.
- 5. Stelzer IA, Ghaemi MS, Han X, et al. Integrated trajectories of the maternal metabolome, proteome, and immunome predict labor onset. Sci Transl Med. 2021; 13(592): eabd9898.
- 6. Franzosa EA, Sirota-Madi A, Avila-Pacheco J, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019; 4(2): 293-305.
- 7. Zhang L, Lv C, Jin Y, et al. Deep learning-based multi-omics data integration reveals two prognostic subtypes in high-risk neuroblastoma. Front Genet. 2018; 9: 477.
- 8. Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated multi-omics analyses in oncology: a review of machine learning methods and tools. Front Oncol. 2020; 10: 1030.
- 9. Culos A, Tsai AS, Stanley N, et al. Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions. Nature Mach Intell. 2020; 2(10): 619-628.
- 10. Kaufman S, Rosset S, Perlich C, Stitelman O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans Knowl Discov Data. 2012; 6(4): 1-21.
- 11. Chipman HA, George EI, McCulloch RE, et al. BART: Bayesian additive regression trees. Ann Appl Stat. 2010; 4(1): 266-298.
- 12. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007; 6(1): 25.
- 13. Tan YV, Roy J. Bayesian additive regression trees and the general BART model. Stat Med. 2019; 38(25): 5048-5069.
- 14. Horrocks J, van den Heuvel MJ. Prediction of pregnancy: a joint model for longitudinal and binary data. Bayesian Anal. 2009; 4(3): 523-538.
- 15. Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol. 2018; 33(5): 459-464.
- 16. Lloyd-Price J, Arze C, Ananthakrishnan AN, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019; 569(7758): 655-662.
- 17. Ma S, Shungin D, Mallick H, et al. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol. 2022; 23(1): 1-31.
- 18. Mallick H, Rahnavard A, McIver LJ, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol. 2021; 17(11): e1009442.
- 19. Krassowski M, Das V, Sahu SK, Misra BB. State of the field in multi-omics research: from computational needs to data mining and sharing. Front Genet. 2020; 11: 610798.
- 20. Chalise P, Raghavan R, Fridley BL. InterSIM: simulation tool for multiple integrative ‘omic datasets’. Comput Methods Programs Biomed. 2016; 128: 69-74.
- 21. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991; 19(1): 1-67.
- 22. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017; 18(1): 174.
- 23. Mallick H, Chatterjee S, Chowdhury S, Chatterjee S, Rahnavard A, Hicks SC. Differential expression of single-cell RNA-seq data using Tweedie models. Stat Med. 2022; 41(18): 3492-3510.
- 24. Kapelner A, Bleich J. bartMachine: machine learning with Bayesian additive regression trees. arXiv preprint arXiv:1312.2171. 2013.
- 25. Ren B, Patil P, Dominici F, Parmigiani G, Trippa L. Cross-study learning for generalist and specialist predictions. arXiv preprint arXiv:2007.12807. 2020.
- 26. Kapelner A, Bleich J. Prediction with missing data via Bayesian additive regression trees. Can J Stat. 2015; 43(2): 224-239.
- 27. Gaines BR, Kim J, Zhou H. Algorithms for fitting the constrained lasso. J Comput Graph Stat. 2018; 27(4): 861-871.
- 28. Canales-Rodríguez EJ, Pizzolato M, Yu T, et al. Revisiting the T2 spectrum imaging inverse problem: Bayesian regularized non-negative least squares. Neuroimage. 2021; 244: 118582.
- 29. Linero AR, Yang Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. J R Stat Soc Series B Stat Methodology. 2018; 80(5): 1087-1110.
- 30. Bleich J, Kapelner A, George EI, Jensen ST. Variable selection for BART: an application to gene regulation. Ann Appl Stat. 2014; 8: 1750-1781.
- 31. Liu Y, Ročková V. Variable selection via Thompson sampling. J Am Stat Assoc. 2023; 118(541): 287-304.
- 32. Zhang S, Shih YCT, Müller P, et al. A spatially-adjusted Bayesian additive regression tree model to merge two datasets. Bayesian Anal. 2007; 2(3): 611-633.
- 33. Shafer G, Vovk V. A tutorial on conformal prediction. J Mach Learn Res. 2008; 9(12): 371-421.
- 34. Xu Y, Liaw A, Sheridan RP, Svetnik V. Development and evaluation of conformal prediction methods for QSAR. arXiv preprint arXiv:2304.00970. 2023.
- 35. Rynazal R, Fujisawa K, Shiroma H, et al. Leveraging explainable AI for gut microbiome-based colorectal cancer classification. Genome Biol. 2023; 24(1): 1-13.
- 36. Spanbauer C, Sparapani R. Nonparametric machine learning for precision medicine with longitudinal clinical trials and Bayesian additive regression trees with mixed models. Stat Med. 2021; 40(11): 2665-2691.
- 37. Hajjem A, Bellavance F, Larocque D. Mixed-effects random forest for clustered data. J Stat Comput Simul. 2014; 84(6): 1313-1328.
- 38. Murray JS. Log-linear Bayesian additive regression trees for multinomial logistic and count regression models. J Am Stat Assoc. 2021; 116(534): 756-769.
- 39. Sparapani RA, Logan BR, McCulloch RE, Laud PW. Nonparametric survival analysis using Bayesian additive regression trees (BART). Stat Med. 2016; 35(16): 2741-2753.
- 40. Kindo B, Wang H, Peña E. MBACT–multiclass Bayesian additive classification trees. arXiv preprint arXiv:1309.7821. 2013.
- 41. Ding DY, Li S, Narasimhan B, Tibshirani R. Cooperative learning for multiview analysis. Proc Natl Acad Sci. 2022; 119(38): e2202113119.
- 42. Ferrari F, Dunson DB. Bayesian factor analysis for inference on interactions. J Am Stat Assoc. 2021; 116(535): 1521-1532.
- 43. Argelaguet R, Velten B, Arnol D, et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018; 14(6): e8124.
- 44. Argelaguet R, Arnol D, Bredikhin D, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020; 21(1): 1-17.
- 45. Bing X, Lovelace T, Bunea F, et al. Essential regression: a generalizable framework for inferring causal latent factors from multi-omic datasets. Patterns. 2022; 3(5): 100473.
- 46. Capitaine L, Genuer R, Thiébaut R. Random forests for high-dimensional longitudinal data. Stat Methods Med Res. 2021; 30(1): 166-184.
- 47. Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nature Comput Sci. 2021; 1(6): 395-402.
- 48. Lucarelli N, Yun D, Han D, et al. Discovery of novel digital biomarkers for type 2 diabetic nephropathy classification via integration of urinary proteomics and pathology. medRxiv. 2023; 2023:2004.
- 49. Miao Z, Humphreys BD, McMahon AP, Kim J. Multi-omics integration in the age of million single-cell data. Nat Rev Nephrol. 2021; 17(11): 710-724.
- 50. Zhang M, Liu S, Miao Z, Han F, Gottardo R, Sun W. IDEAS: individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. 2022; 23(1): 1-17.
- 51. Burgess DJ. Spatial transcriptomics coming of age. Nat Rev Genet. 2019; 20(6): 317.
- 52. Barsoum I, Tawedrous E, Faragalla H, Yousef GM. Histo-genomics: digital pathology at the forefront of precision medicine. Diagnosis. 2019; 6(3): 203-212.
- 53. Palla G, Spitzer H, Klein M, et al. Squidpy: a scalable framework for spatial omics analysis. Nat Methods. 2022; 19(2): 171-178.
- 54. Hu J, Schroeder A, Coleman K, Chen C, Auerbach BJ, Li M. Statistical and machine learning methods for spatially resolved transcriptomics with histology. Comput Struct Biotechnol J. 2021; 19: 3829-3841.
- 55. Xu Y, McCord RP. Diagonal integration of multimodal single-cell data: potential pitfalls and paths forward. Nat Commun. 2022; 13(1): 3505.