Differential expression of single-cell RNA-seq data using Tweedie models
Corresponding Author
Himel Mallick
Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA
Correspondence: Himel Mallick, Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA.
Email: [email protected]
Ali Rahnavard, Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA.
Email: [email protected]
Stephanie C. Hicks, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA.
Email: [email protected]
Suvo Chatterjee
Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, USA
Shrabanti Chowdhury
Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
Saptarshi Chatterjee
Department of Statistics, Data and Analytics, Eli Lilly & Company, Indianapolis, Indiana, USA
Corresponding Author
Ali Rahnavard
Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
Corresponding Author
Stephanie C. Hicks
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
Funding information: Bill and Melinda Gates Foundation, Grant/Award Number: INV-016930; Division of Environmental Biology, Grant/Award Number: DEB-2028280; National Human Genome Research Institute, Grant/Award Number: R00HG009007
Abstract
The performance of computational methods and software for identifying differentially expressed features in single-cell RNA-sequencing (scRNA-seq) data has been shown to depend on several factors, including the choice of normalization method and the choice of experimental platform (or library preparation protocol) used to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method from the more than 100 DE tools available to date, each relying on its own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, we propose to use Tweedie generalized linear models, which can flexibly capture the large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows the zero probability mass to exceed that of a traditional Tweedie distribution, in order to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
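To make the role of the zero mass concrete, the following minimal sketch computes the probability of an exact zero under a Tweedie distribution and under a zero-inflated variant. It assumes the textbook compound Poisson-gamma parametrization (mean `mu`, dispersion `phi`, power index `p` with 1 < p < 2, variance `phi * mu**p`), which is not spelled out in the abstract. The function names and the mixing weight `pi0` are hypothetical and are not part of the Tweedieverse API.

```python
import math


def tweedie_zero_prob(mu: float, phi: float, p: float) -> float:
    """P(Y = 0) for a compound Poisson-gamma Tweedie variable (1 < p < 2).

    Assumed parametrization: Y is a sum of N gamma variables with
    N ~ Poisson(lam), lam = mu**(2 - p) / (phi * (2 - p)), so that
    P(Y = 0) = P(N = 0) = exp(-lam).
    """
    if not 1.0 < p < 2.0:
        raise ValueError("a point mass at zero exists only for 1 < p < 2")
    if mu <= 0.0 or phi <= 0.0:
        raise ValueError("mu and phi must be positive")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    return math.exp(-lam)


def tweedie_variance(mu: float, phi: float, p: float) -> float:
    """Tweedie mean-variance relationship: Var(Y) = phi * mu**p."""
    return phi * mu ** p


def zi_tweedie_zero_prob(pi0: float, mu: float, phi: float, p: float) -> float:
    """P(Y = 0) under a zero-inflated mixture.

    With probability pi0 (hypothetical mixing weight) the observation is a
    structural zero; otherwise it is drawn from the Tweedie component.
    """
    if not 0.0 <= pi0 <= 1.0:
        raise ValueError("pi0 must lie in [0, 1]")
    return pi0 + (1.0 - pi0) * tweedie_zero_prob(mu, phi, p)


if __name__ == "__main__":
    mu, phi, p = 1.0, 1.0, 1.5
    print("Tweedie P(Y=0):             ", tweedie_zero_prob(mu, phi, p))
    print("Zero-inflated P(Y=0), 0.3:  ", zi_tweedie_zero_prob(0.3, mu, phi, p))
    print("Tweedie variance:           ", tweedie_variance(mu, phi, p))
```

For any `pi0` greater than zero, the mixture places strictly more mass at zero than the plain Tweedie component with the same `mu`, `phi`, and `p`, which is the property the zero-inflated model relies on.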
Open Research
DATA AVAILABILITY STATEMENT
Previously published data used in this study are cited in the main text as well as in the References section. A detailed data summary is provided in Table S1. Unless otherwise noted, most of the corresponding annotated digital expression matrices are available from the NCBI Gene Expression Omnibus database. In addition, analysis scripts to process and analyze these datasets are available at https://github.com/himelmallick/Tweedie_SingleCell.
Supporting Information
Filename | Description |
---|---|
sim9430-sup-0001-suppinfo.pdf (PDF document, 215.6 KB) | Data S1 Supporting Information |
REFERENCES
- [1] Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018; 19(4): 562-578. doi:10.1093/biostatistics/kxx053
- [2] Ding J, Adiconis X, Simmons SK, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol. 2020; 38(6): 737-746.
- [3] Chen G, Ning B, Shi T. Single-cell RNA-seq technologies and related computational data analysis. Front Genet. 2019; 10: 317.
- [4] Zheng GX, Terry JM, Belgrader P, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017; 8(1): 1-12.
- [5] Macosko EZ, Basu A, Satija R, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5): 1202-1214.
- [6] Islam S, Zeisel A, Joost S, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014; 11(2): 163-166. doi:10.1038/nmeth.2772
- [7] Grün D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014; 11(6): 637-640. doi:10.1038/nmeth.2930
- [8] Picelli S, Faridani OR, Björklund ÅK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using smart-seq2. Nat Protoc. 2014; 9(1): 171-181.
- [9] Pollen AA, Nowakowski TJ, Shuga J, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014; 32(10): 1053.
- [10] Vieth B, Ziegenhain C, Parekh S, Enard W, Hellmann I. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics. 2017; 33(21): 3486-3488. doi:10.1093/bioinformatics/btx435
- [11] Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019; 20(1): 1-16.
- [12] Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019; 20(1): 1-15.
- [13] Svensson V. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol. 2020; 38(2): 147-150.
- [14] Cao Y, Kitanovski S, Küppers R, Hoffmann D. UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol. 2021; 39(2): 158-159.
- [15] Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat Methods. 2013; 10(12): 1200-1202.
- [16] Korthauer KD, Chu LF, Newton MA, et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 2016; 17(1): 222.
- [17] Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018; 15(4): 255.
- [18] Finak G, McDavid A, Yajima M, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015; 16(1): 1-13.
- [19] Sekula M, Gaskins J, Datta S. Detection of differentially expressed genes in discrete single-cell RNA sequencing data using a hurdle model with correlated random effects. Biometrics. 2019; 75(4): 1051-1062.
- [20] Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018; 9(1): 284. doi:10.1038/s41467-017-02554-5
- [21] Alessandrì L, Arigoni M, Calogero R. Differential expression analysis in single-cell transcriptomics. Methods Mol Biol. 2019; 1979: 425-432.
- [22] Hie B, Peters J, Nyquist SK, Shalek AK, Berger B, Bryson BD. Computational methods for single-cell RNA sequencing. Annu Rev Biomed Data Sci. 2020; 3: 339-364. doi:10.1146/annurev-biodatasci-012220-100601
- [23] Van Buren E, Hu M, Weng C, et al. TWO-SIGMA: a novel two-component single cell model-based association method for single-cell RNA-seq data. Genet Epidemiol. 2021; 45(2): 142-153.
- [24] Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics. 2018; 34(18): 3223-3224.
- [25] Hu MC, Pavlicova M, Nunes EV. Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse. 2011; 37(5): 367-375.
- [26] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1): 139-140.
- [27] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12): 550.
- [28] Hawinkel S, Rayner J, Bijnens L, Thas O. Sequence count data are poorly fit by the negative binomial distribution. PLoS One. 2020; 15(4): e0224909.
- [29] Zappia L, Phipson B, Oshlack A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput Biol. 2018; 14(6): e1006245. doi:10.1371/journal.pcbi.1006245
- [30] Zhang Y. Likelihood-based and Bayesian methods for Tweedie compound Poisson linear mixed models. Stat Comput. 2013; 23: 743-757.
- [31] Tweedie MC. An index which distinguishes between some important exponential families, 579. 1984.
- [32] Jørgensen B. Exponential dispersion models. J Royal Stat Soc Ser B (Methodol). 1987; 49(2): 127-145.
- [33] Kurz C. Tweedie distributions for fitting semicontinuous health care utilization cost data. BMC Med Res Methodol. 2017; 17: 171.
- [34] Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet. 2021; 53(6): 770-777.
- [35] Van den Berge K, Perraudeau F, Soneson C, et al. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol. 2018; 19(1): 24. doi:10.1186/s13059-018-1406-4
- [36] McCullagh P, Nelder J. Generalized Linear Models. 2nd ed. Boca Raton, FL: Chapman & Hall; 1989.
- [37] Cox D, Reid N. Parameter orthogonality and approximate conditional inference. J Royal Stat Soc Ser B. 1987; 49(1): 1-39.
- [38] Dunn P, Smyth G. Evaluation of Tweedie exponential dispersion model densities by Fourier inversion. Stat Comput. 2007; 18: 73-86.
- [39] Dunn PK, Smyth GK. Series evaluation of Tweedie exponential dispersion models. Stat Comput. 2005; 15: 267-280.
- [40] Dunn PK, Smyth GK. Evaluation of Tweedie exponential dispersion models using Fourier inversion. Stat Comput. 2008; 18: 73-86.
- [41] Bonat WH, Kokonendji CC. Flexible Tweedie regression models for continuous data. J Stat Comput Simul. 2017; 87(11): 2138-2152.
- [42] Ma Y, Sun S, Shang X, Keller ET, Chen M, Zhou X. Integrative differential expression and gene set enrichment analysis using summary statistics for scRNA-seq studies. Nat Commun. 2020; 11(1): 1-13.
- [43] Giner G, Smyth GK. statmod: probability calculations for the inverse Gaussian distribution. R J. 2016; 8(1): 339-351.
- [44] Amezquita RA, Lun ATL, Becht E, et al. Orchestrating single-cell analysis with bioconductor. Nat Methods. 2020; 17(2): 137-145. doi:10.1038/s41592-019-0654-x
- [45] Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016; 17(1): 1-14.
- [46] Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019; 15(6): e8746.
- [47] Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001; 29(4): 1165-1188.
- [48] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B (Methodol). 1995; 57(1): 289-300. doi:10.1111/j.2517-6161.1995.tb02031.x
- [49] Assefa AT, Vandesompele J, Thas O. SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data. Bioinformatics. 2020; 36(10): 3276-3278.
- [50] Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008; 9(1): 1-13.
- [51] Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates; 2013. doi:10.4324/9780203771587
- [52] Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017; 18(1): 1-15.
- [53] Islam S, Kjällquist U, Moliner A, et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011; 21(7): 1160-1167.
- [54] Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5): 1187-1201.
- [55] Trapnell C, Cacchiarelli D, Grimsby J, et al. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nat Biotechnol. 2014; 32(4): 381.
- [56] Wu Z, Zhang Y, Stitzel ML, Wu H. Two-phase differential expression analysis for single cell RNA-seq. Bioinformatics. 2018; 34(19): 3340-3348.
- [57] Darmanis S, Sloan SA, Zhang Y, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci. 2015; 112(23): 7285-7290.
- [58] Petropoulos S, Edsgärd D, Reinius B, et al. Single-cell RNA-seq reveals lineage and X chromosome dynamics in human preimplantation embryos. Cell. 2016; 165(4): 1012-1026.
- [59] Snyder BN, Cho Y, Qian Y, Coad JE, Flynn DC, Cunnick JM. AFAP1L1 is a novel adaptor protein of the AFAP family that interacts with cortactin and localizes to invadosomes. Eur J Cell Biol. 2011; 90(5): 376-389.
- [60] Furu M, Kajita Y, Nagayama S, et al. Identification of AFAP1L1 as a prognostic marker for spindle cell sarcomas. Oncogene. 2011; 30(38): 4015-4025.
- [61] Beiter RM, Fernández-Castañeda A, Rivet-Noor C, et al. Evidence for oligodendrocyte progenitor cell heterogeneity in the adult mouse brain. bioRxiv; 2020.
- [62] He X, Cheng R, Benyajati S, Ma JX. PEDF and its roles in physiological and pathological conditions: implication in diabetic and hypoxia-induced angiogenic diseases. Clin Sci. 2015; 128(11): 805-823.
- [63] Ek ET, Dass CR, Choong PF. PEDF: a potential molecular therapeutic target with multiple anti-cancer activities. Trends Mol Med. 2006; 12(10): 497-502.
- [64] Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017; 18(1): 1-14.
- [65] Park J, Shrestha R, Qiu C, et al. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science. 2018; 360(6390): 758-763.
- [66] Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019; 10(1): 1-9.
- [67] Svensson V, Natarajan KN, Ly LH, et al. Power analysis of single-cell RNA-sequencing experiments. Nat Methods. 2017; 14(4): 381-387.
- [68] Qin F, Luo X, Xiao F, Cai G. SCRIP: an accurate simulator for single-cell RNA sequencing data. Bioinformatics. 2022; 38(5): 1304-1311.
- [69] Crowell HL, Leonardo SXM, Soneson C, Robinson MD. Built on sand: the shaky foundations of simulating single-cell RNA sequencing data. bioRxiv; 2021.
- [70] Mallick H, Ma S, Franzosa EA, Vatanen T, Morgan XC, Huttenhower C. Experimental design and quantitative analysis of microbial community multiomics. Genome Biol. 2017; 18(1): 228.
- [71] Mallick H, Rahnavard A, McIver LJ, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol. 2021; 17(11): e1009442.
- [72] Zhang Y, Thompson KN, Huttenhower C, Franzosa EA. Statistical approaches for differential expression analysis in metatranscriptomics. Bioinformatics. 2021; 37(Suppl_1): i34-i41.
- [73] Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat Methods. 2020; 17(2): 193-200.
- [74] Li Q, Zhang M, Xie Y, Xiao G. Bayesian modeling of spatial molecular profiling data via Gaussian process. Bioinformatics. 2021; 37(22): 4129-4136.
- [75] Clivio O, Lopez R, Regier J, Gayoso A, Jordan MI, Yosef N. Detecting zero-inflated genes in single-cell transcriptomics data. bioRxiv; 2019:794875.
- [76] Merkle EC, You D, Preacher KJ. Testing nonnested structural equation models. Psychol Methods. 2016; 21(2): 151.
- [77] Stephens M. False discovery rates: a new deal. Biostatistics. 2017; 18(2): 275-294.
- [78] Zhu A, Ibrahim JG, Love MI. Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. 2019; 35(12): 2084-2092.
- [79] Zhang M, Liu S, Miao Z, Han F, Gottardo R, Sun W. IDEAS: individual level differential expression analysis for single-cell RNA-seq data. Genome Biol. 2022; 23(1): 1-17.
- [80] Preisser JS, Das K, Long DL, Divaris K. Marginalized zero-inflated negative binomial regression with application to dental caries. Stat Med. 2016; 35(10): 1722-1735.
- [81] Long DL, Preisser JS, Herring AH, Golin CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Stat Med. 2014; 33(29): 5151-5165.