Volume 41, Issue 18 pp. 3492-3510
RESEARCH ARTICLE

Differential expression of single-cell RNA-seq data using Tweedie models

Himel Mallick

Corresponding Author

Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, New Jersey, USA

Correspondence: Himel Mallick, Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA.

Email: [email protected]

Ali Rahnavard, Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC 20052, USA.

Email: [email protected]

Stephanie C. Hicks, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA.

Email: [email protected]

Suvo Chatterjee

Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, USA

Shrabanti Chowdhury

Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA

Saptarshi Chatterjee

Department of Statistics, Data and Analytics, Eli Lilly & Company, Indianapolis, Indiana, USA

Ali Rahnavard

Corresponding Author

Computational Biology Institute, Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA

Stephanie C. Hicks

Corresponding Author

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA

First published: 02 June 2022
Citations: 7
Himel Mallick and Suvo Chatterjee contributed equally to this article.

Funding information: Bill and Melinda Gates Foundation, Grant/Award Number: INV-016930; Division of Environmental Biology, Grant/Award Number: DEB-2028280; National Human Genome Research Institute, Grant/Award Number: R00HG009007

Abstract

The performance of computational methods and software for identifying differentially expressed features in single-cell RNA-sequencing (scRNA-seq) data has been shown to be influenced by several factors, including the choice of normalization method and the choice of experimental platform (or library preparation protocol) used to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on its own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models, which can flexibly capture the large dynamic range of scRNA-seq expression profiles observed across experimental platforms, a range induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows the probability mass at zero to exceed that of a traditional Tweedie distribution, in order to model zero-inflated scRNA-seq data with excess zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
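
As an illustration of the zero-inflation idea in the abstract, the following minimal Python sketch uses the standard compound Poisson-gamma form of the Tweedie distribution (power parameter 1 < p < 2, variance φμ^p), whose zero mass is exp(−μ^(2−p) / (φ(2−p))). A zero-inflated variant mixes in an extra point mass π at zero. The function names and the parameters `mu`, `phi`, `p`, and `pi0` are assumptions made for this illustration. They are not taken from the Tweedieverse package, and the article's own parameterization may differ.

```python
import math


def tweedie_variance(mu, phi, p):
    """Tweedie variance function: Var(Y) = phi * mu**p."""
    return phi * mu ** p


def tweedie_zero_prob(mu, phi, p):
    """P(Y = 0) for a compound Poisson-gamma Tweedie variable (1 < p < 2)."""
    if not 1.0 < p < 2.0:
        raise ValueError("a point mass at zero exists only for 1 < p < 2")
    if mu <= 0 or phi <= 0:
        raise ValueError("mu and phi must be positive")
    # Rate of the latent Poisson count; Y = 0 exactly when that count is 0.
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    return math.exp(-lam)


def zero_inflated_tweedie_zero_prob(mu, phi, p, pi0):
    """P(Y = 0) when an extra point mass pi0 at zero is mixed in.

    Here mu is the mean of the Tweedie component (an assumption of this
    sketch); the overall mean of the mixture is (1 - pi0) * mu.
    """
    if not 0.0 <= pi0 <= 1.0:
        raise ValueError("pi0 must lie in [0, 1]")
    return pi0 + (1.0 - pi0) * tweedie_zero_prob(mu, phi, p)


if __name__ == "__main__":
    base = tweedie_zero_prob(mu=1.0, phi=1.0, p=1.5)
    inflated = zero_inflated_tweedie_zero_prob(mu=1.0, phi=1.0, p=1.5, pi0=0.3)
    print(f"Tweedie zero mass:               {base:.4f}")
    print(f"Zero-inflated Tweedie zero mass: {inflated:.4f}")
```

For any π > 0 the mixture places strictly more mass at zero than the plain Tweedie component, which is the sense in which the zero probability mass can exceed that of a traditional Tweedie distribution.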

DATA AVAILABILITY STATEMENT

Previously published data used in this study are appropriately cited in the main text as well as in the References section. The detailed data summary is provided in Table S1. Unless otherwise noted, most of the corresponding annotated digital expression matrices are available from the NCBI Gene Expression Omnibus database. In addition, analysis scripts to process and analyse these datasets are available at https://github.com/himelmallick/Tweedie_SingleCell