Short talk: Sandwich Estimators for Differential Expression Analysis of Multi-Subject scRNA-seq data

Milan Malfait, Jeroen Gilis, Koen Van den Berge, Alemu Takele Assefa, Bie Verbist, Lieven Clement

Department of Applied Mathematics, Computer Science and Statistics, Ghent University

Abstract

Single-cell transcriptomics (scRNA-seq) is a disruptive technology that promises to further unravel the molecular basis of complex biological processes. Recently, there has been a shift toward scRNA-seq experiments with multiple biological subjects. Indeed, multiple biological repeats are key to extracting reproducible transcript and gene signatures and biomarkers. However, because multiple cells of multiple cell types are measured from the same subject, their expression profiles are bound to be correlated. This poses new challenges for the statistical analysis of these data. Currently, most differential expression (DE) methods cannot address this hierarchical correlation structure, which makes them unreliable for prioritizing DE transcripts and genes. A popular approach is to aggregate the data by summarizing the counts of all cells from a given subject and then to apply traditional bulk RNA-seq DE methods. However, this approach has several drawbacks. First, aggregation discards information, because the fine-grained single-cell resolution is lost. Second, cell subpopulations must be modeled separately, both because their mean-variance trends may differ and because the correlation between cell types from the same subject would otherwise have to be addressed. Hence, the bulk approach cannot prioritize genes that respond differently to treatment in distinct cell types, i.e. genes with interaction effects between cell type and treatment. Recent benchmarks also suggest that the aggregation approach may be underpowered. Here, we propose the use of generalized estimating equations (GEEs), which can address the hierarchical correlation structure of multi-subject scRNA-seq data. We tested this approach using mock analyses and simulation studies based on real scRNA-seq data.
We show that, in contrast to standard and bespoke methods for assessing DE at the single-cell level, GEEs correctly control the false discovery rate (FDR) and yield uniform p-value distributions under the null hypothesis. We find that the GEE method performs similarly to the aggregation approach for simple experimental designs but can also handle more complex designs involving multiple cell types. Based on our results, we provide guidelines to help researchers make the most of their data.
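
To illustrate why the within-subject correlation matters, the following is a minimal sketch of a cluster-robust (sandwich) standard error for a single gene in a two-group comparison. It is not the authors' implementation. It assumes a Poisson mean model with a log link, an independence working correlation, and one mean parameter per treatment group, in which case the sandwich variance of each log mean has a closed form. The abstract does not specify the mean model, working correlation, or any small-sample correction, so these choices, the function names, and the toy counts are all illustrative assumptions.

```python
import math
from collections import defaultdict


def log_mean_variances(counts, subjects):
    """Log mean count of one group with its naive and sandwich variances.

    Assumes a Poisson log-linear model with an independence working
    correlation (an illustrative choice, not taken from the abstract).
    """
    n_cells = len(counts)
    mu = sum(counts) / n_cells

    # "Bread": sum over cells of mu, the model-based information.
    bread = n_cells * mu

    # "Meat": residuals are summed within each subject before squaring,
    # so correlated cells from one subject do not count as independent.
    resid_by_subject = defaultdict(float)
    for y, s in zip(counts, subjects):
        resid_by_subject[s] += y - mu
    meat = sum(r * r for r in resid_by_subject.values())

    naive_var = 1.0 / bread            # treats every cell as independent
    sandwich_var = meat / bread ** 2   # bread^-1 * meat * bread^-1
    return math.log(mu), naive_var, sandwich_var


def compare_groups(counts_a, subj_a, counts_b, subj_b):
    """Log fold change (b vs a) with naive and sandwich standard errors."""
    log_a, naive_a, sand_a = log_mean_variances(counts_a, subj_a)
    log_b, naive_b, sand_b = log_mean_variances(counts_b, subj_b)
    lfc = log_b - log_a
    naive_se = math.sqrt(naive_a + naive_b)
    sandwich_se = math.sqrt(sand_a + sand_b)
    # Two-sided p-value from a normal approximation to the Wald statistic.
    p_value = math.erfc(abs(lfc / sandwich_se) / math.sqrt(2))
    return lfc, naive_se, sandwich_se, p_value


# Toy data (made up): two subjects per group, three cells per subject,
# with strong subject-to-subject differences in expression level.
control = [2, 2, 2, 6, 6, 6]
control_subj = ["s1", "s1", "s1", "s2", "s2", "s2"]
treated = [4, 4, 4, 12, 12, 12]
treated_subj = ["s3", "s3", "s3", "s4", "s4", "s4"]

lfc, naive_se, sandwich_se, p_value = compare_groups(
    control, control_subj, treated, treated_subj
)
# lfc = log(2), naive_se = 0.25, sandwich_se = 0.5
```

In this toy example the naive standard error is half the sandwich standard error, which shows how treating cells from the same subject as independent can overstate the evidence for DE. With only a handful of subjects, sandwich estimators are known to be imprecise, so a real analysis would need more care than this sketch provides.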