CHOIR improves significance-based detection of cell types and states from single-cell data

Year: 2025;  
Journal: Nature Genetics;  
Volume: 57;  
Issue: 5;  
Abstract:

Clustering is a critical step in the analysis of single-cell data, enabling the discovery and characterization of cell types and states. However, most popular clustering tools do not subject results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (cluster hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine clusters representing distinct populations. We demonstrate the performance of CHOIR through extensive benchmarking against 15 existing clustering methods across 230 simulated and five real single-cell RNA sequencing, assay for transposase-accessible chromatin sequencing, spatial transcriptomic and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable and robust solution to the challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.
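The core idea in the abstract, using a classifier plus a permutation test to decide whether two candidate clusters are distinct populations, can be illustrated with a small sketch. This is not CHOIR's implementation: a small ensemble of randomized decision stumps stands in for a random forest so that the example needs only the Python standard library. The function names (`simulate_two_clusters`, `fit_stump`, `clusters_are_distinct`), the toy Gaussian data, the 50/50 holdout split, and the thresholds are all assumptions made for illustration.

```python
import random


def simulate_two_clusters(shift, n_per_cluster=40, n_features=5, seed=1):
    """Toy data (assumption): cluster 0 ~ N(0, 1), cluster 1 ~ N(shift, 1) per feature."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_cluster):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


def fit_stump(X, y, rng):
    """Fit one decision stump on a bootstrap sample, using one random feature."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
    f = rng.randrange(len(X[0]))                # random feature choice
    v0 = [X[i][f] for i in idx if y[i] == 0]
    v1 = [X[i][f] for i in idx if y[i] == 1]
    if not v0 or not v1:
        const = 0 if v0 else 1                  # only one class drawn: constant stump
        return (f, 0.0, const, const)
    m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
    high = 1 if m1 >= m0 else 0                 # label assigned above the threshold
    return (f, (m0 + m1) / 2, 1 - high, high)


def predict(forest, x):
    """Majority vote over all stumps."""
    votes = sum((hi if x[f] > thr else lo) for f, thr, lo, hi in forest)
    return 1 if 2 * votes > len(forest) else 0


def holdout_accuracy(X, y, rng, n_stumps=25):
    """Train on a random half of the cells, report accuracy on the other half."""
    order = list(range(len(X)))
    rng.shuffle(order)
    half = len(order) // 2
    train, test = order[:half], order[half:]
    X_train = [X[i] for i in train]
    y_train = [y[i] for i in train]
    forest = [fit_stump(X_train, y_train, rng) for _ in range(n_stumps)]
    return sum(predict(forest, X[i]) == y[i] for i in test) / len(test)


def clusters_are_distinct(X, y, n_perm=50, alpha=0.05, seed=0):
    """Permutation test: is accuracy on true labels higher than on shuffled labels?"""
    rng = random.Random(seed)
    observed = holdout_accuracy(X, y, rng)
    exceed = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)                   # break the label/feature link
        if holdout_accuracy(X, shuffled, rng) >= observed:
            exceed += 1
    p_value = (1 + exceed) / (1 + n_perm)       # add-one permutation p-value
    return p_value < alpha, observed, p_value


# Two well-separated toy clusters: the test should favor keeping them split.
X, y = simulate_two_clusters(shift=3.0)
distinct, accuracy, p_value = clusters_are_distinct(X, y)
print(f"distinct={distinct} accuracy={accuracy:.2f} p={p_value:.3f}")
```

In a hierarchical setting, a check of this kind would be applied to pairs of candidate clusters along the clustering tree, merging pairs that the test cannot tell apart. How CHOIR traverses the tree, builds its classifiers, and corrects for multiple comparisons is described in the paper itself and is not reproduced here.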