Clustering promises to reveal natural groupings in data: cell types, disease subtypes, functional modules. But clustering almost always "works" in the sense that it produces visually satisfying partitions, regardless of whether those partitions reflect biological reality or statistical artifacts. Give a clustering algorithm random data and it will still produce clusters; the question is whether those clusters mean anything.
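To make the random-data point concrete, here is a minimal sketch, assuming plain k-means (Lloyd's algorithm) as the clustering method and uniformly random 2D points as the "data". The function names `kmeans` and `inertia`, the choice of k = 4, and the seeds are all illustrative assumptions, not anything taken from a real analysis pipeline.

```python
import random

def sq_dist(p, c):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 2D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Structureless input: 300 points drawn uniformly from the unit square.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(300)]

k = 4  # arbitrary choice; the data gives no reason to prefer any k
labels, centroids = kmeans(points, k)
cluster_sizes = [labels.count(j) for j in range(k)]

# Baseline: treat everything as one cluster around the global mean.
global_mean = (sum(x for x, _ in points) / len(points),
               sum(y for _, y in points) / len(points))
total_ss = sum(sq_dist(p, global_mean) for p in points)
within_ss = inertia(points, labels, centroids)

print("cluster sizes:", cluster_sizes)
print("within-cluster SS:", within_ss, "vs. single-cluster SS:", total_ss)
```

The algorithm returns a tidy partition and a lower within-cluster sum of squares than the single-cluster baseline, even though the points were generated with no group structure at all. A better fit score therefore says nothing, on its own, about whether the groups are real.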
The problem is that most biological data contains continuous variation rather than discrete categories, yet we force it into bins because human minds handle categories better than gradients. A t-SNE plot showing distinct islands of cells looks compelling, but those sharp boundaries may reflect algorithmic choices rather than genuine biological discontinuities.
This matters because downstream analyses treat these clusters as if they were real biological entities—we name them, study their markers, make mechanistic claims about their functions. The most honest approach acknowledges that clustering is a useful fiction for organizing complexity, not a discovery of natural kinds, and that the "right" clustering often depends on what question we’re asking rather than being an intrinsic property of the data.
Dec 5 at 12:26 PM