Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling

18 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Computational Biology, Cell Painting, Batch Correction, Low-Rank Matrix Approximation, Nearest Neighbor Graphs, High-dimensional Data, Image-based Profiling
TL;DR: BALANS is a fast, theoretically grounded method for batch correction in Cell Painting data, which estimates a sparse affinity matrix using batch-dependent scales and an optimal adaptive sampling strategy.
Abstract: Cell Painting is a microscopy-based, high-content imaging assay that captures rich morphological profiles of cells. By revealing how cells respond to different chemical perturbations, it can provide valuable insight for drug discovery. However, Cell Painting data suffers from batch effects caused by variations across laboratories, instruments, and protocols. These batch-dependent artifacts obscure biological signals, especially at scale. We introduce BALANS (read "balance"), short for Batch Alignment via Local Affinities and Subsampling, a scalable batch correction method that aligns samples across batches using a smoothing affinity matrix built from pairwise distances between data points. Given $n$ data points, BALANS constructs a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ following two key ideas. First, for data points $i$ and $j$, it defines a local "scale" based on the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$. The affinities $A_{ij}$ are then computed using a Gaussian kernel calibrated by these local scales to account for batch-specific variation. Second, instead of populating all $n^2$ entries of $A$, BALANS employs an adaptive sampling strategy that incrementally computes rows corresponding to points with low cumulative neighbor coverage and, within each row, retains only the highest affinities. This yields a sparse but informative submatrix of $A$. We prove that this novel sampling strategy is order-optimal in terms of sample complexity and comes with an approximation guarantee. Crucially, BALANS runs in almost-linear time with respect to the number of data points. We evaluate BALANS across many real-world datasets spanning diverse biological conditions and batch structures. We demonstrate scalability on these real-world datasets and perform controlled scalability experiments on large-scale synthetic data to assess efficiency under varying size and complexity. In both cases, BALANS outperforms native implementations of popular batch correction methods in runtime without compromising batch correction quality.
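The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The exact kernel form $A_{ij} = \exp(-d_{ij}^2 / \sigma_i(b_j)^2)$, the greedy lowest-coverage row selection, and the names `batch_scaled_affinity_row`, `adaptive_sparse_affinity`, `top`, and `budget` are all assumptions made for illustration; the abstract specifies only that scales come from the $k$-th nearest neighbor within the batch of $j$ and that rows with low cumulative neighbor coverage are computed first.

```python
import numpy as np


def batch_scaled_affinity_row(X, batches, i, k=3):
    """Affinities from point i to every other point (hypothetical helper).

    Assumed form: sigma_i(b) is the distance from i to its k-th nearest
    neighbor inside batch b, and A_ij = exp(-(d_ij / sigma_i(batch_j))**2).
    """
    d = np.linalg.norm(X - X[i], axis=1)  # distances from i to all points
    row = np.zeros(len(X))
    for b in np.unique(batches):
        idx = np.flatnonzero(batches == b)
        idx = idx[idx != i]  # never count the point as its own neighbor
        if idx.size == 0:
            continue
        kk = min(k, idx.size)
        # k-th smallest distance within batch b, used as the local scale
        sigma = max(np.partition(d[idx], kk - 1)[kk - 1], 1e-12)
        row[idx] = np.exp(-(d[idx] / sigma) ** 2)
    return row


def adaptive_sparse_affinity(X, batches, k=3, top=5, budget=None):
    """Greedy stand-in for adaptive row sampling (an assumption).

    Repeatedly computes the row of the unsampled point with the lowest
    cumulative coverage and keeps only its `top` largest affinities.
    Returns {row index: (column indices, affinity values)}.
    """
    n = len(X)
    budget = n if budget is None else min(budget, n)
    coverage = np.zeros(n)            # affinity mass received so far
    sampled = np.zeros(n, dtype=bool)
    rows = {}
    for _ in range(budget):
        candidates = np.flatnonzero(~sampled)
        i = candidates[np.argmin(coverage[candidates])]
        row = batch_scaled_affinity_row(X, batches, i, k)
        keep = np.argsort(row)[-top:]  # indices of the largest affinities
        keep = keep[row[keep] > 0]     # drop entries that underflowed to 0
        rows[i] = (keep, row[keep])
        coverage[keep] += row[keep]
        sampled[i] = True
    return rows
```

In this sketch each computed row costs $O(n)$ distance evaluations, so the total work is governed by the hypothetical `budget` of rows rather than by $n^2$. The sampling rule, stopping criterion, and guarantees described in the abstract are not reproduced here.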
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 12225