CP-BG-1M: A Controlled Multi-View Benchmark for Density and Background Shortcuts in Morphology Profiling

Published: 02 Mar 2026, Last Modified: 08 May 2026. Venue: MLGenX 2026 Poster. License: CC BY 4.0
Abstract: We introduce CP-BG-1M, a diagnostic framework for detecting cell-density and background-mediated shortcut learning in representation models for high-throughput morphology profiling. The dataset contains ∼1 million quality-controlled single-cell tiles from the JUMP-CP Target2 dataset, imaged across multiple production sites. Each cell is provided in four synchronized views that preserve center-cell morphology while selectively exposing background context or an explicit, source-agnostic density signal. This controlled design enables shortcut testing: morphology-driven embeddings should remain stable across views, whereas shortcut-dependent models show performance drops when background is removed and partial recovery when density is reintroduced without altering morphology. Using a DINOv3 ViT-B baseline with LoRA trained under a chemical-similarity contrastive objective, we find that conclusions depend strongly on the evaluation metric. Segmented representations outperform crops in compound retrieval (recall@10: 0.37–0.38 vs. 0.29–0.30) and phenotypic activity detection (98.67% vs. 67.44–78.41% significant replicate agreement), while also improving batch mixing. These findings show that metric choice can invert model rankings by rewarding shortcut signals, positioning CP-BG-1M as a practical tool to diagnose and mitigate cell-density confounding.
Submission Number: 72
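
The shortcut test described in the abstract can be illustrated with a minimal sketch, not the authors' implementation. It assumes recall@10 means "at least one of the 10 nearest neighbours by cosine similarity shares the query's compound label", which the abstract does not specify. The view names `full_context`, `background_removed`, and `density_only` are hypothetical placeholders, since the abstract does not name the four views.

```python
import numpy as np


def recall_at_k(embeddings, labels, k=10):
    """Fraction of queries with a same-label item among their k nearest
    neighbours under cosine similarity, excluding the query itself.
    ASSUMPTION: this retrieval definition is illustrative only."""
    x = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise rows; the floor avoids division by zero for null vectors.
    norms = np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    x = x / norms
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a query must not retrieve itself
    k = min(k, len(x) - 1)
    top = np.argsort(-sim, axis=1)[:, :k]
    hits = (labels[top] == labels[:, None]).any(axis=1)
    return float(hits.mean())


def shortcut_gaps(scores,
                  full="full_context",
                  masked="background_removed",
                  density="density_only"):
    """Summarise per-view scores as two gaps (view names are hypothetical).
    A large drop plus a positive recovery is the shortcut-dependent pattern;
    both gaps near zero is what a morphology-driven model should show."""
    return {
        "drop_without_background": scores[full] - scores[masked],
        "recovery_with_density": scores[density] - scores[masked],
    }
```

In use, one would embed the same cells under each view with the same model, compute `recall_at_k` per view, and pass the resulting dictionary to `shortcut_gaps`. Because the abstract reports that rankings can invert across metrics, the same comparison is worth repeating for each metric of interest.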