CP-BG-1M: A Controlled Multi-View Benchmark for Density and Background Shortcuts in Morphology Profiling

Published: 02 Mar 2026, Last Modified: 17 Apr 2026
Venue: MLGenX 2026 Poster
License: CC BY 4.0
Abstract: We introduce CP-BG-1M, a diagnostic framework for detecting cell-density and background-mediated shortcut learning in representation models for high-throughput morphology profiling. The dataset contains ∼1 million quality-controlled single-cell tiles from the JUMP-CP Target2 dataset, imaged across multiple production sites. Each cell is provided in four synchronized views that preserve center-cell morphology while selectively exposing either background context or an explicit, source-agnostic density signal. This controlled design enables direct shortcut testing: morphology-driven embeddings should remain stable across views, whereas shortcut-dependent models should lose performance when background is removed and partially recover it when density is reintroduced without altering morphology. Using a DINOv3 ViT-B baseline with LoRA, trained under a chemical-similarity contrastive objective, we find that conclusions depend strongly on the evaluation metric. Segmented representations outperform crops in compound retrieval (recall@10: 0.37–0.38 vs. 0.29–0.30) and phenotypic activity detection (98.67% vs. 67.44–78.41% significant replicate agreement), while also improving batch mixing. These findings show that metric choice can invert model rankings by rewarding shortcut signals, positioning CP-BG-1M as a practical tool for diagnosing and mitigating cell-density confounding.
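The shortcut test described in the abstract can be illustrated with a minimal sketch. It assumes embeddings have already been computed for the same cells under two views, and it uses one common definition of recall@k (a query counts as a hit if any of its k nearest neighbors, excluding itself, shares its compound label). The function names, the cosine-similarity choice, and this recall definition are assumptions for illustration and may differ from the evaluation protocol used in the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (guarding against zero vectors)."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def recall_at_k(embeddings, labels, k=10):
    """Fraction of queries whose top-k cosine neighbors (self excluded)
    contain at least one item with the same label.

    Assumed definition; the paper's exact retrieval metric may differ.
    """
    z = l2_normalize(np.asarray(embeddings, dtype=float))
    labels = np.asarray(labels)
    k = min(k, len(labels) - 1)           # never let self re-enter the top-k
    sim = z @ z.T                         # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)        # exclude each query from its own results
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (labels[top_k] == labels[:, None]).any(axis=1)
    return float(hits.mean())


def view_stability(emb_a, emb_b):
    """Mean cosine similarity between embeddings of the same cells in two
    views; row i of emb_a and row i of emb_b must be the same cell."""
    a = l2_normalize(np.asarray(emb_a, dtype=float))
    b = l2_normalize(np.asarray(emb_b, dtype=float))
    return float((a * b).sum(axis=1).mean())


def shortcut_gap(emb_with_bg, emb_without_bg, labels, k=10):
    """Drop in recall@k when background is removed. A large positive gap
    suggests that retrieval relied on background or density cues."""
    return recall_at_k(emb_with_bg, labels, k) - recall_at_k(emb_without_bg, labels, k)
```

Under these assumptions, a morphology-driven model would show `view_stability` close to 1 and a `shortcut_gap` close to 0 between a background-exposing view and a segmented view, while a shortcut-dependent model would show a clearly positive gap.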
Track: Main track
AI Policy Confirmation: I confirm that this submission clearly discloses the role of AI systems and human contributors and complies with the ICLR 2026 Policies on Large Language Model Usage and the ICLR Code of Ethics.
Submission Number: 72