BSMAD: Bridging Semantic and Structural Manifolds for Robust Cross-Modality Medical Anomaly Detection
Keywords: Anomaly Detection · Vision-Language Models
TL;DR: BSMAD bridges semantic (CLIP) and structural (DINOv3) feature manifolds via gated cross-attention and pixel-consistent top-$k$ scoring for robust zero-shot cross-modality medical anomaly detection.
Abstract: Zero-shot medical anomaly detection without exhaustive annotations remains a formidable challenge, particularly under extreme cross-modality domain shifts and subtle structural variations.
While vision--language models (e.g., CLIP) enable anomaly reasoning via global semantic alignment, they inherently lack sensitivity to fine-grained geometric irregularities and \emph{subtle anatomical defects} (e.g., fuzzy-boundary lesions or soft-tissue deformations).
Furthermore, the inherent mismatch between pixel-level evidence and image-level scoring severely degrades cross-modality robustness.
In this work, we formulate robust medical anomaly detection as measuring complementary deviations from dual representation manifolds.
Specifically, we leverage semantic-sensitive encoders (e.g., CLIP) for global contextual alignment, alongside structure-sensitive self-supervised Vision Transformers (e.g., DINOv3) to capture local geometric consistency.
Building upon this perspective, we propose \textbf{BSMAD}, a novel framework that bridges semantic and structural manifolds for universal medical anomaly detection.
To synergize these distinct representations, BSMAD integrates them via a lightweight \textbf{cross-attention mechanism}---where semantic tokens query structural features to explicitly inject geometric priors into the semantic space---followed by residual adapters with adaptive layer-wise gating to stabilize dense anomaly evidence.
To resolve cross-level decision mismatches, image-level anomaly scores are derived directly from aggregated pixel responses (top-$k$ pooling) and optimized via a novel pixel--image consistency regularization.
Extensive experiments under a stringent cross-modality zero-shot setting---training exclusively on superficial and projective modalities (e.g., dermoscopy, X-ray) while evaluating on completely unseen volumetric and cavity modalities (e.g., MRI, CT, endoscopy)---demonstrate that BSMAD significantly enhances structural sensitivity and achieves state-of-the-art zero-shot robustness without any target-domain fine-tuning.
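To make the cross-attention bridging described in the abstract concrete, below is a minimal PyTorch sketch in which semantic tokens (e.g., CLIP patch features) query structural tokens (e.g., DINOv3 patch features), and the attended output is injected through a gated residual adapter. All names, dimensions, and the zero-initialized tanh gate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticStructuralBridge(nn.Module):
    """Illustrative sketch: semantic tokens query structural tokens via
    cross-attention; a lightweight residual adapter with a learnable
    per-layer gate injects the resulting geometric priors."""

    def __init__(self, sem_dim=768, struct_dim=768, num_heads=8, adapter_dim=64):
        super().__init__()
        self.proj_struct = nn.Linear(struct_dim, sem_dim)   # align structural dim to semantic dim
        self.cross_attn = nn.MultiheadAttention(sem_dim, num_heads, batch_first=True)
        self.adapter = nn.Sequential(                       # lightweight residual adapter
            nn.Linear(sem_dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, sem_dim)
        )
        self.gate = nn.Parameter(torch.zeros(1))            # layer-wise gate, starts closed

    def forward(self, sem_tokens, struct_tokens):
        # sem_tokens: (B, N_sem, sem_dim); struct_tokens: (B, N_struct, struct_dim)
        kv = self.proj_struct(struct_tokens)
        geom, _ = self.cross_attn(query=sem_tokens, key=kv, value=kv)      # semantic queries attend to structure
        return sem_tokens + torch.tanh(self.gate) * self.adapter(geom)     # gated residual fusion
```

Starting the gate at zero is one common way to let the fused branch ramp up gradually during training; whether BSMAD uses this initialization is not stated in the abstract.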
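Likewise, a hedged sketch of the top-$k$ pooling used to derive image-level scores from dense pixel responses, together with one plausible form of the pixel--image consistency regularizer. The separate global logit (`global_logit`) and the MSE-over-sigmoid formulation are hypothetical choices for illustration; the exact loss in BSMAD may differ.

```python
import torch
import torch.nn.functional as F

def topk_image_score(anomaly_map, k=50):
    """anomaly_map: (B, H, W) or (B, N) dense anomaly scores -> (B,) image-level
    score obtained by averaging the k strongest pixel responses."""
    flat = anomaly_map.flatten(1)
    vals, _ = flat.topk(min(k, flat.shape[1]), dim=1)
    return vals.mean(dim=1)

def pixel_image_consistency(anomaly_map, global_logit, k=50):
    """Penalize disagreement between pooled pixel evidence and a global
    image-level score (MSE here is an illustrative choice)."""
    pooled = topk_image_score(anomaly_map, k)
    return F.mse_loss(torch.sigmoid(global_logit), torch.sigmoid(pooled))
```

In practice such a consistency term would be added to the detection objective with a weighting coefficient.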
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 30