BSMAD: Bridging Semantic and Structural Manifolds for Robust Cross-Modality Medical Anomaly Detection
Keywords: Anomaly Detection · Vision-Language Models
TL;DR: BSMAD bridges semantic (CLIP) and structural (DINOv3) feature manifolds via gated cross-attention and pixel-consistent top-$k$ scoring for robust zero-shot cross-modality medical anomaly detection.
Abstract: Zero-shot medical anomaly detection without exhaustive annotations remains a formidable challenge, particularly under extreme cross-modality domain shifts and subtle structural variations.
While vision--language models (e.g., CLIP) enable anomaly reasoning via global semantic alignment, they inherently lack sensitivity to fine-grained geometric irregularities and \emph{subtle anatomical defects} (e.g., fuzzy-boundary lesions or soft-tissue deformations).
Furthermore, the inherent mismatch between pixel-level evidence and image-level scoring severely degrades cross-modality robustness.
In this work, we formulate robust medical anomaly detection as measuring complementary deviations from dual representation manifolds.
Specifically, we leverage semantic-sensitive encoders (e.g., CLIP) for global contextual alignment, alongside structure-sensitive self-supervised Vision Transformers (e.g., DINOv3) to capture local geometric consistency.
Building upon this perspective, we propose \textbf{BSMAD}, a novel framework that bridges semantic and structural manifolds for universal medical anomaly detection.
To synergize these distinct representations, BSMAD integrates them via a lightweight \textbf{cross-attention mechanism}---where semantic tokens query structural features to explicitly inject geometric priors into the semantic space---followed by residual adapters with adaptive layer-wise gating to stabilize dense anomaly evidence.
To resolve cross-level decision mismatches, image-level anomaly scores are derived directly from aggregated pixel responses (top-$k$ pooling) and optimized via a novel pixel--image consistency regularization.
Extensive experiments under a stringent cross-modality zero-shot setting---training exclusively on superficial and projective modalities (e.g., dermoscopy, X-ray) while evaluating on completely unseen volumetric and cavity modalities (e.g., MRI, CT, endoscopy)---demonstrate that BSMAD significantly enhances structural sensitivity and achieves state-of-the-art zero-shot robustness without any target-domain fine-tuning.
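To make the cross-attention bridging described in the abstract concrete, below is a minimal PyTorch sketch in which semantic tokens (e.g., CLIP patch features) query structural tokens (e.g., DINOv3 patch features), and the attended output is injected through a gated residual adapter. All names, dimensions, and the zero-initialized tanh gate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticStructuralBridge(nn.Module):
    """Illustrative sketch: semantic tokens query structural tokens via
    cross-attention; a lightweight residual adapter with a learnable
    per-layer gate injects the resulting geometric priors."""

    def __init__(self, sem_dim=768, struct_dim=768, num_heads=8, adapter_dim=64):
        super().__init__()
        self.proj_struct = nn.Linear(struct_dim, sem_dim)   # align structural dim to semantic dim
        self.cross_attn = nn.MultiheadAttention(sem_dim, num_heads, batch_first=True)
        self.adapter = nn.Sequential(                       # lightweight residual adapter
            nn.Linear(sem_dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, sem_dim)
        )
        self.gate = nn.Parameter(torch.zeros(1))            # layer-wise gate, starts closed

    def forward(self, sem_tokens, struct_tokens):
        # sem_tokens: (B, N_sem, sem_dim); struct_tokens: (B, N_struct, struct_dim)
        kv = self.proj_struct(struct_tokens)
        geom, _ = self.cross_attn(query=sem_tokens, key=kv, value=kv)      # semantic queries attend to structure
        return sem_tokens + torch.tanh(self.gate) * self.adapter(geom)     # gated residual fusion
```

Starting the gate at zero is one common way to let the fused branch ramp up gradually during training; whether BSMAD uses this initialization is not stated in the abstract.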
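Likewise, a hedged sketch of the top-$k$ pooling used to derive image-level scores from dense pixel responses, together with one plausible form of the pixel--image consistency regularizer. The separate global logit (`global_logit`) and the MSE-over-sigmoid formulation are hypothetical choices for illustration; the exact loss in BSMAD may differ.

```python
import torch
import torch.nn.functional as F

def topk_image_score(anomaly_map, k=50):
    """anomaly_map: (B, H, W) or (B, N) dense anomaly scores -> (B,) image-level
    score obtained by averaging the k strongest pixel responses."""
    flat = anomaly_map.flatten(1)
    vals, _ = flat.topk(min(k, flat.shape[1]), dim=1)
    return vals.mean(dim=1)

def pixel_image_consistency(anomaly_map, global_logit, k=50):
    """Penalize disagreement between pooled pixel evidence and a global
    image-level score (MSE here is an illustrative choice)."""
    pooled = topk_image_score(anomaly_map, k)
    return F.mse_loss(torch.sigmoid(global_logit), torch.sigmoid(pooled))
```

In practice such a consistency term would be added to the detection objective with a weighting coefficient.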
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 30