HiPro-CT: A Hierarchical Probabilistic Framework for 3D Medical Vision-Language Alignment

04 Dec 2025 (modified: 15 Dec 2025), MIDL 2026 Conference Submission, CC BY 4.0
Keywords: Vision-Language Models, Computed Tomography, Probabilistic Embedding, Uncertainty Modeling, Fine-grained Alignment
Abstract: Vision-Language Models (VLMs) for 3D medical imaging face two core obstacles: local features are diluted by the granularity mismatch between volumetric data and textual reports, and deterministic embeddings fail to capture the semantic uncertainty intrinsic to clinical descriptions. To address both, we propose HiPro-CT, a novel Hierarchical Probabilistic framework for 3D medical vision-language alignment. Unlike conventional point-based methods, HiPro-CT projects images and texts into Gaussian probability distributions, using the variance to explicitly measure uncertainty and to improve robustness against incompleteness and polysemy. A Soft Masked Pooling strategy performs weighted feature aggregation guided by anatomical masks, achieving accurate organ-level alignment while retaining boundary context. In addition, a Hierarchical Inclusion Loss imposes geometric constraints on the embedding space so that global volume distributions statistically contain local organ distributions. Extensive experiments on the RadGenome-Chest CT dataset show that HiPro-CT significantly surpasses state-of-the-art deterministic baselines (e.g., CT-CLIP) in zero-shot multi-abnormality detection and cross-modal retrieval, validating the effectiveness of combining fine-grained anatomical supervision with probabilistic representation learning.
Primary Subject Area: Foundation Models
Secondary Subject Area: Uncertainty Estimation
Registration Requirement: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 385
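
Illustrative sketch (not the authors' implementation): the abstract describes Soft Masked Pooling as weighted feature aggregation under anatomical mask guidance, and a Hierarchical Inclusion Loss that makes global distributions contain local ones. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names `soft_masked_pool` and `kl_diag_gaussians`, the flattened per-voxel feature layout, and the use of an asymmetric KL divergence between diagonal Gaussians as one possible way to score containment.

```python
import math


def soft_masked_pool(features, soft_mask, eps=1e-8):
    """Weighted mean of per-voxel feature vectors (hypothetical helper).

    features:  list of N feature vectors, each a list of D floats.
    soft_mask: list of N weights in [0, 1]. Fractional weights near an organ
               boundary let surrounding context contribute with less weight.
    """
    if len(features) != len(soft_mask):
        raise ValueError("features and soft_mask must have the same length")
    dim = len(features[0])
    total = sum(soft_mask)
    pooled = [0.0] * dim
    for vec, weight in zip(features, soft_mask):
        for d in range(dim):
            pooled[d] += weight * vec[d]
    # eps guards against division by zero when the mask is empty.
    return [p / (total + eps) for p in pooled]


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl


# Three voxels with 2-D features; the third voxel lies outside the organ.
voxel_features = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
organ_mask = [1.0, 1.0, 0.0]
organ_mean = soft_masked_pool(voxel_features, organ_mask)

# Assumed containment proxy: a narrow local (organ) Gaussian sitting inside a
# broad global (volume) Gaussian gives a small KL(local || global), while the
# reverse direction is penalized more heavily.
local_var = [0.1, 0.1]
global_var = [1.0, 1.0]
kl_local_in_global = kl_diag_gaussians(organ_mean, local_var, organ_mean, global_var)
kl_global_in_local = kl_diag_gaussians(organ_mean, global_var, organ_mean, local_var)
```

In this toy setup the masked-out voxel has no influence on the pooled organ feature, and the asymmetry of the KL divergence is what allows it to express "the global distribution covers the local one" rather than mere similarity. The actual loss used by HiPro-CT may differ.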