HeliCon: Dual-Level Contrastive Alignment for Robust Medical VQA under Long-Tailed Distribution

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Long-tailed distribution, Dual-level contrastive alignment, Retrieval-augmented reasoning, Semantic consistency
TL;DR: We propose a dual-level contrastive alignment framework that aligns multimodal inputs with answer embeddings to enhance robustness and semantic consistency in medical VQA, particularly for rare concepts.
Abstract: Learning robust multimodal representations for medical visual question answering (Med-VQA) is challenging due to the imbalanced distribution of semantic concepts. Frequent clinical patterns form robust embedding structures, while rare yet clinically important concepts often yield fragile representations, hindering reliable reasoning. To address this issue, we propose HeliCon, a dual-level contrastive alignment framework that follows a conceptual “double helix” structure. It intertwines two complementary mechanisms: (1) memory banks at the instance and prototype levels, which preserve sample diversity while enforcing semantically meaningful clustering; and (2) contrastive learning objectives at the hard and soft levels, which refine head embeddings and transfer relational knowledge to tail concepts. Together, these mechanisms yield robust and semantically consistent multimodal representations across both frequent and rare concepts. At inference, a retrieval-augmented mechanism further enriches contextual reasoning by leveraging relevant answer embeddings from the training set. Experiments on Med-VQA benchmarks demonstrate that HeliCon improves performance, including a 3.51$\%$ absolute gain over the state of the art on the PathVQA dataset, highlighting its effectiveness in producing robust representations under long-tailed data distributions.
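To make the two mechanisms concrete, below is a minimal PyTorch sketch of what a dual-level contrastive objective with instance- and prototype-level memory banks could look like. It is an illustrative reading of the abstract, not the paper's implementation: all names (`DualLevelContrastiveLoss`), dimensions, temperatures, queue size, and the EMA prototype update are assumptions.

```python
import torch
import torch.nn.functional as F

class DualLevelContrastiveLoss(torch.nn.Module):
    """Hypothetical sketch of HeliCon-style dual-level contrastive alignment.

    - Instance-level memory bank: a FIFO queue of recent answer embeddings,
      used as negatives in a hard, InfoNCE-style objective.
    - Prototype-level memory bank: one EMA prototype per answer class, used
      as soft targets that transfer relational structure to tail classes.
    All hyperparameters here are illustrative, not values from the paper.
    """

    def __init__(self, dim=256, num_classes=500, queue_size=4096,
                 tau_hard=0.07, tau_soft=0.2, momentum=0.99):
        super().__init__()
        self.tau_hard, self.tau_soft, self.m = tau_hard, tau_soft, momentum
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))
        self.register_buffer("protos", F.normalize(torch.randn(num_classes, dim), dim=1))

    @torch.no_grad()
    def _update_banks(self, ans_emb, labels):
        # FIFO update of the instance-level queue.
        n, ptr = ans_emb.size(0), int(self.queue_ptr)
        idx = torch.arange(ptr, ptr + n, device=ans_emb.device) % self.queue.size(0)
        self.queue[idx] = ans_emb
        self.queue_ptr[0] = (ptr + n) % self.queue.size(0)
        # EMA update of the prototype bank for classes seen in this batch.
        for c in labels.unique():
            mean = ans_emb[labels == c].mean(0)
            self.protos[c] = F.normalize(self.m * self.protos[c] + (1 - self.m) * mean, dim=0)

    def forward(self, query, ans_emb, labels):
        """query: fused image+question embedding; ans_emb: paired answer embedding."""
        query, ans_emb = F.normalize(query, dim=1), F.normalize(ans_emb, dim=1)

        # Hard (instance-level) InfoNCE: positive = paired answer, negatives = queue.
        pos = (query * ans_emb).sum(1, keepdim=True)            # (B, 1)
        neg = query @ self.queue.t()                            # (B, K)
        logits = torch.cat([pos, neg], dim=1) / self.tau_hard
        targets = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
        hard_loss = F.cross_entropy(logits, targets)

        # Soft (prototype-level) objective: match the query's similarity
        # distribution over prototypes to the answer's, so relational
        # structure learned on head classes informs tail classes.
        q_dist = F.log_softmax(query @ self.protos.t() / self.tau_soft, dim=1)
        a_dist = F.softmax((ans_emb @ self.protos.t()).detach() / self.tau_soft, dim=1)
        soft_loss = F.kl_div(q_dist, a_dist, reduction="batchmean")

        self._update_banks(ans_emb.detach(), labels)
        return hard_loss + soft_loss
```

The retrieval-augmented inference step could likewise be read as a k-nearest-neighbor lookup over stored training-set answer embeddings; the following sketch, including the `train_bank` tensor and the fusion-by-averaging step, is again an assumption rather than the paper's mechanism.

```python
@torch.no_grad()
def retrieve_and_fuse(query, train_bank, k=5, alpha=0.5):
    # Fetch the top-k nearest answer embeddings from the training set
    # and blend them into the query as additional context.
    query = F.normalize(query, dim=1)
    sims = query @ F.normalize(train_bank, dim=1).t()   # (B, N)
    topk = sims.topk(k, dim=1).indices                  # (B, k)
    context = train_bank[topk].mean(1)                  # (B, dim)
    return F.normalize(alpha * query + (1 - alpha) * context, dim=1)
```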
Primary Area: generative models
Submission Number: 4119