Boundary Guidance for Efficient 3D CT Vision–Language Reasoning

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: boundary
Abstract: Vision--language models (VLMs) for 3D computed tomography (CT) analysis face the dual challenge of achieving precise visual grounding in high-dimensional data and maintaining computational efficiency. Although state-of-the-art models with multi-billion-parameter decoders have demonstrated strong performance, their attention mechanisms are often distracted by clinically irrelevant but visually similar confounding features, leading to errors in reasoning. To mitigate this, we introduce \textbf{Dual-Polarity Bounding Box Prompting}, a novel visual instruction method that provides both positive and negative spatial cues. For each question, we overlay a \textbf{green box} on the region of interest (ROI) and a \textbf{red box} on a plausible but incorrect distractor region. This contrastive prompting scheme explicitly trains the model to attend to relevant evidence while actively ignoring confounding information. We pair this technique with compact Qwen decoders (0.5B to 3B parameters) and evaluate it on the RadGenome-ChestCT and PMC-VQA benchmarks. Our results show that this dual-prompt strategy substantially improves both closed-ended and open-ended VQA performance. Notably, our 1.5B model, guided by dual-polarity prompts, surpasses the accuracy of a 7B baseline model, demonstrating that explicit negative guidance is a highly effective, parameter-efficient approach to enhancing the reliability and evidence-based reasoning of medical VLMs.
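To make the prompting scheme concrete, the sketch below shows one plausible way to render the dual-polarity cues: window an axial CT slice to Hounsfield units, convert it to RGB, and draw a hollow green box over the ROI and a hollow red box over the distractor region before the image is passed to the decoder. This is a minimal illustration, not the authors' implementation; the HU window, function names, and (y0, x0, y1, x1) box format are assumptions.

import numpy as np

def overlay_box(rgb: np.ndarray, box, color, thickness=2):
    # Draw a hollow rectangle (y0, x0, y1, x1) in-place on an HxWx3 RGB array.
    y0, x0, y1, x1 = box
    for t in range(thickness):
        rgb[y0 + t, x0:x1] = color          # top edge
        rgb[y1 - 1 - t, x0:x1] = color      # bottom edge
        rgb[y0:y1, x0 + t] = color          # left edge
        rgb[y0:y1, x1 - 1 - t] = color      # right edge

def dual_polarity_prompt(ct_slice: np.ndarray, roi_box, distractor_box):
    # Window the slice to an assumed HU range, map to 8-bit RGB,
    # then add the positive (green) and negative (red) spatial cues.
    lo, hi = -1000.0, 400.0                  # example lung-to-soft-tissue window (assumption)
    norm = np.clip((ct_slice - lo) / (hi - lo), 0.0, 1.0)
    rgb = (np.stack([norm] * 3, axis=-1) * 255).astype(np.uint8)
    overlay_box(rgb, roi_box, color=(0, 255, 0))         # green: region of interest
    overlay_box(rgb, distractor_box, color=(255, 0, 0))  # red: plausible distractor
    return rgb

Under this reading, the VLM never receives the raw slice alone: every training and evaluation image already carries both cues, so the model can learn to use the green box as evidence and the red box as a region to discount.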
Submission Number: 104