Abstract: 3D computed-tomography (CT) volumes present unique challenges for multimodal reasoning because vision–language models (VLMs) must align long spatial–temporal contexts with radiological text while remaining memory-efficient. Med3DVLM recently showed that a dedicated 3D vision encoder combined with a 7B-parameter language decoder can reach 79.95% closed-ended accuracy and 36.76% METEOR on the M3D benchmark. However, the attention maps of VLMs often fail to focus on the most relevant regions, instead dispersing attention sparsely across many areas. We therefore introduce a novel slice-wise visual–instruction prompting scheme that overlays a sub-voxel-thin, colored outline around the anatomy relevant to the question on each 2D slice of the 3D volume. Experiments on the RadGenome-ChestCT and PMC-VQA corpora show that Qwen variants (0.5B, 1.5B, and 3B) with visual prompts perform comparably to the baseline 7B Qwen model without prompts, while reducing GPU memory usage. Additionally, prompt-guided fine-tuning improves closed-ended accuracy and boosts BLEU-4, ROUGE-L, and METEOR scores for open-ended VQA.
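For intuition, below is a minimal sketch of how such slice-wise outlining could be implemented, assuming the CT volume and a binary mask of the question-relevant anatomy are already available as NumPy arrays. The helper `overlay_outline`, the HU window, and the use of `scipy.ndimage` are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of slice-wise visual prompting (illustrative assumptions only).
import numpy as np
from scipy.ndimage import binary_erosion

def overlay_outline(volume, mask, color=(255, 0, 0), hu_window=(-1000.0, 400.0)):
    """Return an RGB volume (D, H, W, 3) with a thin outline of `mask`
    drawn on every axial slice of `volume`."""
    lo, hi = hu_window
    # Window the CT intensities and convert to 8-bit grayscale RGB.
    gray = np.clip((volume - lo) / (hi - lo), 0.0, 1.0)
    rgb = np.repeat((gray * 255).astype(np.uint8)[..., None], 3, axis=-1)

    for z in range(volume.shape[0]):
        slice_mask = mask[z].astype(bool)
        if not slice_mask.any():
            continue  # no relevant anatomy on this slice
        # Outline = mask minus its erosion, i.e. a one-pixel boundary ring.
        outline = slice_mask & ~binary_erosion(slice_mask)
        rgb[z][outline] = color
    return rgb
```

The outlined RGB slices would then be fed to the VLM in place of the raw grayscale slices, so the visual prompt travels with the image rather than the text.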
External IDs: dblp:conf/miccai/KimH25