Keywords: Boundary
Abstract: Interpreting volumetric CT with vision–language models (VLMs) demands aligning long-range spatial–temporal evidence with radiology text under tight memory budgets. In this setting, Med3DVLM, a 3D vision encoder coupled to a 7B decoder, reports 79.95% closed-ended accuracy and 36.76 METEOR on M3D. Yet the attention of contemporary VLMs often diffuses, lighting up many non-diagnostic regions instead of the truly salient ones. We propose slice-wise visual-instruction prompting: on every axial slice of the 3D volume, a sub-voxel-thin colored contour traces the anatomy referenced by the question, turning the image itself into a focus cue. On RadGenome-ChestCT and PMC-VQA, Qwen variants (0.5B/1.5B/3B) with these prompts perform on par with a prompt-free Qwen-7B while cutting GPU memory. Moreover, prompt-guided fine-tuning further lifts closed-ended accuracy and improves open-ended VQA as measured by BLEU-4, ROUGE-L, and METEOR.
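The abstract does not specify how the contour prompt is rendered; below is a minimal sketch of one plausible implementation, assuming the referenced anatomy is available as a binary 3D mask aligned with the volume. The function name, parameters, and the one-pixel boundary approximation (standing in for the paper's sub-voxel-thin contour) are illustrative assumptions, not the authors' code.

```python
# Sketch of slice-wise visual-instruction prompting (assumed implementation).
import numpy as np
from scipy import ndimage


def overlay_contour_prompts(volume, mask, color=(255, 0, 0)):
    """Draw a thin colored contour of `mask` on every axial slice of `volume`.

    volume: float array (D, H, W), CT intensities normalized to [0, 1].
    mask:   bool  array (D, H, W), True inside the question-referenced anatomy.
    Returns uint8 RGB slices of shape (D, H, W, 3).
    """
    # Replicate the grayscale slice into three channels so the cue can be colored.
    rgb = np.repeat((volume * 255).astype(np.uint8)[..., None], 3, axis=-1)
    for z in range(volume.shape[0]):
        m = mask[z]
        if not m.any():
            continue  # this slice does not intersect the referenced anatomy
        # One-pixel boundary: the mask minus its morphological erosion.
        contour = m & ~ndimage.binary_erosion(m)
        rgb[z][contour] = color  # paint the thin contour as the focus cue
    return rgb
```

The prompted RGB slices would then be fed to the VLM in place of the raw slices, so the focus cue travels with the image rather than the text.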
Submission Number: 41