Do Clinical VLMs Need Dense Visual Tokens? Probing Spatial Grounding in Radiology Report Generation
Keywords: Clinical vision-language models, radiology report generation, chest X-ray, visual grounding, spatial visual information
TL;DR: Clinical VLMs can nearly match radiology report generation performance using only a highly compressed visual signal, suggesting that strong text-generation metrics may overstate fine-grained visual grounding.
Abstract: Clinical vision-language models (VLMs) for chest X-ray report generation are typically evaluated on generated text quality, but strong generation performance does not necessarily imply fine-grained visual grounding.
In this work, we empirically evaluate how much spatial visual information a clinical VLM uses for radiology report generation.
Using our own implementation of LLaVA-Rad, a state-of-the-art VLM for radiology report generation, we apply a simple intervention framework that selects, removes, or randomly samples visual tokens before projection into the language model.
We find that dense visual tokens are not required: compressing the full set of visual patch tokens (i.e., T=1369) into a single mean-pooled token preserves baseline performance.
Region-level interventions produce measurable but modest degradation, with the largest effects in CheXbert-based clinical metrics.
Notably, retaining only $\sim1\%$ (i.e., T=14) of the visual tokens, sampled at random, before mean-pooling nearly matches the full-token setting.
These results suggest that the model uses the image primarily as a low-dimensional visual conditioning signal rather than through fine-grained spatial grounding, raising concerns about the limited use of visual inputs by current clinical VLMs.
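The three token interventions named in the abstract (mean-pooling, random sampling, and region removal) could look roughly like the sketch below. This is an illustrative approximation, not the authors' code: the function names, the feature width `D = 8`, the random stand-in features, and the assumption that the 1369 patch tokens form a row-major 37 x 37 grid (37 x 37 = 1369) are all hypothetical.

```python
import numpy as np


def mean_pool(tokens):
    """Collapse (T, D) patch tokens into a single (1, D) token."""
    return tokens.mean(axis=0, keepdims=True)


def random_subset(tokens, k, rng):
    """Keep k patch tokens sampled uniformly without replacement."""
    idx = rng.choice(tokens.shape[0], size=k, replace=False)
    return tokens[np.sort(idx)]  # preserve the original token order


def remove_region(tokens, grid, rows, cols):
    """Drop tokens inside a rectangle of an assumed row-major grid x grid layout."""
    keep = np.ones((grid, grid), dtype=bool)
    keep[rows[0]:rows[1], cols[0]:cols[1]] = False
    return tokens[keep.reshape(-1)]


rng = np.random.default_rng(0)
T, D = 1369, 8  # T is from the abstract; D is a small placeholder width
tokens = rng.standard_normal((T, D))  # stand-in for vision-encoder outputs

# Full set of tokens compressed into one mean-pooled token.
pooled_full = mean_pool(tokens)

# 14 of 1369 tokens (about 1%) sampled at random, then mean-pooled.
pooled_sparse = mean_pool(random_subset(tokens, 14, rng))

# Region-level intervention: remove a hypothetical 10 x 10 patch block.
without_block = remove_region(tokens, 37, (0, 10), (0, 10))
```

In a setup like the one described, the resulting arrays would be passed to the projector in place of the full token sequence; the sketch stops before that step because the projection interface is not specified in the abstract.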
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 85