Keywords: Radiology report generation, vision-language models, chest X-ray, supervised fine-tuning, reinforcement learning, visual token compression.
Abstract: We investigate whether vision-language models designed for document understanding can be repurposed for radiology report generation. Surprisingly, DeepSeek-OCR, an optical character recognition-centric model with no medical pretraining, achieves state-of-the-art performance on chest X-ray report generation (GREEN = 0.846) after supervised fine-tuning, outperforming medical-domain models. We attribute this to aggressive visual token compression, which proves effective for encoding radiographic detail. Component analysis reveals that location accuracy and entity matching are the main bottlenecks in zero-shot models, with DeepSeek-OCR showing +262% and +230% improvements, respectively, after fine-tuning. We further show that reinforcement learning with RadGraph-based clinical rewards yields gains beyond supervised fine-tuning saturation, improving entity matching by 6% on Qwen3-VL-4B. Our results suggest that document-understanding architectures offer an underexplored pathway for medical image interpretation.
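For readers unfamiliar with RadGraph-based rewards, the sketch below illustrates the general idea: score a generated report by entity-level F1 against the reference and use that scalar as the RL reward. This is a minimal illustration, not the authors' implementation; the tiny keyword lexicon and the names `extract_entities` and `radgraph_style_reward` are hypothetical stand-ins, since in practice the (entity, label) tuples would come from the actual RadGraph extraction model.

```python
# Minimal sketch of a RadGraph-style clinical reward (entity-matching F1).
# Hypothetical stand-ins: TOY_LEXICON and extract_entities replace the real
# RadGraph extraction model so the example is self-contained and runnable.

from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity text, semantic label)

# Toy lexicon for demonstration only; a real system runs RadGraph extraction.
TOY_LEXICON = {
    "effusion": "observation",
    "pneumothorax": "observation",
    "cardiomegaly": "observation",
    "lobe": "anatomy",
    "pleural": "anatomy",
}

def extract_entities(report: str) -> Set[Entity]:
    """Stand-in for RadGraph: map known terms to (text, label) tuples."""
    tokens = report.lower().replace(".", " ").split()
    return {(t, TOY_LEXICON[t]) for t in tokens if t in TOY_LEXICON}

def radgraph_style_reward(generated: str, reference: str) -> float:
    """Entity-level F1 between generated and reference reports, usable as
    a scalar reward during RL fine-tuning of a report-generation model."""
    gen, ref = extract_entities(generated), extract_entities(reference)
    if not gen or not ref:
        return 0.0
    tp = len(gen & ref)            # entities matched in both reports
    if tp == 0:
        return 0.0
    precision = tp / len(gen)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    ref = "Small left pleural effusion. No pneumothorax."
    hyp = "There is a small pleural effusion."
    print(f"reward = {radgraph_style_reward(hyp, ref):.3f}")  # reward = 0.800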
Primary Subject Area: Foundation Models
Secondary Subject Area: Application: Radiology
Registration Requirement: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 88