Keywords: Medical image parsing, Parameter-efficient fine-tuning, Vision-language model, Medical visual question answering
Abstract: Medical image parsing presents a unique challenge due to the diversity of imaging modalities and the wide range of diagnostic tasks required in clinical workflows, including classification, detection, and report generation. Traditional approaches often rely on task-specific models, which limit both scalability and generalization. Recent advances in vision-language models (VLMs) offer promising avenues for unifying these tasks; however, many existing solutions suffer from high computational costs and limited adaptability. In this work, we propose ME-VLIP, a modular and efficient framework built upon InternVL3-8B, fine-tuned using quantized low-rank adaptation (QLoRA) and guided by a zero-shot task classification module. Our system demonstrates robust performance across seven tasks spanning eight imaging modalities. We evaluate our approach on the FLARE 2025 Task 5 benchmark, showing substantial performance gains over the base model, with the following task-specific results: classification (0.74 balanced accuracy), multi-label classification (0.57 F1 score), detection (0.82 F1 score), cell counting (251.6 MAE), regression (11.84 MAE), and report generation (0.71 GREEN score). Comparative analysis indicates that our method outperforms other state-of-the-art VLMs, underscoring the effectiveness of parameter-efficient domain adaptation for versatile medical image parsing.
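The abstract names the core adaptation recipe: 4-bit quantization of a frozen InternVL3-8B backbone combined with low-rank adapters (QLoRA). The snippet below is a minimal sketch of such a setup, assuming the HuggingFace `transformers`, `peft`, and `bitsandbytes` libraries; the repo id `OpenGVLab/InternVL3-8B` refers to the public checkpoint, while the LoRA rank, alpha, dropout, and target module names are illustrative assumptions, not the authors' actual configuration.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config: the "quantized" part of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model (InternVL requires trust_remote_code).
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3-8B",
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Attach low-rank adapters; only these small matrices are trained.
# Rank/alpha/target modules here are assumptions for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapter weights are a tiny fraction of 8B
```

The design rationale matches the abstract's efficiency claim: the quantized backbone keeps memory cost low, and restricting gradients to the adapters makes domain adaptation parameter-efficient.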
Submission Number: 6