Enhancing pixel-level analysis in medical imaging through visual instruction tuning: introducing PLAMi

Published: 2025 · Last Modified: 27 Dec 2025 · Vis. Comput. 2025 · CC BY-SA 4.0
Abstract: In medical image analysis, accurately identifying and deeply understanding image details is essential for diagnostics and treatment planning. Despite the significant potential of multimodal large language models (MLLMs) in medical visual question answering (VQA) tasks, they still fall short of pixel-level fine-grained vision-language alignment when processing feature-rich medical images and understanding detailed regions. Here we propose PLAMi, a novel method that leverages visual instruction tuning to enhance pixel-level analysis in medical imaging. By combining BioMedCLIP as the visual encoder with a pixel-level regional feature extractor, PLAMi integrates language instructions with mask-based regional features, enabling fine-tuning of MLLMs for medical VQA tasks. Furthermore, to support the proposed method, we constructed the PixelMed-112K dataset, a meticulously compiled collection of 112,000 mask-based region-text medical image VQA samples. Experimental results demonstrate that PLAMi achieves significant improvements in regional recognition accuracy (+1% on key metrics) and description quality, underlining the method's effectiveness and the necessity of pixel-level regional understanding in medical image analysis. The code and related datasets can be found at https://github.com/MaochengBai/PLAMi.
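To make the described architecture more concrete, the sketch below illustrates one plausible way a pixel-level regional feature extractor could mask-pool patch tokens from a ViT-style visual encoder (such as BioMedCLIP) into a single region token projected into the language model's embedding space. This is an illustrative assumption, not the authors' released implementation; all class names, tensor shapes, and dimensions (e.g., 768-d visual features, 4096-d LLM embeddings) are hypothetical placeholders chosen for demonstration.

```python
# Illustrative sketch only: NOT the official PLAMi code. It shows mask pooling of
# patch-level visual features into a region token that an MLLM could consume
# alongside language instruction embeddings. Shapes and dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionalFeatureExtractor(nn.Module):
    """Mask-pools patch-level visual features into a single region token."""

    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Project pooled visual features into the (assumed) LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, vis_dim) patch tokens from the visual encoder
        # mask:        (B, H, W) binary region mask at input image resolution
        B, N, D = patch_feats.shape
        side = int(N ** 0.5)  # assume a square patch grid (e.g., 14x14 for 196 tokens)
        # Downsample the mask to the patch grid and flatten to (B, N, 1).
        m = F.adaptive_avg_pool2d(mask.unsqueeze(1).float(), side)
        m = m.flatten(2).transpose(1, 2)
        # Weighted average of patch features inside the masked region (mask pooling).
        region = (patch_feats * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-6)
        return self.proj(region)  # (B, llm_dim) region token for the language model


if __name__ == "__main__":
    # Toy usage with random tensors standing in for BioMedCLIP patch features.
    extractor = RegionalFeatureExtractor()
    feats = torch.randn(2, 196, 768)            # 14x14 patch grid from a 224x224 image
    mask = torch.zeros(2, 224, 224)
    mask[:, 60:120, 60:120] = 1.0               # hypothetical lesion mask
    region_tok = extractor(feats, mask)          # (2, 4096)
    # In an MLLM, region_tok would be inserted among instruction token embeddings.
    print(region_tok.shape)
```

In a full system, one such region token (or several, for multiple masks) would be interleaved with the tokenized language instruction before fine-tuning the MLLM; the exact fusion strategy used by PLAMi should be taken from the linked repository.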