Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

Published: 19 Aug 2025 · Last Modified: 12 Oct 2025 · IEEE BHI 2025 · CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Biomedical imaging, Gastrointestinal endoscopy, Low-Rank Adaptation (LoRA), Medical image generation, Medical visual question answering, Parameter-efficient fine-tuning (PEFT), Data augmentation, Stable Diffusion, Synthetic data generation, Transformer architectures, Vision-language models, Florence-2, Image generation, Cross-attention mechanisms, Polyp detection, Synthetic medical imaging
Abstract: AI systems for gastrointestinal (GI) endoscopy are limited by a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. These limitations impede the clinical deployment of sophisticated AI models, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline framework based on parameter-efficient fine-tuning (PEFT) that addresses two fundamental problems: medical Visual Question Answering (VQA) and privacy-preserving synthetic data generation. For clinical VQA, we adapt the Florence-2 vision-language model; PEFT improves model interpretability while substantially reducing the computational cost of training. In parallel, we apply Low-Rank Adaptation (LoRA) to Stable Diffusion 2.1 to generate high-quality GI images that augment training datasets without violating patient privacy. Experiments use the Kvasir-VQA dataset. Our Florence-2 VQA model achieves ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU improvements of 0.08–0.24; fine-tuning on private datasets consistently outperformed fine-tuning on public datasets. Rank-4 LoRA synthesis achieved the best trade-off, with a fidelity score of 0.290, an agreement score of 0.730, and a Fréchet BiomedCLIP Distance (FBD) of 1450, while reducing computational cost by almost 90%. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment: although other models lead in fidelity or agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical GI endoscopy AI.
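The parameter-efficiency claim rests on the standard LoRA mechanism: a frozen weight matrix is augmented with a trainable rank-r update. A minimal NumPy sketch of a rank-4 adapter on a single projection layer follows; the hidden size `d = 1024` and the scaling `alpha` are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

d, r = 1024, 4  # assumed hidden size of one cross-attention projection; LoRA rank from the paper
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized
alpha = 4.0                             # assumed LoRA scaling factor

def forward(x):
    # LoRA forward pass: frozen path plus scaled low-rank update B @ A
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
# With B = 0 the adapter is inactive, so the output matches the base model at init
assert np.allclose(forward(x), x @ W.T)

full_params = d * d       # parameters of a full fine-tune of this layer
lora_params = 2 * d * r   # parameters of the rank-4 adapter
print(f"trainable fraction per layer: {lora_params / full_params:.4%}")  # ~0.78%
```

The per-layer trainable fraction is well under 1%, which is how restricting updates to low-rank adapters on attention projections yields the roughly 90% overall training-cost reduction the abstract reports.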
Track: 3. Imaging Informatics
Registration Id: L3NNNCL2B72
Submission Number: 289