Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Biomedical imaging, Gastrointestinal endoscopy, Low-Rank Adaptation (LoRA), Medical image generation, Medical visual question answering, Parameter-efficient fine-tuning (PEFT), Data augmentation, Stable Diffusion, Synthetic data generation, Transformer architectures, Vision-language models, Florence-2, Cross-attention mechanisms, Polyp detection, Synthetic medical imaging
Abstract: AI systems for gastrointestinal (GI) endoscopy are limited by a shortage of annotated data, strict privacy regulations, and the significant computational bottlenecks of conventional model fine-tuning. These limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline, parameter-efficient fine-tuning (PEFT) framework that addresses two fundamental problems: medical Visual Question Answering (VQA) and privacy-preserving synthetic data generation. For clinical VQA, we adapt the Florence-2 vision-language model; leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. In parallel, we apply Low-Rank Adaptation (LoRA) to Stable Diffusion 2.1 to generate high-quality GI images that augment training datasets without violating patient privacy. We evaluate on the Kvasir-VQA dataset. Our Florence-2 VQA model achieved a ROUGE-1 of 0.92, a ROUGE-L of 0.91, and BLEU-score improvements of 0.08–0.24; fine-tuning on private datasets consistently outperformed fine-tuning on public datasets. Rank-4 LoRA synthesis achieved the best overall performance, with a fidelity score of 0.290, an agreement score of 0.730, and a Fréchet BiomedCLIP Distance (FBD) of 1450, while reducing computational costs by almost 90%. This framework improves the clinical potential of AI in GI endoscopy. Compared with FLUX, MSDM, and Kandinsky 2.2, our model demonstrates a superior FBD and strong semantic alignment; although other models lead on fidelity or agreement individually, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.
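The rank-4 LoRA idea underlying the image-generation pipeline can be sketched in a few lines. This is an illustrative numpy sketch of the low-rank update W_eff = W + (alpha/r)·B·A and the resulting parameter savings; the dimensions and initialization are assumptions for illustration, not the paper's Stable Diffusion 2.1 configuration.

```python
import numpy as np

# Illustrative LoRA update: W_eff = W + (alpha / r) * B @ A
# d = 768 is an assumed attention-projection width, typical of
# transformer blocks; the paper's actual layer sizes may differ.
d, r, alpha = 768, 4, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero init: W_eff starts equal to W

x = rng.standard_normal(d)
y = W @ x + (alpha / r) * (B @ (A @ x))  # LoRA forward pass

full_params = d * d                      # full fine-tuning of this layer
lora_params = 2 * d * r                  # LoRA rank 4: only A and B train
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

Because only A and B are updated, the trainable parameter count per adapted layer drops to roughly 1% of full fine-tuning in this sketch, which is the mechanism behind the near-90% cost reduction reported above.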
Track: 3. Imaging Informatics
Registration Id: L3NNNCL2B72
Submission Number: 289