QwenVLConnector: A Fast, Unified Medical VLM Chatbot for Fine-Grained Clinical Perception and Text Generation

Le Thien Phuc Nguyen; Thien Nguyen; Thanh-Huy Nguyen; Gia Minh Hoang; Anh Mai Vu; Ulas Bagci

28 Aug 2025 (modified: 16 Sept 2025) | MICCAI 2025 Challenge FLARE Submission | Readers: Everyone | License: CC BY 4.0

Keywords: Multimodal large language model, Medical VLM, Dense connector, FLARE 2D

TL;DR: A fast and efficient VLM chatbot for fine-grained medical image tasks.

Abstract: Most medical vision–language models (VLMs) excel at open-ended report generation and VQA but lack native support for structured, fine-grained perception (detection, counting, regression) within one interface. We present QwenVLConnector, a Qwen2.5-VL–based chatbot that unifies classification, multi-label classification, textualized detection, counting, regression, and free-form report generation under a single next-token objective. It does so through a lightweight dense multi-layer Connector that fuses multi-scale visual features without increasing sequence length. On FLARE-2D, QwenVLConnector improves detection F1 from 0.55 to 0.85 (+0.30), raises single-label classification from 0.37 to 0.51 (+0.15), and boosts report-generation GREEN by up to 18.3 points. Our code is available at https://github.com/plnguyen2908/QwenConnector.
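To illustrate how a dense multi-layer connector can fuse features from several encoder depths without lengthening the token sequence, here is a minimal sketch in plain Python. It assumes channel-wise concatenation followed by a single linear projection, which is one common way to build such a connector; the abstract does not specify the fusion operator, so this may differ from the authors' implementation. The function name `dense_connect`, the toy dimensions, and the random weights are all hypothetical.

```python
import random


def dense_connect(layer_feats, weight, bias):
    """Fuse per-layer visual features along the channel axis, then project.

    layer_feats: list of K layers; each layer is a list of N tokens;
                 each token is a list of D floats.
    weight:      (K * D) x D_out matrix as nested lists (assumed linear projection).
    bias:        list of D_out floats.
    Returns N tokens of D_out floats, so the token count N is unchanged.
    """
    num_tokens = len(layer_feats[0])
    # Every selected layer must describe the same N visual tokens.
    assert all(len(layer) == num_tokens for layer in layer_feats)

    fused = []
    for t in range(num_tokens):
        # Channel-wise concatenation: one wider token, not more tokens.
        concat = [x for layer in layer_feats for x in layer[t]]
        assert len(concat) == len(weight)
        # Linear projection back to the language model's embedding width.
        out = [
            sum(c * weight[i][j] for i, c in enumerate(concat)) + bias[j]
            for j in range(len(bias))
        ]
        fused.append(out)
    return fused


# Toy setup (hypothetical sizes): 3 encoder layers, 4 tokens, width 2 -> 5.
rng = random.Random(0)
K, N, D, D_OUT = 3, 4, 2, 5
feats = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(K)]
W = [[rng.gauss(0, 0.1) for _ in range(D_OUT)] for _ in range(K * D)]
b = [0.0] * D_OUT

tokens = dense_connect(feats, W, b)
print(len(tokens), len(tokens[0]))  # prints "4 5": still 4 tokens, now width 5
```

Because the layers are stacked along the channel dimension rather than appended as extra tokens, the language model sees the same number of visual tokens regardless of how many encoder layers contribute. Only the projection's input width grows.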

Submission Number: 3
