QwenVLConnector: A Fast, Unified Medical VLM Chatbot for Fine-Grained Clinical Perception and Text Generation

28 Aug 2025 (modified: 16 Sept 2025) | MICCAI 2025 Challenge FLARE Submission | CC BY 4.0
Keywords: Multimodal large language model, Medical VLM, Dense connector, FLARE 2D
TL;DR: A fast and efficient VLM chatbot for fine-grained medical image tasks.
Abstract: Most medical vision–language models (VLMs) excel at open-ended report generation and VQA but lack native support for structured, fine-grained perception (detection, counting, regression) within one interface. We present QwenVLConnector, a Qwen2.5-VL–based chatbot that unifies classification, multi-label classification, textualized detection, counting, regression, and free-form report generation under a single next-token objective. It does so via a lightweight dense multi-layer Connector that fuses multi-scale visual features without increasing sequence length. On FLARE-2D, QwenVLConnector improves detection F1 from 0.55 to 0.85 (+0.30), raises single-label classification from 0.37 to 0.51 (+0.15), and boosts report-generation GREEN by up to 18.3 points. Our code is available at https://github.com/plnguyen2908/QwenConnector.
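One way multi-layer visual features can be fused without lengthening the token sequence is to concatenate them along the channel axis and project the result back to a fixed width. The sketch below illustrates that idea only. It is not taken from the QwenVLConnector repository, and the function name `fuse_layers`, the toy shapes, and the single linear projection are all assumptions made for illustration.

```python
import random


def fuse_layers(layer_feats, weight):
    """Fuse per-token features from several encoder layers (hypothetical helper).

    layer_feats: list of layers, each a list of tokens, each a list of floats.
    weight: projection matrix with shape (out_dim, num_layers * dim).
    Returns one fused vector per token, so the token count is unchanged.
    """
    num_tokens = len(layer_feats[0])
    assert all(len(feats) == num_tokens for feats in layer_feats)
    fused = []
    for t in range(num_tokens):
        # Channel-wise concatenation: only the feature width grows.
        concat = [x for feats in layer_feats for x in feats[t]]
        # Linear projection back to the target width.
        fused.append([sum(w * x for w, x in zip(row, concat)) for row in weight])
    return fused


# Toy sizes, chosen arbitrarily for the example.
random.seed(0)
num_tokens, dim, num_layers, out_dim = 4, 3, 3, 5
layer_feats = [
    [[random.random() for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]
weight = [
    [random.gauss(0.0, 0.1) for _ in range(dim * num_layers)]
    for _ in range(out_dim)
]
tokens = fuse_layers(layer_feats, weight)
print(len(tokens), len(tokens[0]))  # 4 5
```

The output keeps four tokens regardless of how many layers are fused, which is the property the abstract describes. Fusing along the sequence axis instead would multiply the token count by the number of layers.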
Submission Number: 3