HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Abstract: We present **HealthGPT**, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained Large Language Models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation **(H-LoRA)** technique, complemented by a tailored hierarchical visual perception **(HVP)** approach and a three-stage learning strategy **(TLS)**. To effectively train HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called **VL-Health**. Experimental results demonstrate HealthGPT's exceptional performance and scalability on unified medical visual tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
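The abstract describes H-LoRA as storing heterogeneous comprehension and generation knowledge in adapters attached to a pre-trained LLM. As a rough illustration only (the paper's exact adapter design and routing may differ), below is a minimal PyTorch sketch of a linear layer with separate low-rank adapters per task type; `HLoRALinear`, the task names, and the hyperparameters are all hypothetical:

```python
import torch
import torch.nn as nn

class HLoRALinear(nn.Module):
    """Illustrative sketch of a heterogeneous LoRA layer: a frozen base
    linear layer plus separate low-rank adapters for comprehension and
    generation, selected by a task flag. This is an assumption-laden
    sketch, not the paper's actual implementation."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0,
                 tasks=("comprehension", "generation")):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained LLM weights frozen
        self.scaling = alpha / rank
        # One (A, B) low-rank pair per task, kept in separate parameter banks
        # so the two skill sets do not overwrite each other during training.
        self.lora_A = nn.ParameterDict({
            t: nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            for t in tasks})
        self.lora_B = nn.ParameterDict({
            t: nn.Parameter(torch.zeros(base.out_features, rank))
            for t in tasks})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        # Frozen base output plus the task-specific low-rank update B(Ax).
        delta = (x @ self.lora_A[task].T) @ self.lora_B[task].T
        return self.base(x) + self.scaling * delta
```

Under this sketch, only the small per-task A/B matrices receive gradients, so adapting the model to generation leaves its comprehension adapters untouched.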
Lay Summary: In general-domain applications, AI models have been developed to both understand and generate different types of data, such as recognizing objects in images or generating pictures from text descriptions. However, such unified models remain largely unexplored in medical imaging, in large part because of the scarcity of large-scale, high-quality medical data and the distinct requirements of understanding and generation tasks in healthcare.
To address this, we introduce **HealthGPT**, the first large vision-language model designed to unify both understanding and generation tasks in the medical domain. We use a quantization technique to represent medical images as sequences of "tokens" similar to text, allowing the model to process and generate both images and text in a consistent way. We also propose a novel model architecture that stores task-specific knowledge separately and combines it using a multi-stage training process.
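The lay summary mentions a quantization step that represents medical images as sequences of discrete tokens, similar to text. Below is a minimal sketch of VQ-style nearest-neighbor tokenization under assumed shapes; `quantize_features`, the codebook size, and the feature dimensions are illustrative, not the paper's actual tokenizer:

```python
import torch

def quantize_features(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous image features to discrete token ids via the nearest
    codebook entry (VQ-style). Assumed shapes: feats (N, D) patch features,
    codebook (K, D) learned entries; returns (N,) integer token ids."""
    # Squared L2 distance between every feature and every codebook vector,
    # expanded as ||f||^2 - 2 f.c + ||c||^2 to avoid materializing pairs.
    d = (feats.pow(2).sum(-1, keepdim=True)
         - 2 * feats @ codebook.T
         + codebook.pow(2).sum(-1))
    return d.argmin(dim=-1)  # token ids, usable in the LLM like text tokens

# Example: 196 patch embeddings of dim 64, codebook with 8192 entries.
ids = quantize_features(torch.randn(196, 64), torch.randn(8192, 64))
```

Once images are token ids, a single autoregressive model can both consume them (comprehension) and predict them (generation) with the same next-token objective it uses for text.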
To support this approach, we built a new dataset called **VL-Health**, which includes a range of medical image tasks covering both understanding and generation. Our experiments show that HealthGPT achieves strong performance on both types of tasks. We open-source our dataset and model weights to encourage further research on unified AI models in healthcare.
Link To Code: https://github.com/DCDmllm/HealthGPT
Primary Area: Applications->Health / Medicine
Keywords: Medical Large Vision-Language Models; Multi-Modal Comprehension and Generation
Submission Number: 2225