Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Huu Tien Nguyen; Dac Thai Nguyen; Duc Nguyen The Minh; Trung Thanh Nguyen; Thao Nguyen Truong; Hieu Pham; Johan Barthelemy; Tran Minh Quan; Quoc Viet Hung Nguyen; Thanh Tam Nguyen; Mai Hong Son; Chau Quynh Anh; Thanh Trung Nguyen; Phi Le Nguyen

Published: 18 Sept 2025, Last Modified: 23 Apr 2026

Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)

License: CC BY-NC-SA 4.0

Keywords: Benchmarking, Vision-Language Pretraining, Medical Vision-Language Models, PET/CT, Report Generation

Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, which limits their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in the training corpora of existing VLMs, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. Despite these advancements, the models still underperform on clinically critical criteria, particularly the diagnosis of lung cancer, indicating substantial room for future improvement. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and in improving their clinical relevance in Vietnamese healthcare.

Croissant File: zip

Dataset URL: https://huggingface.co/datasets/dacthai2807/ViMed-PET

Code URL: https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.git

Supplementary Material: zip

Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)

Flagged For Ethics Review: true

Submission Number: 902
