M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models

Fan Bai; Yuxin Du; Tiejun Huang; Max Q.-H. Meng; Bo Zhao

19 Sept 2024 (modified: 15 Nov 2024); ICLR 2025 Conference Withdrawn Submission; CC BY 4.0

Keywords: Medical image analysis, 3D medical imaging, MLLM

TL;DR: This work introduces the generalist MLLM M3D-LaMed, the largest dataset M3D-Data, and the comprehensive benchmark M3D-Bench for advancing 3D medical image analysis.

Abstract: Medical image analysis is essential to many practical aspects of clinical diagnosis and treatment. However, owing to data scarcity and high training costs, previous research has largely focused on 2D medical image analysis, leaving 3D medical images under-explored despite the important spatial information they carry. This paper aims to advance 3D medical image analysis by leveraging multi-modal large language models (MLLMs). We propose M3D-LaMed, a generalist MLLM for 3D medical image analysis that covers eight important tasks, including image-text retrieval, report generation, visual question answering, positioning, and segmentation. We propose a spatial pooling perceiver that reduces the number of 3D tokens while preserving spatial information. To train the model, we construct the largest 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for 3D medical tasks. We also design a 3D multi-modal benchmark, M3D-Bench, which facilitates comprehensive evaluation of models across the eight tasks. Extensive experiments demonstrate that, as a generalist model, M3D-LaMed shows promising performance and outperforms specialist models on multiple tasks. With the proposed model, data, and benchmark, this work establishes a universal framework that significantly advances 3D medical image analysis. All data, code, and models will be publicly accessible.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 1895
