Dolphin: A Multimodal Large Language Model for Ultrasound Understanding

ICLR 2026 Conference Submission 6696 Authors

16 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Ultrasound; Large multimodal models; Mixed-domain reasoning; Emergent reasoning
TL;DR: We introduce the Dolphin series, the first general-purpose ultrasound multimodal foundation models. Dolphin achieves SOTA on U2-Bench and enables ultrasound reasoning through mixed-domain training, offering a new paradigm for medical multimodal models.
Abstract: Ultrasound is one of the most widely used imaging modalities in clinical practice. Unlike CT and MRI, ultrasound imaging is highly operator dependent and varies substantially across anatomical regions, so the field has a pressing need for models with a comprehensive understanding of general ultrasound imaging. To address this, we introduce Dolphin, the first multimodal large language model for ultrasound understanding, released in chat and reasoning versions. We curate a training dataset of over 2,000,000 instruction–response pairs that integrates domain-specific ultrasound knowledge from textbooks, clinical guidelines, and public ultrasound datasets, along with synthetic samples and general-domain corpora. We also establish the Dolphin Ultrasound Data Protocol to standardize heterogeneous ultrasound data, ensuring consistency, interoperability, and quality across the dataset. While most multimodal medical models do not emphasize medical reasoning, ultrasound understanding demands strong comprehension and reasoning abilities because of the inherent complexity and variability of ultrasound data. Real-world ultrasound reasoning data, however, is scarce and difficult to collect, which limits the development of models with advanced understanding. To address this, we propose a three-stage training strategy that uses easily accessible ultrasound question-answering data and synthetic deep-reasoning general-domain data, combining post-training, instruction tuning, and reinforcement learning with Ultrasound Answer Reward Preference Optimization (UARPO) to progressively improve reasoning. On the U2-Bench benchmark, spanning eight clinical ultrasound tasks, Dolphin establishes a new state of the art with a U2-score of 0.5835. Our experiments show that, with reasoning data from other domains, Dolphin exhibits robust reasoning capabilities on complex ultrasound tasks. Moreover, Dolphin achieves higher diagnostic accuracy in deep-reasoning mode than in standard mode, indicating that generalized reasoning skills transfer effectively to specialized medical domains.
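To make the standardization idea concrete, here is a minimal sketch of what a record under the Dolphin Ultrasound Data Protocol might look like. The abstract does not specify the schema, so every field name below (image_path, anatomy, source, instruction, response, metadata) is an illustrative assumption, not the paper's actual format.

```python
# Hypothetical record layout for one standardized ultrasound sample.
# Field names are illustrative assumptions, not the protocol's real schema.
from dataclasses import dataclass, field


@dataclass
class UltrasoundRecord:
    image_path: str    # path to the B-mode frame or cine clip
    anatomy: str       # e.g. "thyroid", "breast", "cardiac"
    source: str        # textbook / guideline / public dataset / synthetic
    instruction: str   # instruction side of the pair
    response: str      # reference response
    metadata: dict = field(default_factory=dict)  # scanner, view, license, etc.

    def validate(self) -> None:
        """Reject records that would break consistency across the corpus."""
        if not self.instruction.strip() or not self.response.strip():
            raise ValueError("instruction-response pair must be non-empty")


record = UltrasoundRecord(
    image_path="data/thyroid/000123.png",
    anatomy="thyroid",
    source="public_dataset",
    instruction="Describe the echogenicity of the nodule.",
    response="The nodule is markedly hypoechoic relative to the parenchyma.",
)
record.validate()
```

Whatever the actual schema, pinning every sample from textbooks, guidelines, public datasets, and synthetic generation to one validated record type is what makes the claimed consistency and interoperability checkable at ingestion time.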
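The abstract names the reinforcement-learning stage but does not define UARPO's objective. The sketch below assumes a group-relative policy-gradient update driven by a binary answer-correctness reward; the function name and reward scheme are hypothetical, not the paper's method.

```python
# A minimal sketch of an answer-reward preference update, assuming a
# group-relative policy-gradient formulation. Not the paper's UARPO.
import torch


def answer_reward_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Reinforce responses whose answer reward beats the group average.

    logprobs: (G,) summed token log-probs of G sampled responses to one prompt
    rewards:  (G,) scalar answer rewards, e.g. 1.0 if the extracted answer
              matches the reference diagnosis, else 0.0
    """
    advantages = rewards - rewards.mean()
    if rewards.std() > 0:
        advantages = advantages / rewards.std()
    # Policy-gradient surrogate: advantages are treated as constants.
    return -(advantages.detach() * logprobs).mean()


# Toy usage: 4 sampled responses, 2 of which answered correctly.
logprobs = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
rewards = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = answer_reward_loss(logprobs, rewards)
loss.backward()
```

A binary reward on the final extracted answer is the simplest reading of "answer reward"; a full implementation would likely add a KL penalty against the instruction-tuned reference policy to keep updates stable.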
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6696