Abstract: This study presents Medical Vision Generalist (MVG), a vision-centric generalist model capable of handling various medical imaging tasks---such as cross-modal synthesis, image segmentation, denoising, and inpainting---within a single unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of diverse tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all evaluated medical imaging tasks. To rigorously evaluate MVG's capabilities, we curated the first comprehensive generalist medical vision benchmark, comprising 13 datasets with over 2 million training images spanning four imaging modalities (CT, MRI, X-ray, and micro-ultrasound). Our results consistently establish MVG's superior performance: it outperforms existing vision generalists, such as Painter and LVM, and in some cases matches the efficacy of specialized task-specific models. Furthermore, MVG exhibits strong scalability, with its performance demonstrably improving when trained on a more diverse set of tasks, and it can be effectively adapted to unseen datasets with only minimal task-specific samples. Code will be made available.
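To make the in-context generation strategy concrete, the following is a minimal sketch of how a prompt image-label pair and a query image might be standardized into a single canvas for a unified image-to-image model, with the model asked to fill in the query's output region. The function name, the quadrant layout, and the masking scheme are illustrative assumptions for exposition, not MVG's actual implementation.

```python
import numpy as np

def build_in_context_input(prompt_image, prompt_label, query_image):
    """Assemble one in-context example as a 2x2 image canvas.

    Hypothetical layout (assumed for illustration):
      top-left:     prompt input image (defines the task)
      top-right:    prompt output/label image
      bottom-left:  query input image
      bottom-right: region the model must generate

    All three inputs are assumed to be same-shape 2D grayscale arrays.
    Returns the assembled canvas and a boolean mask marking the region
    to be generated.
    """
    h, w = query_image.shape
    canvas = np.zeros((2 * h, 2 * w), dtype=query_image.dtype)
    canvas[:h, :w] = prompt_image    # task demonstration: input
    canvas[:h, w:] = prompt_label    # task demonstration: output
    canvas[h:, :w] = query_image     # new input to be processed
    mask = np.zeros(canvas.shape, dtype=bool)
    mask[h:, w:] = True              # target region for generation
    return canvas, mask
```

Under this framing, switching tasks (e.g., from segmentation to denoising) only requires swapping the prompt pair; the model and the input/output format stay unchanged, which is what permits unifying tasks across modalities and datasets.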
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lei_Wang13
Submission Number: 4137