Keywords: Vision-Language Agent, Medical Image Analysis, Neuroimaging
Abstract: We present VoxelPrompt, an end-to-end vision system that tackles composite radiological tasks. Given a user prompt, VoxelPrompt uses a language model to generate executable code that invokes a novel, jointly trained vision network. This adaptable network can integrate any number of volumetric (3D) inputs across heterogeneous real-world clinical modalities to segment and characterize diverse anatomy and pathology. The predicted code employs this network to carry out analytical steps that automate practical quantitative pipelines, such as measuring the growth of a tumor across visits, which often require practitioners to painstakingly combine multiple specialized but brittle tools. We evaluate VoxelPrompt on diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt accomplishes these objectives with an accuracy similar to that of specialist single-task models for image analysis, while facilitating a broad range of biomedical workflows.
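To make the tumor-growth example concrete, the following is a minimal sketch of the arithmetic such a pipeline ends with: converting two binary segmentation masks into volumes and reporting the relative change between visits. It is not VoxelPrompt's generated code or API. The function names `mask_volume_mm3` and `percent_growth`, the synthetic masks, and the voxel spacing are all assumptions introduced for illustration; in practice the masks would come from a segmentation model and the spacing from the image header.

```python
import numpy as np


def mask_volume_mm3(mask, spacing_mm):
    """Volume of a binary 3D mask: voxel count times the volume of one voxel."""
    voxel_volume = float(np.prod(spacing_mm))
    return float(np.count_nonzero(mask)) * voxel_volume


def percent_growth(volume_before, volume_after):
    """Relative volume change between two visits, in percent."""
    if volume_before == 0:
        raise ValueError("baseline volume is zero; relative growth is undefined")
    return 100.0 * (volume_after - volume_before) / volume_before


# Hypothetical stand-ins for the segmentations of one lesion at two visits.
visit1 = np.zeros((10, 10, 10), dtype=bool)
visit1[2:6, 2:6, 2:6] = True  # 4 * 4 * 4 = 64 voxels

visit2 = np.zeros((10, 10, 10), dtype=bool)
visit2[2:7, 2:7, 2:6] = True  # 5 * 5 * 4 = 100 voxels

# Assumed voxel spacing in millimetres (one voxel = 2 mm^3).
spacing = (1.0, 1.0, 2.0)

v1 = mask_volume_mm3(visit1, spacing)
v2 = mask_volume_mm3(visit2, spacing)
growth = percent_growth(v1, v2)

print(f"visit 1: {v1:.1f} mm^3, visit 2: {v2:.1f} mm^3, growth: {growth:.2f}%")
```

The sketch assumes both masks share the same voxel spacing; scans acquired at different resolutions would each need their own spacing when computing volume.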
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6277