Video Foundation Model for Medical 3D Segmentation

Published: 01 Jan 2024, Last Modified: 12 Nov 2025 · ToothFairy/3DTeethLand/STS@MICCAI 2024 · CC BY-SA 4.0
Abstract: Video foundation models are gaining significant attention for their ability to deliver highly accurate and efficient performance when pretrained on large-scale natural video datasets, especially compared to models trained from scratch. These pretrained models extract rich features, enabling faster fine-tuning and more accurate downstream tasks. In the medical field, a growing body of research has shown that treating 3D volumes as video sequences is effective. However, models trained only on limited 3D medical data learn unsatisfactory features due to insufficient training, leading to performance loss. We therefore adapt video foundation models pretrained on large-scale RGB videos to segment 3D CT volumes, leveraging their ability to extract high-quality features. Our method is evaluated on the ToothFairy2 and AMOS datasets, where it outperforms other transformer-based methods, bridging the gap between video foundation models and 3D medical segmentation. Our results demonstrate that knowledge from video foundation models trained on large-scale RGB datasets can be effectively transferred to medical segmentation tasks, achieving strong performance even without pretraining on any medical dataset. Additionally, we conduct extensive experiments exploring design choices for the encoder, decoder, and domain adaptation mechanisms, offering comprehensive insights into adapting video foundation models for 3D medical segmentation.
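The core idea of treating a 3D CT volume as a video sequence can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing pipeline: the Hounsfield-unit windowing values and the channel-replication step are assumptions chosen to show how single-channel axial slices might be mapped to the 3-channel "frames" an RGB-video encoder expects.

```python
import numpy as np

def ct_volume_to_video(volume, window_center=40.0, window_width=400.0):
    """Map a 3D CT volume (D, H, W) in Hounsfield units to a pseudo-RGB
    video clip (D, 3, H, W), treating each axial slice as a frame.

    The soft-tissue window (center=40, width=400) and the grayscale-to-RGB
    replication are illustrative choices, not the paper's specified values.
    """
    lo = window_center - window_width / 2.0
    hi = window_center + window_width / 2.0
    clipped = np.clip(volume.astype(np.float32), lo, hi)
    normed = (clipped - lo) / (hi - lo)                   # scale to [0, 1]
    # Replicate the single intensity channel three times so the slice
    # stack matches the (frames, channels, H, W) layout of an RGB clip.
    return np.repeat(normed[:, None, :, :], 3, axis=1)

# Example: a synthetic 8-slice volume becomes an 8-frame "video"
vol = np.random.randint(-1000, 1000, size=(8, 64, 64)).astype(np.float32)
clip = ct_volume_to_video(vol)
print(clip.shape)  # (8, 3, 64, 64)
```

The resulting clip can then be fed to a video encoder pretrained on natural RGB footage, which is the transfer setting the abstract describes.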