Keywords: Foundation models, Echocardiography, Spatio-temporal landmark detection, Fine-tuning
TL;DR: We investigate whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning.
Abstract: Foundation models (FMs) have shown remarkable capabilities across computer vision tasks, yet their effectiveness on complex medical downstream tasks remains underexplored. This work investigates whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning. We evaluate two recent, powerful FMs, EchoPrime and PanEcho, each pre-trained on millions of echocardiographic video-text pairs, for left-ventricular contour detection on EchoNet-Dynamic. We compare encoder regimes (frozen, partially frozen, fully trainable) and decoder heads (MLP vs.\ GCN), and benchmark against strong non-FM backbones (ResNet-18 2D/3D, ViT-Base, MViTv2-Small). Frozen encoders perform poorly and inconsistently ($\approx$78.00 Dice, ED), whereas selectively unfreezing two encoder blocks with a GCN head and augmentation yields a large jump ($91.71\pm3.49$ Dice, ED), recovering most of the gain of end-to-end training. A fully trainable EchoPrime (GCN + augmentation) achieves $93.13\pm3.11/90.95\pm3.71$ Dice (ED/ES), state-of-the-art among regression-based models on EchoNet-Dynamic. However, deploying a separate, fully fine-tuned model for each task quickly becomes impractical in resource-constrained settings. Our results suggest that partially fine-tuning the FM is a resource-efficient strategy that recovers most of the performance benefits of end-to-end training while avoiding the overhead of maintaining a separate model per task. The code is available at \href{https://github.com/preetrajb/EchoVLMLandmarks}{https://github.com/preetrajb/EchoVLMLandmarks}.
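To make the partially frozen regime concrete, below is a minimal PyTorch sketch of the general pattern: freeze the whole video encoder, unfreeze its last two blocks, and train a small landmark head on top. The `VideoEncoder` class, block count, feature dimension, and the MLP head are illustrative assumptions, not the actual EchoPrime/PanEcho architecture or the paper's GCN decoder; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a video FM encoder (EchoPrime/PanEcho are more
# complex); 12 transformer blocks over a 768-d token sequence is assumed.
class VideoEncoder(nn.Module):
    def __init__(self, dim=768, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        for blk in self.blocks:
            x = blk(x)
        return x.mean(dim=1)  # pooled clip embedding

encoder = VideoEncoder()

# Freeze everything, then selectively unfreeze the last two blocks.
for p in encoder.parameters():
    p.requires_grad = False
for blk in encoder.blocks[-2:]:
    for p in blk.parameters():
        p.requires_grad = True

# Simple MLP head regressing K 2D landmarks; the paper's GCN head would
# replace this, but an MLP keeps the sketch self-contained.
num_landmarks = 40  # assumed contour size
head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, num_landmarks * 2),
)

# Only trainable parameters (two unfrozen blocks + head) go to the optimizer.
params = [p for p in list(encoder.parameters()) + list(head.parameters())
          if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-4)

# Dummy forward pass: 2 clips, 196 tokens each -> (2, K, 2) landmark coords.
feats = encoder(torch.randn(2, 196, 768))
coords = head(feats).view(2, num_landmarks, 2)
```

The key design point this illustrates is that only a small fraction of the FM's parameters is updated, so most of the pre-trained backbone can be shared across downstream tasks.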
Git: https://github.com/preetrajb/EchoVLMLandmarks
Serve As Reviewer: ~Preetraj_Bhoodoo1
Submission Number: 67