Abstract: Foundation models provide general-purpose image representations that are intended to capture structural and semantic information. However, their suitability for measuring similarity across image sequences has not been thoroughly examined. Conventional metrics such as the structural similarity index measure (SSIM) are commonly used to assess frame-to-frame consistency, but they are sensitive to motion, deformation, and intensity changes, which limits their usefulness for dynamic imaging. In this study, we compare embeddings from several pretrained models, including DINOv2, ResNet50, CLIP, SAM, and LPIPS, to evaluate how well they represent temporal and structural similarity in videos and how well they assess the quality of image registration. We analyze sensitivity to global and local motion in two medical imaging datasets: videofluoroscopic swallowing studies (VFSS), which exhibit both global and local motion, and the BAGLS dataset of vocal fold vibrations, which exhibits mainly local motion. Our results indicate that the models differ in how consistently they maintain similarity under motion, and suggest that some embedding-based approaches provide a more stable representation than SSIM without additional fine-tuning.
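As an illustration of the kind of frame-to-frame comparison described above, the following minimal sketch scores consecutive frames by the cosine similarity of their embeddings. It assumes that each frame has already been mapped to a fixed-length vector by some pretrained model; the function names `cosine_similarity` and `consecutive_similarity` are hypothetical, and cosine similarity is an assumed choice of score, not necessarily the one used in the study.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def consecutive_similarity(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of each frame embedding to the next one in the sequence.

    Assumption: `embeddings` holds one precomputed vector per video frame,
    in temporal order. Returns len(embeddings) - 1 scores.
    """
    return [
        cosine_similarity(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]


# Toy vectors standing in for real frame embeddings (illustrative only).
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = consecutive_similarity(toy_embeddings)
```

In this toy sequence the first two vectors are identical and the third is orthogonal to them, so the first score is 1 and the second is 0. With real embeddings, a curve of such scores over time could be compared against a per-frame SSIM curve to see which is more stable under motion.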
External IDs: dblp:conf/bildmed/NeubigLIKK26