Characterization of Foundation Models for Longitudinal Similarity Measurement in Medical Video Data

Luisa Neubig, Deirdre Larsen, Takeshi Ikuma, Melda Kunduk, Andreas M. Kist

Published: 2026, Last Modified: 05 May 2026, Venue: Bildverarbeitung für die Medizin 2026, License: CC BY-SA 4.0
Abstract: Foundation models provide general-purpose image representations that promise to capture structural and semantic information. However, their suitability for measuring similarity across image sequences has not been thoroughly examined. Conventional metrics such as the structural similarity index measure (SSIM) are commonly used to assess frame-to-frame consistency but are sensitive to motion, deformation, and intensity changes, which limits their usefulness for dynamic imaging. In this study, we compare embeddings from a variety of pretrained models, including DINOv2, ResNet50, CLIP, SAM, and LPIPS, to evaluate how well they represent temporal and structural similarity in videos and how well they assess the quality of image registration. We analyze sensitivity to global and local motion in two medical imaging datasets: videofluoroscopic swallowing studies (VFSS), which exhibit both global and local motion, and the BAGLS dataset of vocal fold vibrations, which exhibits mainly local motion. Our results indicate that the models differ in how consistently they maintain similarity under motion, and suggest that some embedding-based approaches provide a more stable representation than SSIM without additional fine-tuning.
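The contrast drawn in the abstract between a pixel-level metric and an embedding-based similarity can be illustrated with a minimal sketch. It is not the authors' pipeline: `global_ssim` is a simplified single-window SSIM (the standard SSIM averages over local windows), and `histogram_embed` is a hypothetical toy stand-in for a pretrained encoder such as DINOv2 or CLIP, chosen only because it is invariant to spatial shifts. Frames are assumed to be flat lists of intensities in [0, 1].

```python
import math
from statistics import fmean


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (sequences of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed over one window covering all pixels.

    Assumption: this is a single-window variant for illustration only;
    library implementations average SSIM over local sliding windows.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean([(p - mx) ** 2 for p in x])
    vy = fmean([(q - my) ** 2 for q in y])
    cov = fmean([(p - mx) * (q - my) for p, q in zip(x, y)])
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return numerator / denominator


def histogram_embed(frame, bins=4):
    """Hypothetical toy 'embedding': an intensity histogram.

    Stand-in for a real encoder; it ignores pixel positions entirely,
    so it is unaffected by a pure shift of the image content.
    """
    counts = [0.0] * bins
    for p in frame:
        idx = min(int(p * bins), bins - 1)
        counts[idx] += 1.0
    return counts


if __name__ == "__main__":
    # A 16-pixel 1-D "frame" with a bright block, and the same block shifted.
    frame = [0.0] * 8 + [1.0] * 4 + [0.0] * 4
    shifted = [0.0] * 4 + [1.0] * 4 + [0.0] * 8

    print("SSIM (identical):", global_ssim(frame, frame))
    print("SSIM (shifted):  ", global_ssim(frame, shifted))
    print("Embedding cosine (shifted):",
          cosine_similarity(histogram_embed(frame), histogram_embed(shifted)))
```

In this toy case the shifted block drives the pixel-level score down while the position-agnostic embedding is unchanged, which mirrors the motion sensitivity the abstract attributes to SSIM. Real encoders are not fully shift-invariant, so how stable they are under motion is exactly what a study of this kind has to measure.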