Dynamic Reflections: Probing Video Representations with Text Alignment

ICLR 2026 Conference Submission 15905 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Platonic Representation Hypothesis, video understanding, video-text alignment
TL;DR: Our study of video-text representation alignment demonstrates that alignment is dramatically improved by using richer test-time data, such as multiple video frames and diverse captions.
Abstract: The alignment of representations from different modalities has recently been shown to provide insights into the structural properties of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of \textit{video} data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our study yields several key findings. First, we demonstrate that state-of-the-art video encoders (e.g., VideoMAEv2) achieve significantly stronger alignment with text than the best image encoders (e.g., DINOv2), suggesting that motion and temporal context provide important cues for relating complex dynamic scenes to their semantic descriptions. Second, we show that alignment quality is highly sensitive to the richness of the provided annotations: using multiple, diverse captions for a single video yields substantial gains over a single caption. Together, these two observations suggest that the limited cross-modal alignment observed in previous approaches is largely due to impoverished representations of \textit{both} visual (static images vs. videos) and text (a single caption vs. a collection) data given at test time. Furthermore, we investigate the correlation between semantic alignment and performance on non-semantic downstream tasks, providing initial evidence that strong semantic grounding may be linked to \textit{general-purpose} video representation and understanding. Ultimately, our work introduces video-text alignment as an informative way to probe the representation power of different encoders for spatio-temporal data.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15905
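As a concrete illustration of the alignment probing described in the abstract, below is a minimal sketch of one common way such cross-modal alignment is measured in the Platonic Representation Hypothesis literature: a mutual k-nearest-neighbor score between paired video and text embeddings. This is an assumption about the general technique, not necessarily the exact metric used by the authors; all names here (mutual_knn_alignment, video_emb, caption_embs) are hypothetical, and the caption mean-pooling step merely illustrates the "multiple diverse captions" lever the abstract highlights.

```python
import numpy as np

def mutual_knn_alignment(video_emb: np.ndarray, text_emb: np.ndarray, k: int = 10) -> float:
    """Mutual k-NN alignment between two embedding spaces.

    video_emb: (N, d_v) array, one embedding per video clip.
    text_emb:  (N, d_t) array, row i is the caption embedding paired with clip i.
    Returns the mean fraction of shared k-nearest neighbors, in [0, 1]:
    higher means the two spaces induce more similar neighborhood structure.
    """
    def knn_indices(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # cosine similarity
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]            # k nearest per row

    nn_video, nn_text = knn_indices(video_emb), knn_indices(text_emb)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_video, nn_text)]
    return float(np.mean(overlap))

if __name__ == "__main__":
    # Synthetic stand-ins for frozen-encoder outputs: N clips, C diverse
    # captions each. Real use would plug in, e.g., VideoMAEv2 clip features
    # and sentence embeddings of the captions.
    rng = np.random.default_rng(0)
    N, C, d_v, d_t = 200, 5, 768, 512
    video_emb = rng.standard_normal((N, d_v))
    caption_embs = rng.standard_normal((N, C, d_t))
    text_emb = caption_embs.mean(axis=1)  # pool diverse captions per video
    print(mutual_knn_alignment(video_emb, text_emb, k=10))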
```