Abstract: Video-text retrieval is a crucial task in numerous computer vision applications. In this paper, we focus on video-text retrieval involving complex action compositions, where a single video encompasses multiple primitive actions such as “sitting up”, “opening door”, “cooking food”, and “eating”. Although such action-compositional videos are common in real-world scenarios, they have received limited research attention, and existing retrieval methods often suffer significant performance degradation on them. To address this challenge, we present Hyperbolic Video-tExt Retrieval (HOVER), which models the hierarchical semantic relationships between videos and texts by embedding them in a low-dimensional hyperbolic space. Because hyperbolic space provides a geometric prior that naturally aligns with hierarchical data, it allows for more efficient and generalizable representations of video-text semantic hierarchies. HOVER first longitudinally decomposes each video into a hierarchical action tree, where primitive mono-actions are represented as leaf nodes and increasingly complex action compositions as parent nodes. The semantic structures and temporal dependencies of videos and texts are then encoded in hyperbolic space by exploiting hyperbolic distance, norm, and relative cosine similarity. Experimental results show that HOVER significantly outperforms traditional Euclidean-based methods, particularly in scenarios with limited training labels, achieving a notable performance improvement of 28.83%. Additionally, the hyperbolic video-text embeddings learned by HOVER generalize well to new datasets containing videos with varying levels of action complexity. The source code is available at https://github.com/shi-rq/HOVER
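To make the role of hyperbolic distance and norm more concrete, the following is a minimal sketch and not the HOVER implementation. The abstract does not state which model of hyperbolic space is used; the unit Poincaré ball with curvature -1 is assumed here purely for illustration, and the function names `poincare_distance` and `distance_to_origin` are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit
    Poincare ball (curvature -1). Assumed model, for illustration only."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    # Clamp to guard against rounding that pushes the argument below 1.
    return math.acosh(max(arg, 1.0))


def distance_to_origin(u):
    """Hyperbolic distance from the origin, a function of the Euclidean
    norm only. It grows without bound as the point nears the boundary."""
    norm = math.sqrt(sum(x * x for x in u))
    if norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(norm)


if __name__ == "__main__":
    near_origin = [0.1, 0.0]
    near_boundary = [0.9, 0.0]
    print(distance_to_origin(near_origin))
    print(distance_to_origin(near_boundary))
    print(poincare_distance(near_origin, near_boundary))
```

In this model, distance from the origin depends only on the embedding norm and grows rapidly near the boundary, which is why the norm is commonly used as a proxy for depth in a hierarchy. How HOVER assigns leaf and parent nodes of its action tree to norms is not specified in the abstract.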
External IDs: dblp:journals/tip/WenCSJYGYZ25