HOVER: Hyperbolic Video-text Retrieval

Ruiqi Shi; Jun Wen; Wei Ji; Menglin Yang; Difei Gao; Roger Zimmermann

22 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission

Keywords: video-text retrieval, hyperbolic representation, multi-modal learning

TL;DR: We propose Hyperbolic Video-text Retrieval (HOVER), which explicitly encodes the hierarchical semantic structure of videos and texts and aligns them in hyperbolic space.

Abstract: Retrieving complex videos with compositional actions is challenging, yet it has received little attention. Existing video-text retrieval methods ignore the multi-level semantic structure that relates mono-action videos to complex compositional videos, e.g., a single video containing "sitting up", "opening door", "cooking food", "eating", etc. In this paper, we propose to jointly embed videos and texts into a hyperbolic space where their hierarchical semantic relationships are explicitly encoded. Specifically, a video with action compositions is first decomposed longitudinally into an action tree with mono-action leaf or child nodes and increasingly complex parent nodes. Then, the is-a semantic relationship in videos/texts is represented in the hyperbolic space by employing hyperbolic norm constraints. These constraints ensure that parents have smaller norms than their children, thereby placing parents higher in the hierarchy than their children. Additionally, the temporal relationship among nodes is captured using relative cosine distances within the hyperbolic space. Experimental results show that the proposed method substantially outperforms its Euclidean counterparts, especially when the training set is small. Further, the learned hyperbolic video-text embeddings generalize well to novel datasets containing complex videos with varied levels of action composition.
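
The norm constraint described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the submission: the Poincaré ball model with curvature -1, the hinge form of the penalty, the `margin` value, and the function names `hyperbolic_norm` and `norm_order_penalty`. The sketch only shows one way a "parent norm smaller than child norm" ordering could be scored.

```python
import math


def hyperbolic_norm(x):
    """Hyperbolic distance from the origin to x in the Poincare ball (curvature -1).

    Assumption: embeddings live strictly inside the unit ball.
    """
    r = math.sqrt(sum(v * v for v in x))
    if r >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(r)


def norm_order_penalty(parent, child, margin=0.1):
    """Hypothetical hinge penalty for the parent/child ordering.

    It is zero once the parent is closer to the origin than the child
    by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin + hyperbolic_norm(parent) - hyperbolic_norm(child))


# Toy 2-D embeddings: a complex (parent) node near the origin and a
# mono-action (child) node nearer the boundary of the ball.
parent = [0.1, 0.0]
child = [0.6, 0.3]

print(norm_order_penalty(parent, child))      # ordering satisfied
print(norm_order_penalty(child, parent) > 0)  # ordering violated
```

Under these assumptions, a pair that respects the hierarchy contributes no penalty, while a swapped pair is penalized in proportion to how far the ordering is violated.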

Supplementary Material: zip

Primary Area: representation learning for computer vision, audio, language, and other modalities

Submission Number: 4637
