REVEAL: Advancing Relation-based Video Understanding for Video-Question-Answering

ICLR 2026 Conference Submission 6947 Authors

16 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Video Representation Learning, Video-Text Contrastive Alignment, Video-Relations Alignment
TL;DR: We introduce the Many-to-Many Noise Contrastive Estimation (MM-NCE) loss to match a single video to many text-based relations derived from dense captions.
Abstract: Video Question-Answering (Video-QA) requires capturing complex visual relation changes over time, which remains a challenge even for advanced Vision-Language Models (VLMs), in part because the visual content must be condensed into a reasonably sized input for those models. To address this problem, we propose RElation-based Video rEpresentAtion Learning (REVEAL), a framework designed to capture visual relation information by encoding it into structured, decomposed representations. Specifically, inspired by spatiotemporal scene graphs, we propose to encode video sequences as sets of relation triplets of the form (subject-predicate-object) over time via their language embeddings. To this end, we extract explicit relations from video captions and introduce a Many-to-Many Noise Contrastive Estimation loss (MM-NCE), together with a Q-Former architecture, to align an unordered set of video-derived queries with the corresponding text-based relation descriptions. During inference, the resulting Q-Former produces an efficient token representation that can serve as input to a VLM for Video-QA. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA. The results show that the resulting query-based video representation outperforms global alignment-based CLS or patch token representations and achieves competitive results against state-of-the-art models, particularly on tasks requiring temporal reasoning and relation comprehension. The code and models will be publicly released upon acceptance.
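The abstract names the MM-NCE objective but does not spell it out. As a rough, non-authoritative illustration, the PyTorch sketch below shows one plausible multi-positive formulation in which each video's unordered query set is contrasted against every relation text in the batch, with the relations extracted from that video's captions serving as positives. The function name, the max-over-queries matching, and the temperature value are all assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mm_nce_loss(queries: torch.Tensor,
                relations: torch.Tensor,
                video_ids: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical many-to-many NCE sketch (not the authors' code).

    queries:   (B, M, D)  M Q-Former queries per video in a batch of B videos
    relations: (R, D)     text embeddings of all relation triplets in the batch
    video_ids: (R,)       index of the video each relation was extracted from
    """
    q = F.normalize(queries, dim=-1)              # cosine-normalize queries
    r = F.normalize(relations, dim=-1)            # cosine-normalize relations

    # Similarity of every query of every video to every relation in the batch.
    sim = torch.einsum("bmd,rd->bmr", q, r)       # (B, M, R)

    # Assumption: each relation is scored by its best-matching query, so the
    # query set stays unordered and no fixed query-relation pairing is needed.
    scores = sim.max(dim=1).values / temperature  # (B, R)

    # Many-to-many targets: relation j is a positive for video i iff it was
    # derived from video i's captions; all other relations act as negatives.
    batch_idx = torch.arange(scores.size(0), device=scores.device)
    targets = (video_ids.unsqueeze(0) == batch_idx.unsqueeze(1)).float()

    # Multi-positive InfoNCE: average log-likelihood over each video's positives.
    log_prob = scores.log_softmax(dim=1)
    loss = -(log_prob * targets).sum(1) / targets.sum(1).clamp(min=1)
    return loss.mean()
```

Under this reading, the max over queries is what makes the loss set-to-set rather than pairwise: a relation only needs one query to account for it, so different queries are free to specialize to different triplets.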
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6947