Video Entailment via Reaching a Structure-Aware Cross-modal Consensus

Published: 01 Jan 2023, Last Modified: 13 Nov 2024. ACM Multimedia 2023. License: CC BY-SA 4.0
Abstract: This paper targets the task of video entailment, which aims to achieve a thorough comprehension of a multi-modal video and infer whether a natural language statement entails or contradicts it. Despite recent progress, most existing methods focus on designing a vision-language encoder for multi-modal feature extraction, ignoring the underlying consensus knowledge shared between the two modalities and thereby limiting reasoning performance. As human beings, we make sense of the world by synthesizing information from different sense perceptions; this allows us to reach a consensus among multiple modalities, form a more thorough and coherent representation of our surroundings, and perform complicated understanding tasks. In this paper, we attempt to recreate this ability in order to infer the truthfulness of a given statement in the context of video entailment. To this end, we propose a unified structure-aware cross-modal consensus method that excavates the consensus semantics shared between the video and language modalities and incorporates them into video entailment as statement-related clues. Specifically, the consensus information is obtained by filtering away redundant information, using the global information from one modality and the local complementary information from the other. Moreover, a consensus-guided graph reasoning method is designed to explore inter-modality consistency and emphasize the features most relevant to the judged statement, producing the inference results. Extensive experiments on two benchmarks demonstrate the accurate and robust performance of our approach compared to state-of-the-art methods. Code is available at https://github.com/Feliciaxyao/MM2023-SACCN.
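To make the consensus-filtering idea in the abstract concrete, below is a minimal PyTorch sketch of one plausible reading: the pooled (global) representation of one modality gates the local features of the other, down-weighting redundant content. All class names, dimensions, and the sigmoid gate here are illustrative assumptions, not the authors' released implementation; refer to the linked repository for the actual method.

```python
import torch
import torch.nn as nn


class CrossModalConsensus(nn.Module):
    """Hypothetical sketch: a global summary of one modality filters the
    local (frame/token-level) features of the other, approximating the
    redundancy-filtering step described in the abstract."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Fuses a local feature with the cross-modal global feature
        # to produce a per-channel gate in [0, 1].
        self.gate = nn.Linear(dim * 2, dim)

    def forward(self, local_feats: torch.Tensor, global_feat: torch.Tensor):
        # local_feats: (B, N, D) local features of one modality
        # global_feat: (B, D) pooled summary of the other modality
        g = global_feat.unsqueeze(1).expand_as(local_feats)  # (B, N, D)
        weights = torch.sigmoid(self.gate(torch.cat([local_feats, g], dim=-1)))
        # Redundant, statement-irrelevant content is down-weighted.
        return weights * local_feats


if __name__ == "__main__":
    video_local = torch.randn(2, 16, 512)  # 16 clip-level video features
    text_global = torch.randn(2, 512)      # pooled statement embedding
    consensus = CrossModalConsensus(512)(video_local, text_global)
    print(consensus.shape)  # torch.Size([2, 16, 512])
```

In the full method, such consensus features would then feed the consensus-guided graph reasoning module, which propagates information across modalities to score entailment versus contradiction; that stage is omitted from this sketch.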