Abstract: Hypothesis inference, a sophisticated cognitive process that allows humans to construct plausible explanations for incomplete observations, is paramount to our ability to make sense of the world around us. Despite the universality of this skill, it remains under-explored in multi-modal AI, where it requires analyzing observations, recalling information from memory, and generating explanations. In this work, we propose the Cross-modal Observation hypothesIs iNference (COIN) task. Given a textual description of a partially observed event, COIN aims to recall the most probable event from the visual mind (video pool) and to infer the subsequent action flow connecting the visual mind event and the observed textual event. To advance this field, we introduce a large-scale text-video dataset, Tex-COIN, which contains 39,796 meticulously annotated hypothesis inference examples together with auxiliary commonsense knowledge (appearance, clothing, action, etc.) for key video characters. Building on Tex-COIN, we design a strong baseline, COINNet, which features two key components: 1) aligning temporally displaced textual observations with target videos via transformer-based multi-task learning, and 2) inferring the action flow with non-parametric inference grounded in graph theory. Extensive experiments on Tex-COIN validate the effectiveness of COINNet, which significantly outperforms state-of-the-art methods.
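To give a concrete flavor of the second component, the sketch below shows one way a non-parametric, graph-based action-flow inference could look: actions form nodes of a directed graph, edge weights are negative log transition probabilities, and the most probable flow between the recalled video event and the observed textual event is recovered as a shortest path. All node names, transition probabilities, and the shortest-path formulation are illustrative assumptions, not the actual COINNet implementation.

```python
# Minimal sketch of graph-based action-flow inference (illustrative only).
import math
import networkx as nx

# Hypothetical action graph: nodes are actions/events; edge weights encode
# transition probabilities (e.g., estimated from annotated action flows).
transitions = [
    ("enter_kitchen", "open_fridge", 0.7),
    ("enter_kitchen", "wash_hands", 0.3),
    ("open_fridge", "take_ingredients", 0.8),
    ("wash_hands", "take_ingredients", 0.6),
    ("take_ingredients", "cook_meal", 0.9),
]

G = nx.DiGraph()
for src, dst, p in transitions:
    # Convert probabilities to additive costs so that the shortest path
    # corresponds to the most probable action flow.
    G.add_edge(src, dst, weight=-math.log(p))

recalled_video_event = "enter_kitchen"  # event recalled from the video pool
observed_text_event = "cook_meal"       # partially observed textual event

flow = nx.shortest_path(G, recalled_video_event, observed_text_event, weight="weight")
print(" -> ".join(flow))  # enter_kitchen -> open_fridge -> take_ingredients -> cook_meal
```

Under these assumptions, the inferred flow is simply the minimum-cost path in the action graph; the paper's actual inference operates over Tex-COIN annotations rather than the toy transitions used here.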
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We propose the Cross-modal Observation Hypothesis Inference (COIN) task, which simulates the human cognitive process of hypothesis inference across text and video modalities.
Supplementary Material: zip
Submission Number: 4894