VilNMN: A Neural Module Network approach to Video-Grounded Language Tasks

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: neural modular networks, video-grounded dialogues, dialogue understanding, video understanding, video QA, video-grounded language tasks
Abstract: Neural module networks (NMN) have achieved success in image-grounded tasks such as question answering (QA) on synthetic images. However, NMN approaches have received very limited study in video-grounded language tasks. These tasks extend the complexity of traditional visual tasks with additional temporal variance in the visual input. Motivated by recent NMN approaches on image-grounded tasks, we introduce Visio-Linguistic Neural Module Network (VilNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VilNMN first decomposes the language input to explicitly resolve entity references and detect the corresponding actions in the question. Detected entities and actions are then used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VilNMN achieves promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.
One-sentence Summary: We propose VilNMN, a novel neural module network approach for video-grounded language tasks, which achieves strong performance in both video QA and video-grounded dialogue tasks.
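As an illustration only (not taken from the paper), the sketch below shows how detected entities and actions could serve as parameters that instantiate attention modules over video features in an NMN-style pipeline. All names (AttendModule, VilNMNSketch, answer_head) and the combination strategy are hypothetical assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an NMN-style pipeline for video-grounded QA.
# Entities and actions detected from the question parameterize modules
# that attend over video frame features; names and design are illustrative.
import torch
import torch.nn as nn


class AttendModule(nn.Module):
    """Scores each video frame against a query embedding (an entity or an action)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats, query):        # video_feats: (T, D), query: (D,)
        scores = self.proj(video_feats) @ query   # one relevance score per frame, (T,)
        return torch.softmax(scores, dim=0)       # attention distribution over frames


class VilNMNSketch(nn.Module):
    """Toy pipeline: entity module and action module produce frame attentions,
    which are combined to pool a visual cue and predict an answer."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.entity_module = AttendModule(dim)    # instantiated for detected entities
        self.action_module = AttendModule(dim)    # instantiated for detected actions
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, video_feats, entity_emb, action_emb):
        ent_att = self.entity_module(video_feats, entity_emb)
        act_att = self.action_module(video_feats, action_emb)
        joint = ent_att * act_att                 # frames relevant to both entity and action
        joint = joint / (joint.sum() + 1e-8)      # renormalize the joint attention
        pooled = joint @ video_feats              # (D,) attended visual cue
        return self.answer_head(pooled)           # answer logits


# Toy usage with random features: 16 frames, 64-dim features, 10 candidate answers.
T, D = 16, 64
model = VilNMNSketch(D, num_answers=10)
logits = model(torch.randn(T, D), torch.randn(D), torch.randn(D))
```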
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=TpPiT5OIca