Self-Adaptive Sampling for Accurate Few-Frame Video Question Answering in Image Text ModelsDownload PDF

Anonymous

16 Dec 2023ACL ARR 2023 December Blind SubmissionReaders: Everyone
TL;DR: video question answering; image text model
Abstract: Image–text models (ITMs) is the prevalent architecture to solve video question–answering tasks, which requires only a few input frames to save huge computational cost compared to video–language models. However, we find existent ITM video question–answering solutions either 1) adopt simplistic and unintentional sampling strategies, which may miss key frames to offer the answer clues; or 2) sample a large number of frames into divided groups, which the computational sources can not accommodate. In this work, we aim at an efficient sampling method towards the few-frame situations. We first summarize a family of prior sampling methods based on question–frame correlation into a unified one, dubbed most implied frames. Through some primary results and analysis, we form our hypothesis from which we further propose the other method most dominant frames. Experimental results on four public datasets and three advanced ITMs demonstrate that our proposed strategies can boost the performance for image–text pretrained models, and have a wide application scenario in terms of model architectures and dataset types.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
0 Replies

Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview