Keywords: Video Reasoning, External Memory, Neural Sampling
Abstract: Long-form video reasoning is essential for applications such as video retrieval, summarization, and question answering. However, existing methods often require significant computational resources and are limited by GPU memory constraints. To address this challenge, we present the Long-Video Memory Network (LVM-Net), a novel video reasoning method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. By leveraging a neural sampler that identifies discriminative memory tokens, LVM-Net achieves improved efficiency. Furthermore, LVM-Net requires only a single pass over the video, further enhancing efficiency. Our results on the Rest-ADL dataset demonstrate an 18x to 75x speedup in inference time for long-form video retrieval and question answering, with competitive predictive performance.
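The abstract describes a fixed-size memory filled by a neural sampler in a single streaming pass. The sketch below is a minimal illustration of that general idea, not the authors' implementation: `PatchScorer`, `update_memory`, the memory size of 256, and the 768-dimensional tokens are all assumed for illustration, and the hard top-k selection stands in for whatever sampler the paper actually trains.

```python
# Hedged sketch (NOT the paper's code): one plausible way a fixed-size token
# memory with a learned sampler could work. PatchScorer, update_memory,
# MEM_SIZE, and DIM are illustrative assumptions, not from the submission.
import torch
import torch.nn as nn

MEM_SIZE = 256   # fixed memory budget (assumed)
DIM = 768        # patch-token dimension (assumed)

class PatchScorer(nn.Module):
    """Tiny MLP that scores how 'discriminative' each patch token is."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, dim) -> (N,) scalar scores
        return self.net(tokens).squeeze(-1)

@torch.no_grad()
def update_memory(memory, mem_scores, new_tokens, scorer):
    """Single-pass update: keep the top-MEM_SIZE tokens seen so far by score."""
    new_scores = scorer(new_tokens)
    all_tokens = torch.cat([memory, new_tokens], dim=0)
    all_scores = torch.cat([mem_scores, new_scores], dim=0)
    k = min(MEM_SIZE, all_tokens.shape[0])
    top = torch.topk(all_scores, k).indices
    return all_tokens[top], all_scores[top]

# Usage: stream a long video chunk by chunk in one pass; GPU memory stays
# bounded by MEM_SIZE regardless of video length.
scorer = PatchScorer(DIM)
memory = torch.empty(0, DIM)
mem_scores = torch.empty(0)
for _ in range(10):                       # e.g., 10 chunks of a long video
    chunk_tokens = torch.randn(512, DIM)  # stand-in for encoded patch tokens
    memory, mem_scores = update_memory(memory, mem_scores, chunk_tokens, scorer)
print(memory.shape)  # torch.Size([256, 768])
```

The single pass and the constant-size memory are what would yield the claimed inference-time savings: each chunk is scored once and discarded, so cost grows linearly with video length while the downstream reasoning model only ever attends over MEM_SIZE tokens.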
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10964