Keywords: Long-form Video Understanding, Multimodal Understanding
Abstract: Despite the great advances in video understanding with deep neural networks, current solutions still struggle with input videos that last for minutes, if not hours. To mitigate this issue, existing approaches typically build a memory cache of dense visual embeddings on top of video transformers to model long-range spatiotemporal dependencies. However, even with hundreds of extended memory tokens, their results remain unsatisfactory.
In this paper, we argue that more compact yet informative memory embeddings can effectively improve performance. To this end, we introduce TinyMem, a model built upon a tiny multimodal memory for long-form video action detection. In particular, we condense redundant video content into succinct descriptions to derive abstract text semantics. Subsequently, we integrate region-condensed visual embeddings with these text embeddings. TinyMem outperforms a range of state-of-the-art models on AVA v2.2, Epic-Kitchens-100, and Breakfast while using a highly condensed memory, e.g., 37.4 mAP with TinyMem-24-12 on AVA v2.2 with 5 times fewer memory tokens than a baseline built on dense visual memory embeddings.
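To make the idea of a tiny multimodal memory concrete, the sketch below shows one plausible way to fuse region-pooled visual tokens with text embeddings of condensed clip descriptions into a small memory bank attended by the current clip's features. All module names, token counts, and dimensions here are illustrative assumptions (they are not taken from the paper and are not an interpretation of the TinyMem-24-12 naming); this is a minimal sketch of the general technique, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TinyMultimodalMemory(nn.Module):
    """Hypothetical sketch: a compact memory that mixes region-condensed
    visual tokens with text tokens from condensed descriptions, queried
    by the current clip's features via cross-attention."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.region_pool = nn.AdaptiveAvgPool2d(1)   # one token per past region crop
        self.vis_proj = nn.Linear(dim, dim)          # project condensed visual tokens
        self.text_proj = nn.Linear(dim, dim)         # project description embeddings
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_tokens, region_feats, text_feats):
        # query_tokens: (B, Nq, dim) features of the current clip
        # region_feats: (B, R, dim, H, W) feature maps of past region crops
        # text_feats:   (B, T, dim) embeddings of condensed clip descriptions
        B, R, D, H, W = region_feats.shape
        vis_tokens = self.region_pool(region_feats.flatten(0, 1)).view(B, R, D)
        # Tiny memory bank: R visual tokens + T text tokens, far fewer than
        # the hundreds of dense tokens a conventional memory cache would keep.
        memory = torch.cat([self.vis_proj(vis_tokens),
                            self.text_proj(text_feats)], dim=1)
        attended, _ = self.cross_attn(query_tokens, memory, memory)
        return query_tokens + attended               # residual fusion with memory context
```

A detector head could then consume the memory-augmented query tokens, e.g., `head(mem(query_tokens, region_feats, text_feats))`; the key design point is that the memory length scales with a handful of condensed tokens rather than with dense per-frame embeddings.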
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5517