Adaptive Memory Mechanism in Vision Transformer for Long-form Video Understanding

Zhenshun Liu; Zijian Lei; Kejing Yin; William K. Cheung

Submitted to ICLR 2025. 23 Sept 2024 (modified: 05 Feb 2025). License: CC BY 4.0

Keywords: Key-Value Cache, Vision Transformer, Video Understanding

TL;DR: We introduce the Adaptive Memory Vision Transformer (AMViT), which uses an Adaptive Memory Mechanism to dynamically adjust its Temporal Receptive Field, improving both effectiveness and efficiency in long-form video understanding.

Abstract: In long-form video understanding, selecting an optimal Temporal Receptive Field (TRF) is crucial for Vision Transformer (ViT) models because the motion content of videos is dynamic and varies in duration and velocity. A short TRF can lose critical information, while a long TRF may reduce a ViT's performance and computational efficiency because of unrelated content in the video and the quadratic complexity of the attention mechanism. To tackle this issue, we introduce the Adaptive Memory Mechanism (AMM), which enables a ViT to adjust its TRF dynamically in response to the video's changing content. Memory-augmented methods discard the Key-Value (KV) Cache from the earliest inference iterations once the configured limit is reached. Our approach instead uses a Memory Bank (MB) to retain the most important embeddings from the KV Cache that would otherwise be discarded. The selection is based on the attention scores computed between the Class Token (CLS) of the current iteration and the KV Cache of previous iterations. We demonstrate that the Adaptive Memory Vision Transformer (AMViT) outperforms existing methods across a diverse array of tasks (action recognition, action anticipation, and action detection).
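
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the exact scoring or selection rule, so the sketch assumes single-head scaled dot-product attention, a softmax over the evicted keys, and a simple top-k rule. The names `attention_scores`, `select_for_memory_bank`, and `bank_size` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def attention_scores(cls_query, keys):
    """Softmax of scaled dot products between one query vector and each key."""
    scale = 1.0 / math.sqrt(len(cls_query))
    logits = [scale * sum(q * k for q, k in zip(cls_query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_for_memory_bank(cls_query, evicted_keys, evicted_values, bank_size):
    """Keep the bank_size evicted (key, value) pairs the CLS query attends to most.

    Assumption: top-k by attention score; the paper may use a different rule.
    """
    if not evicted_keys or bank_size <= 0:
        return [], []
    scores = attention_scores(cls_query, evicted_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:bank_size])  # restore the original temporal order
    return [evicted_keys[i] for i in kept], [evicted_values[i] for i in kept]


# Toy example: four cached entries are about to be evicted, the bank holds two.
cls_query = [1.0, 0.0]
keys = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5], [1.5, 0.2]]
values = ["v0", "v1", "v2", "v3"]
bank_keys, bank_values = select_for_memory_bank(cls_query, keys, values, bank_size=2)
print(bank_values)  # ['v1', 'v3']
```

In this toy case the CLS query aligns most strongly with the second and fourth keys, so those two entries survive eviction while the others are dropped. Keeping the retained entries in temporal order is a design choice of the sketch, not something stated in the abstract.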

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 2872
