Keywords: Multi-modal Vision, Video Understanding
TL;DR: We introduce Video-EM, a training-free framework that treats long-video question answering as an episodic memory retrieval-and-reasoning problem, inspired by human cognitive psychology.
Abstract: Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context-window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods reduce the problem to static text-image matching, overlooking the spatio-temporal relationships that are crucial for capturing scene transitions and contextual continuity, and they may yield redundant keyframes with limited information, diluting the salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory and designed to facilitate robust, contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both the spatial relationships and the temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain-of-thought (CoT) reasoning with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on multiple mainstream long-video benchmarks demonstrate the effectiveness of Video-EM, which achieves highly competitive results while using fewer frames.
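The abstract outlines a three-stage pipeline: retrieve keyframes, group them into temporally ordered episodic events, and iteratively select a minimal yet informative subset of those memories via CoT reasoning before answering. As a rough illustration only, and not the paper's implementation, the Python sketch below assumes hypothetical `Keyframe` and `Episode` structures, a simple temporal-proximity grouping heuristic, and a user-supplied sufficiency check standing in for the LLM's chain-of-thought judgment.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All names (Keyframe, Episode, group_into_episodes, select_memories)
# are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Keyframe:
    timestamp: float   # position of the frame in the video (seconds)
    caption: str       # textual description produced by a captioner
    relevance: float   # retrieval score against the question


@dataclass
class Episode:
    """A temporally ordered group of keyframes treated as one episodic event."""
    frames: List[Keyframe]

    def summary(self) -> str:
        # Concatenate captions in temporal order to preserve scene transitions.
        ordered = sorted(self.frames, key=lambda f: f.timestamp)
        return " -> ".join(f.caption for f in ordered)


def group_into_episodes(frames: List[Keyframe], max_gap: float = 10.0) -> List[Episode]:
    """Group retrieved keyframes into episodes by temporal proximity (assumed heuristic)."""
    episodes: List[Episode] = []
    for frame in sorted(frames, key=lambda f: f.timestamp):
        if episodes and frame.timestamp - episodes[-1].frames[-1].timestamp <= max_gap:
            episodes[-1].frames.append(frame)
        else:
            episodes.append(Episode(frames=[frame]))
    return episodes


def select_memories(
    question: str,
    episodes: List[Episode],
    is_sufficient: Callable[[str, List[Episode]], bool],
    max_rounds: int = 5,
) -> List[Episode]:
    """Iteratively add the most relevant remaining episode until the sufficiency
    check (standing in for CoT reasoning) judges the question answerable."""
    remaining = sorted(
        episodes,
        key=lambda e: max(f.relevance for f in e.frames),
        reverse=True,
    )
    selected: List[Episode] = []
    for _ in range(min(max_rounds, len(remaining))):
        selected.append(remaining.pop(0))
        if is_sufficient(question, selected):
            break
    return selected
```

In the full system, `is_sufficient` would be an LLM call that reasons step by step over the episode summaries and the question; it is left as a plain callable here so the sketch stays self-contained and runnable.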
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8370