Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Published: 06 Mar 2025, Last Modified: 15 Apr 2025
Venue: ICLR 2025 Workshop World Models
License: CC BY 4.0
Keywords: Video Understanding, Multimodal Large Language Model, Temporal Reasoning, Event Understanding
Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance in understanding holistic videos. However, their ability to process streaming events—represented as sequences of visual clips—remains underexplored. Intuitively, leveraging prior events as memory can enrich the contextual and temporal understanding of current events. Inspired by this, we show in this paper that using preceding events as context, i.e., memory, helps MLLMs better comprehend video events. However, such memory relies on MLLMs’ predictions of prior events and inevitably accumulates misinformation in a streaming setting, leading to confabulation in contexts and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates the impact of confabulated memory for improved memory-enhanced event understanding.
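The abstract describes the approach only at a high level; the paper, not this page, specifies the actual method. As a rough illustration of the memory-enhanced streaming loop, the sketch below assumes a generic MLLM interface (a hypothetical `mllm.predict` returning an event description plus a confidence score) and uses a simple confidence threshold as a stand-in for the confabulation-aware memory modification; both are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class StreamingEventMemory:
    """Minimal sketch of memory-enhanced streaming event understanding.

    Descriptions of prior events are kept as context (memory) for the
    current clip. Entries the model is not confident about are treated
    as likely confabulations and never committed to memory, so they
    cannot accumulate and mislead later predictions.
    """
    confidence_threshold: float = 0.5          # hypothetical cutoff
    memory: list[str] = field(default_factory=list)

    def describe(self, mllm, clip) -> str:
        # Condition the MLLM on trusted prior events plus the new clip.
        context = " ".join(self.memory)
        # `mllm.predict` is an assumed interface: (description, confidence).
        description, confidence = mllm.predict(clip, context=context)

        # Confabulation-aware modification (illustrative): only commit
        # predictions above the confidence threshold to memory.
        if confidence >= self.confidence_threshold:
            self.memory.append(description)
        return description
```

In a streaming setting, `describe` would be called once per incoming clip, so memory grows only with predictions the model itself trusts; the paper's actual modification criterion may differ from this threshold heuristic.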
Submission Number: 78
