Keywords: Video Understanding, Multimodal Large Language Model, Temporal Reasoning, Event Understanding
Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance in holistic video understanding. However, their ability to process streaming events—represented as sequences of visual clips—remains underexplored. Intuitively, leveraging prior events as memory can enrich the contextual and temporal understanding of current events. Motivated by this, we show in this paper that using preceding events as context, i.e., memory, helps MLLMs better comprehend video events. However, such memory relies on MLLMs' predictions of prior events and inevitably accumulates misinformation in a streaming setting, leading to confabulated contexts and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates the impact of confabulated memory, yielding improved memory-enhanced event understanding.
Submission Number: 78