Keywords: event, multimodal learning, long sequence, language and vision
TL;DR: We use spatiotemporal compression and two-stage cross-modal optimization to condense long event streams, and we build a novel event–text dataset and multi-task benchmark, boosting descriptive accuracy and semantic understanding.
Abstract: Event cameras operate asynchronously with microsecond-level temporal precision and generate sparse event streams, enabling low-latency visual perception under high dynamic range conditions. However, current multimodal large language models (MLLMs) remain suboptimal when handling such data: they either fail to effectively interpret event streams or are limited to very short temporal sequences. To address this problem, we propose a unified approach for long event-stream–text understanding. This method employs an adaptive compression mechanism that significantly reduces input volume while preserving key motion and structural cues, thereby supporting long-term cross-modal reasoning. The training pipeline adopts a two-stage optimization process: the model is first guided to develop representational capacity for streaming data, followed by cross-modal alignment to enhance semantic consistency between event and textual modalities. To handle the substantial temporal information inherent in long event streams, the model uses text-guided cross-modal queries to select salient features and combines hierarchical clustering with similarity scoring to extract representative event segments. During training, a large-scale event–text aligned dataset is curated and constructed, facilitating more effective embedding of event features within the semantic space of language models. In addition, we establish a comprehensive benchmark covering a diverse set of tasks including reasoning, captioning, classification, temporal localization, and moment retrieval. Experimental results demonstrate that the proposed approach outperforms existing state-of-the-art MLLMs in both descriptive accuracy and semantic understanding on long-duration event streams. All datasets, code, and models will be released publicly.
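To make the segment-selection step concrete, below is a minimal sketch (not the authors' released code) of choosing representative event segments by hierarchically clustering segment embeddings and scoring each cluster against a text-query embedding; the array names (`segment_embeddings`, `query_embedding`) and the cluster count are hypothetical placeholders.

```python
# Hypothetical sketch: pick one representative event segment per cluster,
# using hierarchical clustering plus text-query similarity scoring.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def select_representative_segments(segment_embeddings, query_embedding, num_clusters=8):
    """Return indices of one representative segment per cluster.

    segment_embeddings: (N, D) array, one embedding per event segment.
    query_embedding:    (D,) array, embedding of the text query.
    """
    # L2-normalize so dot products equal cosine similarity.
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    qry = query_embedding / np.linalg.norm(query_embedding)

    # Agglomerative clustering over segment embeddings (average linkage, cosine distance).
    labels = fcluster(
        linkage(seg, method="average", metric="cosine"),
        t=num_clusters,
        criterion="maxclust",
    )

    # Within each cluster, keep the segment most similar to the text query.
    scores = seg @ qry
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        selected.append(int(members[np.argmax(scores[members])]))
    return sorted(selected)


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for real event/text features.
    rng = np.random.default_rng(0)
    segments = rng.normal(size=(64, 256))  # 64 event segments, 256-dim embeddings
    query = rng.normal(size=256)           # text-query embedding
    print(select_representative_segments(segments, query))
```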
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25042