DIME: Tackling Density Imbalance for High-Performance and Low-Latency Event-Based Object Detection

ICLR 2026 Conference Submission 16375 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: event; object detection
Abstract: Event-based object detection takes advantage of the high temporal resolution and dynamic range of event cameras, offering significant benefits in scenarios involving fast motion and challenging lighting conditions. Typically, event streams are first converted into frame sequences through frame-based representations and then processed with spatiotemporal feature fusion, much like video. However, video-based processing methods overlook the sparse and non-uniform nature of event streams, making them inadequate for meeting effectiveness and low-latency processing demands. To address these challenges, we rebuild the spatiotemporal dependency model of event streams along three key axes: First, we design a spatiotemporal linear attention that directly builds dependencies at the patch level while maintaining spatial parallelism; Second, we incorporate a frame-level temporal decay and spatial position encoding mechanism into the linear attention, which adaptively adjusts the internal state of the network based on frame information; Third, we propose a structure-level local and global linear attention architecture, which extracts event features with our linear model at different granularities. Our model achieves SOTA performance on the Gen1 and 1Mpx datasets, becoming the first to surpass 50\% mAP on 1Mpx with a compact model size, while reducing parameters by 3.2× and runtime by 5.1× compared to similar-performing methods and outperforming lightweight models by +4.3\% mAP.
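To make the second component concrete, the sketch below shows one plausible reading of frame-level linear attention with temporal decay. It is an illustrative PyTorch implementation, not the authors' released code: the function name `decayed_linear_attention`, the ELU-based feature map, and the per-frame scalar `decay` are all assumptions; the paper's actual decay parameterization and position encoding may differ.

```python
import torch

def decayed_linear_attention(q, k, v, decay):
    """Hypothetical sketch: linear attention whose carried state is
    decayed once per frame, so older frames contribute less.

    q, k, v: (T, N, d) tensors -- T frames, N patches per frame, d channels.
    decay:   (T,) per-frame decay factors in (0, 1].
    Returns: (T, N, d) attention outputs.
    """
    T, N, d = q.shape
    # Non-negative feature map, a common choice in linear attention.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)

    state = torch.zeros(d, d)   # running K^T V summary across frames
    norm = torch.zeros(d)       # running key sum for normalization
    outs = []
    for t in range(T):
        # Decay the carried state before absorbing the current frame.
        state = decay[t] * state + k[t].transpose(0, 1) @ v[t]
        norm = decay[t] * norm + k[t].sum(dim=0)
        # All patches of a frame attend via one matmul against the shared
        # state, which is what preserves spatial parallelism.
        out = (q[t] @ state) / (q[t] @ norm).clamp(min=1e-6).unsqueeze(-1)
        outs.append(out)
    return torch.stack(outs)

# Usage example with toy shapes: 4 frames, 16 patches, 8 channels.
q = torch.randn(4, 16, 8)
k = torch.randn(4, 16, 8)
v = torch.randn(4, 16, 8)
out = decayed_linear_attention(q, k, v, decay=torch.full((4,), 0.9))
print(out.shape)  # torch.Size([4, 16, 8])
```

The recurrence keeps per-frame cost linear in the number of patches, which is consistent with the abstract's low-latency claim; the exact state-update rule used in DIME would need to be confirmed against the paper.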
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16375