Abstract: Conventional RGB cameras struggle in high-speed vision due to motion blur at sampling rates above 60 Hz and limited dynamic range (<60 dB). To address these limitations, we propose a multimodal framework integrating event cameras, leveraging their microsecond temporal resolution (1 μs) and 140 dB dynamic range. Our key innovations include: (1) DSF-Net: a spike-triggered dynamic sparse fusion network that effectively and efficiently fuses discriminative Event-RGB features, enabling high-speed object detection; (2) HS-Multi: the first large-scale Event-RGB dataset specifically designed for high-speed objects, featuring 73k annotated samples across 11 object categories, with dedicated high-speed settings (HS-CAR and HS-FAN). Extensive evaluations on three benchmarks (HS-CAR, HS-FAN, PKU-DDD17-Car) demonstrate consistent advantages: (a) High-speed detection: DSF-Net significantly surpasses both unimodal (RGB/Event) and existing multimodal fusion methods, most notably on HS-CAR, where it achieves 87.3% mAP (+9.5% vs. RGB-only); (b) Generalization: DSF-Net achieves 50.1% mAP on PKU-DDD17-Car, surpassing the prior multimodal framework in both accuracy (+4.1%) and speed (+8.3 fps).
DOI: 10.1145/3746027.3755846
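The abstract names spike-triggered dynamic sparse fusion but gives no implementation details. The minimal PyTorch sketch below illustrates one plausible reading of such a block: event activity gates where event features are injected into the RGB feature stream, so fusion is spatially sparse and motion-driven. The class name, gating threshold, and straight-through gate are illustrative assumptions, not the authors' DSF-Net.

```python
# A minimal sketch of a spike-triggered dynamic sparse fusion block,
# assuming matched-resolution RGB and event feature maps.
# Names and hyperparameters are hypothetical, not from the paper.
import torch
import torch.nn as nn

class SpikeTriggeredFusion(nn.Module):
    def __init__(self, channels: int, spike_threshold: float = 0.5):
        super().__init__()
        self.spike_threshold = spike_threshold  # assumed gating threshold
        # 1x1 convs project each modality into a shared feature space
        self.rgb_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.evt_proj = nn.Conv2d(channels, channels, kernel_size=1)
        # Predicts a per-location event-activity map in [0, 1]
        self.gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
        rgb = self.rgb_proj(rgb_feat)
        evt = self.evt_proj(evt_feat)
        activity = self.gate(evt_feat)  # (N, 1, H, W)
        # Hard spatial mask: inject event features only where activity
        # exceeds the threshold (sparse, motion-rich regions); a
        # straight-through estimator keeps the gate differentiable.
        hard = (activity > self.spike_threshold).float()
        mask = hard + activity - activity.detach()
        return rgb + mask * evt  # RGB stream plus sparsely gated event cues

# Usage: fuse 256-channel feature maps from the two backbones.
fusion = SpikeTriggeredFusion(channels=256)
rgb_feat = torch.randn(2, 256, 64, 64)
evt_feat = torch.randn(2, 256, 64, 64)
fused = fusion(rgb_feat, evt_feat)  # (2, 256, 64, 64)
```

Under these assumptions, compute concentrates on regions where the event camera actually fires, which is one way a fusion network can stay efficient at high frame rates while still exploiting the event stream's temporal resolution.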