Abstract: Event cameras, with high temporal resolution and high dynamic range, have shown great potential in extreme scenarios such as high-speed motion and low illumination. However, previous event representation methods typically aggregate event data into a single dense tensor, overlooking the dynamic changes of events within a given time unit. This limitation can introduce historical artifacts and semantic inconsistencies, ultimately degrading model performance. Inspired by human visual priors, we propose a motion and appearance decoupling (MAD) event representation that disentangles the mixed spatial-temporal event tensor into two independent branches. This bio-inspired design helps the network extract discriminative temporal (i.e., motion) and spatial (i.e., appearance) information, thus reducing the network's learning burden on complex high-level interpretation tasks. In our method, an event motion guided attention (EMGA) module is designed to achieve temporal and spatial feature interaction and fusion sequentially. Based on EMGA, three specially designed decoder heads are proposed for representative event-based tasks (i.e., object detection, semantic segmentation, and human pose estimation). Experimental results demonstrate that our method achieves state-of-the-art performance on all three tasks, showing that it is an easy-to-implement replacement for current event-based methods. Our code is available at: https://github.com/ChenYichen9527/MAD-representation
DOI: 10.1109/TIP.2025.3607632
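Below is a minimal sketch of the motion/appearance decoupling idea summarized in the abstract, assuming raw events arrive as (x, y, t, p) tuples sorted by timestamp. The specific branch definitions used here (a normalized-timestamp map for the motion branch, per-polarity count maps for the appearance branch) are illustrative assumptions, not the paper's exact MAD formulation.

```python
# Illustrative sketch only: splits an event stream into a temporal ("motion")
# map and a spatial ("appearance") map, mirroring the two-branch idea in the
# abstract. Branch definitions are assumptions for illustration.
import numpy as np

def decouple_events(events: np.ndarray, height: int, width: int):
    """events: (N, 4) array with columns (x, y, t, p), p in {-1, +1},
    assumed sorted by timestamp t."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]

    # Appearance branch: spatial structure from per-pixel polarity counts.
    appearance = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(appearance[0], (y[p > 0], x[p > 0]), 1.0)
    np.add.at(appearance[1], (y[p < 0], x[p < 0]), 1.0)

    # Motion branch: temporal dynamics from the normalized timestamp of the
    # most recent event at each pixel (later events overwrite earlier ones
    # because the stream is sorted by time).
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    motion = np.zeros((height, width), dtype=np.float32)
    motion[y, x] = t_norm

    return motion[None], appearance  # shapes: (1, H, W) and (2, H, W)
```

Under these assumptions, the two returned tensors could feed two separate encoder branches whose features are later fused by an attention module, analogous to the two-branch design with EMGA-based fusion described in the abstract.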