Keywords: Event camera, self-supervised, transformer, probabilistic attention, road segmentation
Abstract: Event cameras offer microsecond latency and exceptional dynamic range, making them a natural fit for road segmentation in autonomous driving. Yet their impact has been limited by scarce annotations and the high cost of labeling event streams. Current solutions rely on transferring knowledge from RGB domains, but this dependence erases the very advantages that make event sensing unique.
This work breaks the dependence on RGB with an event-native self-supervised transformer architecture that learns rich event-specific semantics directly from raw, unlabeled event streams (no frames) through a polarity-guided self-supervised pretext task. To further exploit the spatiotemporal richness of event data, we propose a probabilistic attention mechanism that outperforms standard dot-product attention on this modality.
On DSEC-Semantic and DDD17, our approach achieves state-of-the-art road segmentation with orders of magnitude fewer labels. These results establish self-supervision as a scalable and label-efficient paradigm shift for event-driven vision in autonomous driving.
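The abstract does not specify the form of the probabilistic attention mechanism. As a minimal sketch only, the snippet below illustrates one common probabilistic alternative to dot-product attention: replacing dot-product similarity with a Gaussian kernel over query/key distances, with a learned bandwidth. The class name, the single-head design, and the Gaussian-kernel choice are all assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianKernelAttention(nn.Module):
    """Hypothetical single-head attention where logits are negative squared
    query/key distances scaled by a learned bandwidth (an assumed design,
    not necessarily the mechanism proposed in the paper)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned log-bandwidth acting as a temperature on the kernel.
        self.log_bandwidth = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Pairwise squared Euclidean distances between queries and keys.
        dist2 = torch.cdist(q, k, p=2).pow(2)            # (batch, tokens, tokens)
        logits = -dist2 * torch.exp(self.log_bandwidth)
        attn = F.softmax(logits, dim=-1)                  # each row is a distribution
        return attn @ v                                   # (batch, tokens, dim)

# Usage example on dummy event-feature tokens.
x = torch.randn(2, 16, 64)
out = GaussianKernelAttention(64)(x)
print(out.shape)  # torch.Size([2, 16, 64])
```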
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19200