Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving
Abstract: Autonomous driving systems rely heavily on the underlying perception module, which must be both accurate and efficient to allow precise decisions in real time. Avoiding collisions with pedestrians is a top priority in any autonomous driving system, so pedestrian detection is a core part of such systems' perception modules. Current state-of-the-art pedestrian detectors have two major issues. First, they have long inference times, which reduce the efficiency of the whole perception module. Second, they perform poorly on small and heavily occluded pedestrians. We propose Localized Semantic Feature Mixers (LSFM), a novel, anchor-free pedestrian detection architecture. It uses our novel Super Pixel Pyramid Pooling module for feature encoding instead of the computationally costly Feature Pyramid Networks. Moreover, our MLPMixer-based Dense Focal Detection Network serves as a lightweight detection head, reducing computational effort and inference time compared to existing approaches. To boost the performance of the proposed architecture, we adapt and use mixup augmentation, which improves performance especially on small and heavily occluded pedestrians. We benchmark LSFM against the state of the art on well-established traffic scene pedestrian datasets. The proposed LSFM achieves state-of-the-art performance on the Caltech, City Persons, Euro City Persons, and TJU-Traffic-Pedestrian datasets while reducing inference time by 55% on average. Further, LSFM surpasses the human baseline for the first time in pedestrian detection. Finally, we conduct a cross-dataset evaluation, which shows that the proposed LSFM generalizes well to unseen data.
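
The abstract mentions mixup augmentation only in passing and does not say how it is adapted. As a rough illustration of the general idea, the Python sketch below blends two equally sized images and keeps the boxes of both, as in the commonly used detection variant of mixup. The function name `mixup_detection`, the `alpha` value, and the per-box weight are assumptions made for illustration, not details of LSFM.

```python
import random


def mixup_detection(img_a, boxes_a, img_b, boxes_b, alpha=1.5, rng=None):
    """Generic detection-style mixup (illustrative, not the LSFM variant).

    img_a, img_b: equally sized 2D lists of pixel intensities.
    boxes_a, boxes_b: lists of (x1, y1, x2, y2) tuples.
    Returns the blended image, the combined boxes, and the mixing ratio.
    """
    rng = rng or random.Random()

    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same shape")

    # Mixing ratio drawn from a symmetric Beta distribution (alpha is assumed).
    lam = rng.betavariate(alpha, alpha)

    # Pixel-wise convex combination of the two images.
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]

    # Keep every box from both images; the fifth entry is the weight of
    # the source image, which a loss function could use (an assumption).
    boxes = [(*box, lam) for box in boxes_a]
    boxes += [(*box, 1.0 - lam) for box in boxes_b]
    return mixed, boxes, lam
```

A call such as `mixup_detection(img_a, [(0, 0, 1, 1)], img_b, [(1, 1, 2, 2)])` returns an image in which pedestrians from both inputs appear semi-transparently. That partial visibility is one plausible reason this kind of augmentation could help with occluded cases.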