End-to-End Object Detection with YOLOF

Published: 01 Jan 2024, Last Modified: 17 Apr 2025 · ICIC (7) 2024 · CC BY-SA 4.0
Abstract: Object detection is a core problem in computer vision. Non-Maximum Suppression (NMS) is a technique widely used in convolution-based detectors to suppress redundant predictions. However, the sequential nature of NMS prevents parallel execution and thus limits inference speed. Moreover, the recall of NMS-based detectors degrades in scenes with dense, heavily overlapping objects. In this paper, we propose a real-time, end-to-end detector based on YOLOF (You Only Look One-level Feature). The proposed methods introduce no additional parameters or attention mechanisms, making them practical for real-time applications. Specifically, we propose a stop-gradient strategy that trains only a portion of the parameters to address the weak supervision caused by one-to-one label assignment. We also present auxiliary losses that strengthen the supervision of negative samples during training, and use semantic anchor optimization to suppress competing anchors at the same location. These techniques allow the improved YOLOF to discard NMS within a 1 mAP gap while achieving faster inference. Our YOLOF-CSP-D53-DC5 achieves 42.7 mAP, only 0.5 mAP below the original version. Additionally, our YOLOF-R50 achieves 37.1 mAP at 38 FPS, exceeding state-of-the-art networks by more than 1.5x in inference speed.
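To illustrate why NMS limits inference speed, the sketch below implements the standard greedy NMS loop: boxes are processed in descending score order, and each kept box suppresses later boxes that overlap it above an IoU threshold. Because each iteration depends on which boxes survived the previous ones, the loop cannot be parallelized. This is a minimal illustrative sketch, not the paper's code; the box format (x1, y1, x2, y2) and threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: inherently sequential, since each kept box
    determines which remaining boxes are suppressed."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring surviving box
        keep.append(best)
        # suppress remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two overlapping boxes and one distant box: the lower-scored
# overlapping box is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]
```

An end-to-end detector such as the one proposed here removes this post-processing step entirely by training the network to emit one prediction per object, via one-to-one label assignment.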