Abstract: The Detection Transformer (DETR) significantly simplified the matching process in object detection by incorporating the Hungarian algorithm, which finds an optimal one-to-one matching between predicted bounding boxes and ground-truth annotations during training. While effective, this strict matching does not inherently account for varying object densities and distributions, and it can produce suboptimal correspondences, such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that uses the cost of aligning predictions with ground truths to find the most accurate correspondences between the two sets. By using the differentiable Sinkhorn algorithm, RTP allows soft, fractional matching rather than strict one-to-one assignment, which improves the model's ability to handle varying object densities and distributions. Extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach: RTP-DETR surpasses Deform-DETR and the recently introduced DINO-DETR, with absolute mAP gains of +3.8% and +1.7%, respectively.
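The soft matching described in the abstract can be illustrated with a minimal Sinkhorn sketch in NumPy. This is not the authors' implementation: the function name `sinkhorn_plan`, the uniform masses on predictions and ground truths, the regularization strength `eps`, the iteration count, and the toy cost matrix are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropy-regularized transport plan between predictions (rows)
    and ground truths (columns). Hypothetical helper, for illustration only."""
    n_pred, n_gt = cost.shape
    a = np.full(n_pred, 1.0 / n_pred)  # assumed: uniform mass on predictions
    b = np.full(n_gt, 1.0 / n_gt)      # assumed: uniform mass on ground truths
    K = np.exp(-cost / eps)            # Gibbs kernel of the matching cost
    u = np.ones(n_pred)
    v = np.ones(n_gt)
    for _ in range(n_iters):
        u = a / (K @ v)                # rescale rows toward marginal a
        v = b / (K.T @ u)              # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Toy cost matrix (assumed values): 4 predictions, 2 ground-truth boxes.
cost = np.array([[0.1, 0.9],
                 [0.2, 0.8],
                 [0.9, 0.1],
                 [0.7, 0.6]])
plan = sinkhorn_plan(cost)
print(plan.round(3))
```

Each row of `plan` spreads one prediction's mass over the ground truths, so an ambiguous prediction (the last row here) is shared fractionally instead of being forced into a single hard assignment as in Hungarian matching. Every step is a differentiable array operation, which is what allows such a plan to sit inside a training loss.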
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Our work advances object detection by optimising and evaluating the Detection Transformer (DETR) model, in particular through the novel use of optimal transport as the matching algorithm. The approach has been extensively tested on several image datasets, yielding a comprehensive performance analysis that sets new benchmarks for accuracy and efficiency in this domain. Incorporating transport plan theory into DETR's matching process improves both the precision of object detection and the efficiency of the model. We believe our findings contribute to the "Multimedia Content Understanding" theme, specifically within the areas of "Vision and Language" and "Multimedia Interpretation". Our detector strengthens the foundational technology needed for complex visual data interpretation, which is crucial for developing increasingly advanced multimedia applications. It also has a broader impact on multimedia systems by contributing to their efficiency and reliability, with potential applications in real-time processing and interpretation of visual data.
Supplementary Material: zip
Submission Number: 5045