Keywords: object detection, multi-scale feature fusion, feature alignment, cross-level attention, lightweight attention
TL;DR: We propose a broadly applicable and lightweight semantic alignment module that improves multi-scale feature fusion via linear cross-level attention paired with a spatial bottleneck design.
Abstract: Feature fusion networks are essential components in modern object detectors, aggregating multi-scale features from hierarchical levels to detect objects of varying sizes.
However, a significant challenge is that fusing features from different levels often leads to semantic inconsistency, because each level encodes information at a different level of abstraction.
While many prior works have attempted to address this, they often incur substantial computational and parameter overhead that limits real-time applicability, and in some cases they lack generality across detection architectures.
In this work, we propose a novel lightweight semantic alignment module called Feature Interaction NEtwork (FINE).
This module refines low-level features by integrating high-level contextual cues via a cross-level attention mechanism prior to fusion.
To minimize overhead, FINE combines a kernel-based linear attention with a novel spatial bottleneck design.
This design drastically reduces the attention sequence length while preserving the channel-wise semantics essential for effective semantic alignment.
FINE is generally applicable to various detectors, including Faster R-CNN, the YOLO series, and RT-DETR, and consistently improves detection accuracy without compromising efficiency.
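The abstract's core mechanism, kernel-based linear cross-level attention over a spatially bottlenecked high-level feature map, can be illustrated with a minimal sketch. This is not the authors' FINE implementation; the function names, the ELU+1 feature map, and strided subsampling as the spatial bottleneck are all illustrative assumptions. The point is the complexity shape: low-level tokens act as queries, high-level tokens are shortened before forming keys/values, and the kernel trick replaces the quadratic attention matrix with two small matrix products.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a common positive feature map for linear attention
    # (an assumption here; the paper's kernel may differ)
    return np.where(x > 0, x + 1.0, np.exp(x))

def cross_level_linear_attention(low, high, stride=2):
    """Refine low-level tokens with high-level context.

    low  : (N, C) flattened low-level feature map (queries)
    high : (M, C) flattened high-level feature map
    stride: spatial bottleneck factor -- keep every `stride`-th
            high-level token so the K/V sequence length shrinks
            while the channel dimension C is kept intact.
    """
    kv = high[::stride]                      # spatial bottleneck on K/V
    Q, K, V = phi(low), phi(kv), kv
    # Kernelized linear attention: compute (K^T V) once, cost O(M C^2),
    # instead of the O(N M C) softmax-attention matrix.
    KV = K.T @ V                             # (C, C) context summary
    Z = Q @ K.sum(axis=0)                    # (N,) normalizer
    return (Q @ KV) / (Z[:, None] + 1e-6)    # (N, C) refined low-level tokens

# Toy shapes: an 8x8 low-level map and a 4x4 high-level map, 8 channels.
rng = np.random.default_rng(0)
low = rng.standard_normal((64, 8))
high = rng.standard_normal((16, 8))
out = cross_level_linear_attention(low, high, stride=2)
print(out.shape)  # (64, 8)
```

The output keeps the low-level spatial resolution, so it can be fused with the original low-level features afterward; only the high-level sequence is compressed, which matches the abstract's claim of shortening the attention sequence while preserving channel-wise semantics.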
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6418