Abstract: Recent studies on video object detection have shown the advantages of aggregating features across frames to capture temporal information, which can mitigate appearance degradation such as occlusion, motion blur, and defocus. However, these methods typically store temporal information frame by frame in a sliding window or memory queue, so features from earlier frames are discarded over time. To address this, we propose a dual-memory feature aggregation framework (DMFA). DMFA simultaneously constructs a local feature cache and a global feature memory, both updated feature-wise rather than frame-wise, at two granularities: pixel level and proposal level. This design partially preserves key features across frames. The local feature cache stores spatio-temporal context from nearby frames to boost localization, while the global feature memory enhances semantic representation by capturing temporal information from all previous frames. Moreover, we introduce contrastive learning to improve the discriminability of temporal features, yielding more accurate proposal-level feature aggregation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the ImageNet VID benchmark.
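To make the dual-memory idea concrete, below is a minimal sketch of feature-wise memory updates and attention-style aggregation over proposal-level features of shape (N, D). All names (`DualMemory`, `update`, `aggregate`) and the specific update rule are hypothetical illustrations, not the paper's actual method, whose pixel-level cache and exact replacement strategy are not specified in the abstract.

```python
# Hypothetical sketch of dual-memory feature aggregation (not the paper's code).
import torch
import torch.nn.functional as F


class DualMemory:
    def __init__(self, dim, local_size=32, global_size=128):
        self.local = torch.empty(0, dim)   # cache: spatio-temporal context from nearby frames
        self.glob = torch.empty(0, dim)    # memory: semantic features from all previous frames
        self.local_size = local_size
        self.global_size = global_size

    def update(self, feats):
        # Local cache: FIFO over recent frames (sliding-window behavior).
        self.local = torch.cat([self.local, feats])[-self.local_size:]
        # Global memory: feature-wise update. Once full, overwrite the stored
        # feature most similar to each incoming one, instead of evicting by
        # age, so distinctive early-frame features can be partially preserved.
        if self.glob.shape[0] < self.global_size:
            free = self.global_size - self.glob.shape[0]
            self.glob = torch.cat([self.glob, feats[:free]])
            feats = feats[free:]
        for f in feats:
            sim = F.cosine_similarity(self.glob, f.unsqueeze(0), dim=1)
            self.glob[sim.argmax()] = f

    def aggregate(self, query):
        # Attention-style aggregation of current-frame features against both
        # memories: one plausible instantiation, assumed for illustration.
        bank = torch.cat([self.local, self.glob])
        attn = F.softmax(query @ bank.t() / bank.shape[1] ** 0.5, dim=1)
        return query + attn @ bank
```

The key design point this sketch illustrates is the contrast between the two stores: the local cache keeps only recent frames, while the global memory is updated per feature by similarity rather than per frame by age, so informative features from early frames are not inevitably discarded.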