Progressive Sparse Local Attention for Video Object Detection

Chaoxu Guo, Bin Fan, Jie Gu, Qian Zhang, Shiming Xiang, Véronique Prinet, Chunhong Pan

31 Jan 2020, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Transferring image-based object detectors to the domain of videos remains a challenging problem. Previous efforts mostly exploit optical flow to propagate features across frames, aiming to achieve a good trade-off between accuracy and efficiency. However, introducing an extra model to estimate optical flow can significantly increase the overall model size. The gap between optical flow and high-level features can also hinder it from establishing spatial correspondence accurately. Instead of relying on optical flow, this paper proposes a novel module called Progressive Sparse Local Attention (PSLA), which establishes the spatial correspondence between features across frames in a local region with progressively sparser stride and uses the correspondence to propagate features. Based on PSLA, Recursive Feature Updating (RFU) and Dense Feature Transforming (DenseFT) are proposed to model temporal appearance and enrich feature representation respectively in a novel video object detection framework. Experiments on ImageNet VID show that our method achieves the best accuracy compared to existing methods with smaller model size and acceptable runtime speed.
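The core idea of PSLA, attending from each position in the current frame to a sparse set of nearby positions in another frame and using the resulting weights to propagate features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `progressive_offsets` and `sparse_local_attention`, the stride schedule `(1, 2, 4)`, the dot-product similarity, the softmax normalisation, and the zero padding at the borders are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def progressive_offsets(strides=(1, 2, 4)):
    """Hypothetical sampling pattern: one ring of 8 offsets per stride,
    so sampled positions get sparser as they move away from the centre."""
    offsets = [(0, 0)]
    for s in strides:
        for dy in (-s, 0, s):
            for dx in (-s, 0, s):
                if (dy, dx) != (0, 0):
                    offsets.append((dy, dx))
    return offsets


def sparse_local_attention(feat_cur, feat_prev, offsets):
    """Align feat_prev to feat_cur; both have shape (C, H, W).

    For every position in feat_cur, compare it (dot product, an assumption)
    with feat_prev at each offset, softmax the scores over the offsets,
    and return the weighted sum of the sampled feat_prev vectors.
    """
    _, height, width = feat_cur.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    # Zero padding: out-of-bounds neighbours contribute zero vectors.
    padded = np.pad(feat_prev, ((0, 0), (pad, pad), (pad, pad)))
    # neighbours[k, :, y, x] == feat_prev[:, y + dy_k, x + dx_k]
    neighbours = np.stack([
        padded[:, pad + dy:pad + dy + height, pad + dx:pad + dx + width]
        for dy, dx in offsets
    ])                                                   # (K, C, H, W)
    scores = (neighbours * feat_cur[None]).sum(axis=1)   # (K, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights[:, None] * neighbours).sum(axis=0)   # (C, H, W)


rng = np.random.default_rng(0)
cur = rng.standard_normal((8, 16, 16))
prev = rng.standard_normal((8, 16, 16))
aligned = sparse_local_attention(cur, prev, progressive_offsets())
print(aligned.shape)
```

With this assumed pattern, 25 positions cover a 9 x 9 neighbourhood instead of the 81 a dense window would need, which is the kind of saving a progressively sparser stride is meant to provide.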
