ConCAP: Contrastive Context-Aware Prompt for Resource-hungry Action Recognition

04 Nov 2023 · OpenReview Archive Direct Upload
Abstract: Existing large-scale image-language pre-trained models, e.g., CLIP, have shown strong spatial recognition capability on various vision tasks. However, they achieve inferior performance on action recognition due to a lack of temporal reasoning ability. Moreover, fully fine-tuning large models requires expensive computational infrastructure, and state-of-the-art video models suffer from slow inference due to high frame sampling rates. These drawbacks make existing video action recognition methods impractical for resource-hungry scenarios, which are common in the real world. In this work, we propose Contrastive Context-Aware Prompt (ConCAP) for resource-hungry action recognition. Specifically, we develop a lightweight PromptFormer that learns spatio-temporal representations on top of frozen frame-wise visual backbones, where learnable prompt tokens are inserted between frame tokens during self-attention. These prompt tokens are expected to auto-complete the contextual spatio-temporal information between frames and thereby enhance the model's representation capability. To achieve this, we align the prompt-enhanced representation with both category-level textual representations and video representations obtained from densely sampled frames. Extensive experiments on four video benchmarks show that ConCAP achieves state-of-the-art or competitive performance compared to existing methods, with far fewer trainable parameters and faster inference under limited frames, demonstrating its superiority in resource-hungry scenarios.
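
As a rough illustration of the PromptFormer idea described in the abstract, the PyTorch sketch below interleaves learnable prompt tokens between frame tokens produced by a frozen per-frame encoder, runs self-attention over the interleaved sequence, and aligns the pooled video feature with a target feature (category text feature or densely-sampled video feature) via a symmetric InfoNCE-style contrastive loss. All class names, layer counts, dimensions, and the mean-pooling choice are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFormerSketch(nn.Module):
    """Minimal sketch (assumed design): learnable prompt tokens are plugged
    between consecutive frame tokens and processed jointly by self-attention
    on top of a frozen frame-wise backbone."""

    def __init__(self, dim=512, num_frames=8, prompts_per_gap=1,
                 num_heads=8, num_layers=2):
        super().__init__()
        num_gaps = num_frames - 1
        # One (or more) learnable prompt tokens per gap between adjacent frames.
        self.prompts = nn.Parameter(torch.randn(num_gaps, prompts_per_gap, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, D) -- one token per frame from a frozen image encoder (e.g., CLIP).
        B, T, _ = frame_tokens.shape
        pieces = []
        for t in range(T):
            pieces.append(frame_tokens[:, t:t + 1, :])            # frame token t
            if t < T - 1:
                pieces.append(self.prompts[t].expand(B, -1, -1))  # prompts between t and t+1
        x = torch.cat(pieces, dim=1)   # interleaved frame + prompt sequence
        x = self.encoder(x)            # self-attention over frames and prompts
        return x.mean(dim=1)           # pooled video representation (pooling choice assumed)


def contrastive_loss(video_feat, target_feat, temperature=0.07):
    """Symmetric InfoNCE between prompt-enhanced video features and a target
    feature set (category text features or densely-sampled video features),
    both assumed to be paired row-wise within the batch."""
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(target_feat, dim=-1)
    logits = v @ t.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Usage sketch: 8 frame tokens of width 512 per clip, batch of 4 clips.
model = PromptFormerSketch(dim=512, num_frames=8)
video_feat = model(torch.randn(4, 8, 512))
loss = contrastive_loss(video_feat, torch.randn(4, 512))
```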