Abstract: Video-language pre-training (VLP) has attracted increasing attention for cross-modality understanding tasks. To enhance visual representations, recent works adopt transformer-based architectures as video encoders. These works usually focus on the visual representations of the sampled frames. Compared with frame representations, frame patches carry more fine-grained spatio-temporal information, which could lead to a better understanding of video content. However, how to exploit the spatio-temporal information within frame patches for VLP has been less investigated. In this work, we propose a method to learn tube tokens that model the key spatio-temporal information from frame patches. To this end, multiple semantic centers are introduced to focus on the underlying patterns of frame patches. Based on each semantic center, the spatio-temporal information within frame patches is integrated into a unique tube token. Complementary to frame representations, tube tokens provide detailed clues about video content. Furthermore, to better align the generated tube tokens with the contents of descriptions, a local alignment mechanism is introduced. Experiments on a variety of downstream tasks demonstrate the effectiveness of the proposed method.
External IDs: dblp:journals/tmm/ZhuLZYWGCYJ23
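The abstract does not specify how tube tokens are computed; the following is a minimal sketch under the assumption that each learnable semantic center acts as a query that attends over all frame-patch tokens (flattened across time), yielding one tube token per center. The class name `TubeTokenAggregator` and all hyperparameters are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class TubeTokenAggregator(nn.Module):
    """Illustrative sketch (not the paper's exact method): aggregate
    frame-patch tokens into K tube tokens via cross-attention between
    learnable semantic-center queries and spatio-temporal patch tokens."""

    def __init__(self, dim: int = 768, num_centers: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable semantic centers; each center yields one tube token.
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, T * P, D) -- patches from all sampled frames,
        # flattened over time so each center attends spatio-temporally.
        b = patch_tokens.size(0)
        queries = self.centers.unsqueeze(0).expand(b, -1, -1)      # (B, K, D)
        tube_tokens, _ = self.attn(queries, patch_tokens, patch_tokens)
        return self.norm(tube_tokens)                               # (B, K, D)

# Usage: 2 videos, 8 frames of 196 patches each, 768-dim features.
patches = torch.randn(2, 8 * 196, 768)
tubes = TubeTokenAggregator()(patches)
print(tubes.shape)  # torch.Size([2, 8, 768])
```

Such tube tokens could then be matched against phrase-level text features for the local alignment mentioned in the abstract, but that step is likewise left unspecified here.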