Unsupervised Image-to-Video Adaptation via Category-aware Flow Memory Bank and Realistic Video Generation
Abstract: Image-to-Video adaptation trains a model on labeled images and unlabeled videos so that the model can classify unlabeled videos.
Recent work synthesizes videos from still images to mitigate the modality gap between images and videos. However, the synthesized videos are unrealistic because camera movements are simulated only in 2D space. We therefore generate realistic videos by simulating arbitrary camera movements in 3D scenes and train the model on the generated source videos.
Unfortunately, the optical flows extracted from the generated videos have an unexpected negative impact and lead to suboptimal performance. To address this issue, we propose the Category-aware Flow Memory Bank, which replaces the optical flows of source videos with real target flows; the newly composed videos are beneficial for training.
In addition, we leverage the video pace prediction task to enhance the model's speed awareness, addressing its poor performance on categories with similar appearance but significantly different speeds. Our method achieves state-of-the-art or competitive performance on three Image-to-Video benchmarks.
Primary Subject Area: [Content] Media Interpretation
Relevance To Conference: Training a video recognition model from scratch is costly and time-consuming. However, numerous image datasets are available, and images are easier to collect and annotate than videos. Image-to-Video adaptation has therefore become an active research direction in multimedia. We propose a novel method that trains a video recognition model using labeled images and unlabeled videos, and the trained model can classify unlabeled videos effectively. Specifically, we simulate camera movements in 3D space and save the new-perspective images as video frames. The generated videos are more realistic and beneficial for training a discriminative spatio-temporal model. To mitigate the significant discrepancy between the flow data of source and target videos, we construct a Category-aware Flow Memory Bank and replace the optical flows of source videos with real target flows under the guidance of pseudo labels. The newly composed videos greatly improve the performance of the model. Finally, we leverage the video pace prediction task to enhance the model's speed awareness, addressing its poor performance on categories with similar appearance but significantly different speeds. We have validated the effectiveness of our method through extensive experiments. In particular, we achieve state-of-the-art performance on the challenging E→H task.
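The following is a minimal sketch of how a still image could be turned into a clip by simulating camera motion in 3D, assuming a per-pixel depth map is available (e.g., from a monocular depth estimator). The function names, camera trajectory, and forward-splatting renderer are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def render_new_view(img, depth, K, R, t):
    """Back-project pixels to 3D with the depth map, move the camera by
    rotation R and translation t, and re-project to form a new frame.
    Simple forward splatting; disoccluded pixels stay black."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(np.float64)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # 3D points, camera frame
    pts = R @ pts + t.reshape(3, 1)                       # simulated camera motion
    uvw = K @ pts
    uv = np.round(uvw[:2] / np.clip(uvw[2:], 1e-6, None)).astype(int)
    ok = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h) & (pts[2] > 0)
    out = np.zeros_like(img)
    out[uv[1, ok], uv[0, ok]] = img.reshape(-1, img.shape[-1])[ok]
    return out

def synthesize_video(img, depth, K, num_frames=16, max_shift=0.05):
    """Generate a clip by sliding the camera along a smooth 3D path
    (rotation could be varied the same way; identity is used here)."""
    frames = []
    for i in range(num_frames):
        a = i / max(1, num_frames - 1)
        t = np.array([max_shift * a, 0.0, 0.5 * max_shift * a])  # illustrative path
        frames.append(render_new_view(img, depth, K, np.eye(3), t))
    return np.stack(frames)
```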
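A minimal sketch of the Category-aware Flow Memory Bank, assuming per-clip pseudo labels with confidences are produced elsewhere in the pipeline; the class capacity and confidence threshold below are assumptions for illustration.

```python
import random
from collections import defaultdict, deque

class CategoryFlowMemoryBank:
    """Keep recent optical flows from target videos, keyed by pseudo label,
    and sample a real target flow to replace the flow of a generated clip."""

    def __init__(self, capacity_per_class=128):
        self.bank = defaultdict(lambda: deque(maxlen=capacity_per_class))

    def update(self, flow_clip, pseudo_label, confidence, threshold=0.8):
        # Store only flows whose pseudo label is confident enough.
        if confidence >= threshold:
            self.bank[pseudo_label].append(flow_clip)

    def sample(self, source_label):
        flows = self.bank.get(source_label)
        if not flows:
            return None  # caller falls back to the synthetic flow
        return random.choice(list(flows))
```

In such a scheme, a composed training sample would keep the generated RGB frames while its flow stream uses a sampled real target flow of the same (pseudo-labeled) category.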
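A minimal sketch of the video pace prediction auxiliary task; the pace set, clip length, and linear head are illustrative placeholders rather than the paper's configuration.

```python
import random
import torch
import torch.nn as nn

PACES = [1, 2, 4]  # illustrative sampling strides (normal, 2x, 4x speed)

def sample_clip_with_pace(video, clip_len=16):
    """Draw a clip at a random pace; the label is the pace index.
    `video` is a (T, C, H, W) tensor assumed long enough for the slowest pace."""
    pace_idx = random.randrange(len(PACES))
    stride = PACES[pace_idx]
    max_start = max(1, video.shape[0] - clip_len * stride)
    start = random.randrange(max_start)
    idx = torch.arange(start, start + clip_len * stride, stride)
    idx = idx.clamp(max=video.shape[0] - 1)  # guard against short videos
    return video[idx], pace_idx

# Auxiliary pace head on top of the backbone's clip feature (dims are placeholders).
feat_dim, batch = 512, 8
pace_head = nn.Linear(feat_dim, len(PACES))
features = torch.randn(batch, feat_dim)               # stand-in for backbone output
pace_targets = torch.randint(len(PACES), (batch,))
pace_loss = nn.functional.cross_entropy(pace_head(features), pace_targets)
```

The pace loss would be added to the classification and adaptation objectives so that clips with similar appearance but different speeds are separated in feature space.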
Supplementary Material: zip
Submission Number: 2362