SpeedAug: A Simple Co-Augmentation Method for Unsupervised Audio-Visual Pre-training

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Abstract: We present a speed co-augmentation method for unsupervised audio-visual pre-training. A playback speed is randomly selected and applied to both the audio and the video stream to diversify audio-visual views. Applying this augmentation reveals an interesting phenomenon: multi-modal co-augmentation entangles the data and can even shift semantic meaning (e.g., the sped-up sound of a cat can be mistaken for the sound of a mouse). This differs from the common intuition in single-modality representation learning, where samples are assumed to be invariant under augmentation. To account for this, augmented audio-visual views are modeled as a partial relationship via our proposed SoftInfoNCE loss during unsupervised pre-training. The learned representations are evaluated on three downstream tasks: action recognition and video retrieval on the UCF101 and HMDB51 datasets, and video-audio retrieval on the Kinetics-Sounds dataset. Extensive experimental results show that our method achieves a new state of the art.
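Since the submission gives no implementation details here, the following is a minimal sketch of the two ideas the abstract names, assuming PyTorch, a discrete speed set, and a diagonal-positive contrastive setup. All names (speed_co_augment, soft_info_nce, SPEEDS, tau) and the particular soft-weighting scheme are illustrative assumptions, not the authors' exact method.

import torch

SPEEDS = (0.5, 1.0, 2.0)  # assumed candidate playback speeds, not the paper's exact set

def speed_co_augment(video: torch.Tensor, audio: torch.Tensor):
    """Apply ONE randomly drawn playback speed to both modalities.

    Assumes video shaped (T, C, H, W) and mono audio shaped (L,).
    """
    s = SPEEDS[torch.randint(len(SPEEDS), (1,)).item()]
    # Video: resample frame indices; s > 1 drops frames (speed-up).
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=max(1, round(t / s))).long()
    video_aug = video[idx]
    # Audio: naive linear-interpolation resampling of the waveform;
    # a real pipeline would use a proper resampler (e.g., torchaudio).
    l = audio.shape[0]
    pos = torch.linspace(0, l - 1, steps=max(1, round(l / s)))
    lo = pos.floor().long().clamp(max=l - 1)
    hi = pos.ceil().long().clamp(max=l - 1)
    frac = pos - lo.float()
    audio_aug = audio[lo] * (1 - frac) + audio[hi] * frac
    return video_aug, audio_aug, s

def soft_info_nce(v_emb: torch.Tensor, a_emb: torch.Tensor,
                  weights: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE with soft positive weights for (B, D) embedding batches.

    weights[i] in [0, 1] discounts pair i, e.g. when co-augmentation may
    have shifted its semantics; weights == 1 recovers standard InfoNCE.
    This weighting is one plausible reading of the abstract's "partial
    relationship", not the paper's exact formulation. The symmetric
    audio-to-video term is omitted for brevity.
    """
    v = torch.nn.functional.normalize(v_emb, dim=1)
    a = torch.nn.functional.normalize(a_emb, dim=1)
    logits = v @ a.t() / tau              # (B, B) cross-modal similarities
    log_prob = logits.log_softmax(dim=1)  # row-wise contrast over the batch
    # Weighted positive log-likelihood on the diagonal (matched pairs).
    return -(weights * log_prob.diag()).mean()

In practice, the augmented clips would be cropped or padded to fixed lengths before batching, and the weights could, for instance, decrease as the sampled speed deviates from 1.0; both choices are assumptions for this sketch.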
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning