SpeedAug: A Simple Co-Augmentation Method for Unsupervised Audio-Visual Pre-training

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Abstract: We present a speed co-augmentation method for unsupervised audio-visual pre-training. A playback speed is randomly selected and applied to both the audio and the video stream to diversify audio-visual views. Applying this augmentation reveals an interesting phenomenon: multi-modal co-augmentation entangles the data and can even shift semantic meaning (e.g., the sped-up sound of a cat can be mistaken for the sound of a mouse). This differs from the common intuition in single-modality representation learning, where samples are assumed to be invariant under augmentation. To account for this, augmented audio-visual views are modeled as a partial relationship via our proposed SoftInfoNCE loss during unsupervised pre-training. The learned representations are evaluated on three downstream tasks: action recognition and video retrieval on the UCF101 and HMDB51 datasets, and video-audio retrieval on the Kinetics-Sounds dataset. Extensive experimental results show that our method achieves a new state of the art.
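Since the submission gives no implementation details here, the following is a minimal sketch of the two ideas the abstract names, assuming PyTorch, a discrete speed set, and a diagonal-positive contrastive setup. All names (speed_co_augment, soft_info_nce, SPEEDS, tau) and the particular soft-weighting scheme are illustrative assumptions, not the authors' exact method.

import torch

SPEEDS = (0.5, 1.0, 2.0)  # assumed candidate playback speeds, not the paper's exact set

def speed_co_augment(video: torch.Tensor, audio: torch.Tensor):
    """Apply ONE randomly drawn playback speed to both modalities.

    Assumes video shaped (T, C, H, W) and mono audio shaped (L,).
    """
    s = SPEEDS[torch.randint(len(SPEEDS), (1,)).item()]
    # Video: resample frame indices; s > 1 drops frames (speed-up).
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=max(1, round(t / s))).long()
    video_aug = video[idx]
    # Audio: naive linear-interpolation resampling of the waveform;
    # a real pipeline would use a proper resampler (e.g., torchaudio).
    l = audio.shape[0]
    pos = torch.linspace(0, l - 1, steps=max(1, round(l / s)))
    lo = pos.floor().long().clamp(max=l - 1)
    hi = pos.ceil().long().clamp(max=l - 1)
    frac = pos - lo.float()
    audio_aug = audio[lo] * (1 - frac) + audio[hi] * frac
    return video_aug, audio_aug, s

def soft_info_nce(v_emb: torch.Tensor, a_emb: torch.Tensor,
                  weights: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE with soft positive weights for (B, D) embedding batches.

    weights[i] in [0, 1] discounts pair i, e.g. when co-augmentation may
    have shifted its semantics; weights == 1 recovers standard InfoNCE.
    This weighting is one plausible reading of the abstract's "partial
    relationship", not the paper's exact formulation. The symmetric
    audio-to-video term is omitted for brevity.
    """
    v = torch.nn.functional.normalize(v_emb, dim=1)
    a = torch.nn.functional.normalize(a_emb, dim=1)
    logits = v @ a.t() / tau              # (B, B) cross-modal similarities
    log_prob = logits.log_softmax(dim=1)  # row-wise contrast over the batch
    # Weighted positive log-likelihood on the diagonal (matched pairs).
    return -(weights * log_prob.diag()).mean()

In practice, the augmented clips would be cropped or padded to fixed lengths before batching, and the weights could, for instance, decrease as the sampled speed deviates from 1.0; both choices are assumptions for this sketch.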
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning