EquiAV: Single-modal Equivariance Promotes Audio-Visual Contrastive Learning

Jongsuk Kim; Hyeongkeun Lee; Kyeongha Rho; Junmo Kim; Joon Son Chung

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024

Primary Area: representation learning for computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Audio-Visual Contrastive Learning, Multimodal Representation Learning, Equivariant Contrastive Learning

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We present EquiAV, a novel framework that integrates single-modal equivariant contrastive learning with audio-visual self-supervised learning.

Abstract: Advances in audio-visual representation learning have demonstrated its effectiveness in acquiring rich and comprehensive representations by leveraging both auditory and visual modalities. Recent works have attempted to improve performance using contrastive learning or masked modeling techniques. However, efforts to maximize the impact of data augmentations for learning semantically rich representations have remained relatively limited. Without a proper strategy for utilizing data augmentation, the model can be adversely affected or fail to achieve sufficient performance gains. To address this limitation, we present EquiAV, a novel framework that integrates single-modal equivariant contrastive learning with audio-visual contrastive learning. In the proposed framework, audio-visual correspondence and rich modality-specific representations are learned in separate latent spaces. In particular, augmentation-related and modality-specific information is learned in the intra-modal latent space by making the representations equivariant to data augmentation. Extensive ablation studies verify that our framework is the most suitable architecture for maximizing the benefits of augmentation while ensuring model robustness to strong augmentation. EquiAV outperforms existing audio-visual self-supervised pre-training methods on audio-visual event classification and zero-shot audio-visual retrieval tasks.
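The two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss form, so InfoNCE is assumed here, and the linear augmentation predictor, its `weights`, the `aug_params` encoding, and the `intra_weight` balance are all hypothetical names introduced for illustration. The sketch handles the intra-modal term for a single modality; a full model would presumably apply it to audio and visual inputs separately.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match targets[i] among all targets."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def predict_augmented(embedding, aug_params, weights):
    """Hypothetical linear augmentation predictor (an assumption, not from the source).

    Maps the original input's embedding, concatenated with a vector describing
    the applied augmentation, to a predicted embedding of the augmented input.
    """
    joined = list(embedding) + list(aug_params)
    return [sum(w * x for w, x in zip(row, joined)) for row in weights]


def combined_loss(audio_inter, visual_inter, orig_intra, aug_intra,
                  aug_params, weights, intra_weight=1.0):
    """Inter-modal correspondence loss plus an intra-modal equivariance loss.

    The two terms use embeddings from separate latent spaces, as the abstract
    describes; how they are weighted is an assumption of this sketch.
    """
    # Inter-modal: paired audio and visual embeddings should match, both ways.
    inter = 0.5 * (info_nce(audio_inter, visual_inter)
                   + info_nce(visual_inter, audio_inter))
    # Intra-modal: the predicted embedding should match the embedding that the
    # encoder actually produces for the augmented input (equivariance).
    predicted = [predict_augmented(e, p, weights)
                 for e, p in zip(orig_intra, aug_params)]
    intra = info_nce(predicted, aug_intra)
    return inter + intra_weight * intra
```

The key contrast with invariance-based objectives is in the intra-modal term: the embedding of the augmented input is not pulled toward the original embedding directly, but toward a prediction conditioned on the augmentation parameters, so augmentation-related information can be retained in that latent space.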

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 5149
