CRAFT: Cross-Representation modeling on Audio waveForms and specTrograms

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Audio representation learning, contrastive learning, audio tagging
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this paper, we introduce \underline{C}ross-\underline{R}epresentation modeling on \underline{A}udio wave\underline{F}orms and spec\underline{T}rograms (CRAFT), a representation modeling approach designed to extract joint features from diverse representations of the audio modality, using acoustic classification to showcase its effectiveness. Most prior works have focused on either the frequency-domain spectrogram or the time-domain waveform representation for acoustic modeling, and directly fusing or concatenating these individual representations leads to performance degradation. We argue, however, that when effectively aligned, these representations can complement each other and substantially improve downstream task performance. To mitigate semantic misalignment, we first propose a cross-representation contrastive learning framework that incorporates a spectrogram-waveform contrastive loss in audio pretraining. Then, to alleviate temporal misalignment, we present a cross-representation transformer architecture that jointly models spectrogram and waveform tokens through fusion bottlenecks. CRAFT is evaluated on two commonly used datasets and demonstrates superior performance. Notably, CRAFT outperforms its spectrogram-based counterpart by 4.4\% mAP on the AudioSet balanced set and achieves performance comparable to the state of the art on the full set, suggesting that alleviating semantic and temporal misalignment boosts cross-representation performance in audio modeling. All code and models will be open-sourced.
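The abstract does not spell out the two components, so the sketches below illustrate one plausible reading. The first is a minimal sketch of a cross-representation contrastive objective, assuming a CLIP-style symmetric InfoNCE loss between waveform and spectrogram embeddings of the same clip; the function name, embedding dimension, and temperature are illustrative placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def cross_representation_contrastive_loss(wave_emb, spec_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning waveform and spectrogram embeddings.

    wave_emb, spec_emb: (batch, dim) embeddings of the same audio clips,
    produced by separate waveform and spectrogram encoders (hypothetical).
    """
    wave_emb = F.normalize(wave_emb, dim=-1)
    spec_emb = F.normalize(spec_emb, dim=-1)
    logits = wave_emb @ spec_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched (waveform, spectrogram) pairs from the same clip are positives;
    # all other pairs in the batch serve as in-batch negatives.
    loss_w2s = F.cross_entropy(logits, targets)
    loss_s2w = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_w2s + loss_s2w)
```

The second sketch shows the common "fusion bottleneck" pattern the abstract's transformer description evokes: a small set of shared bottleneck tokens mediates all information exchange between the two token streams, so neither stream attends to the other's tokens directly. Layer structure, dimensions, and the averaging of bottleneck outputs are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer over spectrogram and waveform token streams (sketch)."""

    def __init__(self, dim=768, num_heads=8, num_bottleneck=4):
        super().__init__()
        self.spec_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.wave_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.num_bottleneck = num_bottleneck

    def forward(self, spec_tokens, wave_tokens, bottleneck):
        # Each stream self-attends over its own tokens plus the shared
        # bottleneck tokens; cross-stream information flows only through
        # the bottleneck, whose two updated copies are averaged.
        s = self.spec_layer(torch.cat([spec_tokens, bottleneck], dim=1))
        w = self.wave_layer(torch.cat([wave_tokens, bottleneck], dim=1))
        n = self.num_bottleneck
        spec_tokens, b_s = s[:, :-n], s[:, -n:]
        wave_tokens, b_w = w[:, :-n], w[:, -n:]
        return spec_tokens, wave_tokens, 0.5 * (b_s + b_w)
```

Restricting fusion to a few bottleneck tokens keeps attention cost close to that of two independent streams while still letting the representations exchange information at every layer.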
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3776