WavBriVL: Robust Audio Representation and Generation of Audio Driven Diffusion Models


17 Apr 2023 | ACL ARR 2023 April Blind Submission | Readers: Everyone
Abstract: Large multimodal models have been recognized for their strong performance across a variety of downstream tasks, and their development is an important step toward general artificial intelligence. In this paper, we propose WavBriVL, a novel audio representation learning method based on Bridging-Vision-and-Language (BriVL). WavBriVL embeds audio, image, and text into a shared space, enabling a variety of multimodal applications. Our approach addresses major challenges in robust audio representation learning and effectively captures the correlation between audio and image. We also present a qualitative evaluation of images generated from WavBriVL, which highlights the potential of our approach for creating images from audio. Overall, our experimental results demonstrate the efficacy of WavBriVL in downstream tasks and its ability to generate appropriate images from audio. The proposed approach has potential applications in areas such as speech recognition, music signal processing, and captioning systems. We highlight that WavBriVL is the first universal method for generating images from audio with audio-driven diffusion models.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
