Audio-Journey: Efficient Visual+LLM-aided Audio Encodec Diffusion

Published: 20 Jun 2023, Last Modified: 16 Jul 2023, ES-FoMO 2023 Poster
Keywords: Efficient Diffusion; Audio generation; Audio-visual fusion
TL;DR: We efficiently train an audio diffusion model with the aid of an LLM.
Abstract: Despite recent progress, machine learning for the audio domain is limited by the availability of high-quality data. Visual information already present in a video can complement the information in its audio. In this paper, we leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich the existing weak labels of an audio dataset into fuller captions; we adopt a SOTA video-captioning model to automatically generate video captions; and we again use LLMs to merge the audio and visual captions into a rich, large-scale dataset. Using this dataset, we train a latent diffusion model on Encodec embeddings. Furthermore, we use the trained diffusion model to generate additional audio data in the same format. In our experiments, we first verify that our audio+visual captions are of high quality compared with baselines and ground truth (12.5\% gain in semantic score over baselines). Moreover, we demonstrate that a classifier can be trained from scratch on the diffusion-generated data, and that diffusion-generated data can enhance classification models on the AudioSet test set, working in conjunction with mixup or other augmentation methods for impressive performance gains. Our approach exemplifies a promising method for augmenting low-resource audio datasets. The samples, models, and implementation will be at \url{}.
Submission Number: 17
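
The abstract mentions combining diffusion-generated data with mixup but does not say how the two are paired. The following is a minimal sketch of standard mixup applied between a real example and a randomly chosen generated example, under that pairing assumption. The function names, the Beta parameter `alpha=0.5`, and the list-based feature and multi-hot label representation are all hypothetical and are not taken from the paper.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.5, rng=None):
    """Mix two (features, multi-hot label) examples.

    Standard mixup: draw lam ~ Beta(alpha, alpha) and take the same convex
    combination of the features and of the labels.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def mixup_real_with_generated(real, generated, alpha=0.5, seed=0):
    """Pair each real example with a random generated one and mix them.

    Assumption (not from the paper): every real example is mixed with exactly
    one generated example; `real` and `generated` are lists of (x, y) tuples.
    """
    rng = random.Random(seed)
    mixed = []
    for x_r, y_r in real:
        x_g, y_g = rng.choice(generated)
        x, y, _ = mixup_pair(x_r, y_r, x_g, y_g, alpha, rng)
        mixed.append((x, y))
    return mixed
```

In practice the features would be spectrograms or waveforms held in array form rather than Python lists, and the mixed labels would be used as soft targets for a multi-label loss.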