Sonic VisionLM: Playing Sound with Vision Language Models

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · CVPR 2024 · CC BY-SA 4.0
Abstract: There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present Sonic VisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into the more well-studied sub-problems of aligning image-to-text and text-to-audio through popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals, and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
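The abstract decomposes video-to-audio generation into image-to-text followed by text-to-audio. The sketch below illustrates that decomposition only, using off-the-shelf stand-ins: BLIP captioning as a proxy for the VLM's sound-event recommendation and AudioLDM as the text-to-audio diffusion model. These are assumptions for illustration, not the models used in Sonic VisionLM, and the time-controlled audio adapter is not reproduced here.

```python
# Two-stage sketch of the idea in the abstract:
# (1) image-to-text: a vision-language model describes a sampled video frame,
# (2) text-to-audio: a diffusion model renders that description as a sound effect.
# BLIP and AudioLDM are illustrative stand-ins, NOT the paper's components.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import AudioLDMPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: caption a sampled frame to obtain a textual sound-event suggestion.
captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
captioner = BlipForConditionalGeneration.from_pretrained(captioner_id).to(device)

def describe_frame(frame: Image.Image) -> str:
    """Return a short caption for one video frame (proxy for the VLM's
    sound-event recommendation)."""
    inputs = processor(frame, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Stage 2: turn the description into audio with a text-to-audio diffusion model.
tta = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2").to(device)

def caption_to_audio(caption: str, seconds: float = 5.0):
    """Generate a mono 16 kHz waveform (numpy array) for the caption."""
    prompt = f"the sound of {caption}"
    return tta(prompt, num_inference_steps=50, audio_length_in_s=seconds).audios[0]

if __name__ == "__main__":
    frame = Image.open("frame_0001.png")  # one frame sampled from the silent video
    caption = describe_frame(frame)
    waveform = caption_to_audio(caption)
    print(caption, waveform.shape)
```

In the full framework, the VLM additionally proposes timestamps for each sound event and a time-controlled adapter conditions the audio diffusion model on them; the sketch omits that synchronization step.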