AutoSFX: Automatic Sound Effect Generation for Videos

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Sound effect (SFX) generation aims to automatically produce sound waves for sounding visual objects in images or videos. Rather than learning an automatic solution to this task alone, we propose a much broader, widely applicable, and less time-consuming system, AutoSFX, i.e., one that automates sound design for videos. Our key insight is that ensuring consistency between auditory and visual information, performing seamless transitions between sound clips, and harmoniously mixing sounds that play simultaneously are crucial for creating a unified audiovisual experience. AutoSFX capitalizes on this insight by aggregating multimodal representations via cross-attention and leveraging a diffusion model to generate sound conditioned on the embedded visual information. AutoSFX also optimizes the generated sounds to render the entire soundtrack for the input video, leading to a more immersive and engaging multimedia experience. We have developed a user-friendly interface for AutoSFX that enables users to interactively engage in SFX generation for their videos according to their particular needs. To validate the capability of our vision-to-sound generation, we conducted comprehensive experiments and analyses on the widely recognized VEGAS and VGGSound test sets, yielding promising results. We also conducted a user study to evaluate the performance of the optimized soundtrack and the usability of the interface. Overall, the results show that AutoSFX provides a viable sound-landscape solution for making attractive videos.
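The fusion step described in the abstract can be pictured concretely. Below is a minimal, hypothetical sketch, not the paper's actual architecture: the module name `CrossAttentionAggregator`, all dimensions, and layer choices are our own assumptions, showing how audio latents in a diffusion model might attend to visual tokens via cross-attention, in the spirit of the described multimodal aggregation:

```python
import torch
import torch.nn as nn

class CrossAttentionAggregator(nn.Module):
    """Fuses visual features into audio latents via cross-attention.

    Illustrative sketch only: the paper does not publish this exact
    module; dimensions and layer choices here are assumptions.
    """
    def __init__(self, audio_dim=256, visual_dim=512, num_heads=8):
        super().__init__()
        # Project visual tokens into the audio latent space.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Audio latents (queries) attend to projected visual tokens (keys/values).
        self.attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_latents, visual_tokens):
        # audio_latents: (B, T_a, audio_dim); visual_tokens: (B, T_v, visual_dim)
        vis = self.visual_proj(visual_tokens)
        fused, _ = self.attn(query=audio_latents, key=vis, value=vis)
        # Residual connection preserves the original audio latent content.
        return self.norm(audio_latents + fused)
```

In a diffusion backbone, such a block would typically sit inside each denoising layer so that every refinement step of the sound latent is conditioned on the visual evidence.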
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: The relevance of our work to the conference is multifaceted. First, we propose a diffusion-based model that facilitates sound generation guided by visual information. Building on the SAM model, we explore the opportunities that visual segmentation opens up for audio generation approaches. Second, on top of the generation model, we propose a system, AutoSFX, which makes sound generation applicable to sound design and reduces human effort during the design process. Given the generated sound clips for objects in the video, AutoSFX further performs optimization to achieve seamless transitions and harmonious mixing (see the sketch below). This aligns with the conference's focus on innovative multimedia technologies and their applications. Finally, we develop a user-friendly interface that lets creators interactively engage in SFX generation for their videos according to their particular needs, e.g., specifying the objects to be sounded. In summary, AutoSFX addresses a critical need within the multimedia community for tools that streamline the sound design process, making it more accessible and efficient for creators. The evaluation results also demonstrate that AutoSFX provides a viable sound-landscape solution for making attractive videos.
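To make the "seamless transitions" goal concrete, here is an illustrative sketch of one standard way to join generated clips, an equal-power crossfade. This is not the authors' published optimization; the function name and parameters are hypothetical:

```python
import numpy as np

def crossfade_mix(clip_a, clip_b, sr=16000, fade_s=0.5):
    """Join two mono clips with an equal-power crossfade.

    Hypothetical helper: AutoSFX's actual soundtrack optimization is not
    reproduced here; this shows a common technique for seamless joins.
    """
    n = int(sr * fade_s)
    n = min(n, len(clip_a), len(clip_b))  # overlap cannot exceed either clip
    t = np.linspace(0.0, np.pi / 2, n)
    fade_out = np.cos(t)   # tail of clip_a ramps down
    fade_in = np.sin(t)    # head of clip_b ramps up
    overlap = clip_a[-n:] * fade_out + clip_b[:n] * fade_in
    return np.concatenate([clip_a[:-n], overlap, clip_b[n:]])
```

An equal-power fade keeps the summed power roughly constant across the overlap (cos² + sin² = 1), avoiding the audible loudness dip that a linear crossfade would cause.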
Supplementary Material: zip
Submission Number: 2603