TL;DR: We introduce OmniAudio, the first model for generating spatial audio from 360-degree video.
Abstract: Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues needed to accurately represent sound sources in 3D environments. To address this limitation, we introduce a novel task, \textbf{360V2SA}: generating spatial audio from 360-degree videos, specifically First-order Ambisonics (FOA) audio, a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create \textbf{Sphere360}, a dataset tailored for this task and curated from real-world data, and we design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose \textbf{OmniAudio}, a framework that leverages self-supervised pre-training on both spatial audio data (in FOA format) and large-scale non-spatial data. OmniAudio further features a dual-branch design that processes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance on Sphere360 across both objective and subjective metrics. Code and datasets are available at~\href{https://github.com/liuhuadai/OmniAudio}{\texttt{github.com/liuhuadai/OmniAudio}}. The project website is available at \href{https://OmniAudio-360V2SA.github.io}{\texttt{OmniAudio-360V2SA.github.io}}.
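To make the FOA format concrete: first-order Ambisonics represents a sound field with four channels (W, X, Y, Z) whose gains depend on the source direction. The sketch below is purely illustrative and is not part of the OmniAudio pipeline; it encodes a mono signal into FOA assuming the standard ACN channel ordering (W, Y, Z, X) with SN3D normalization, and the azimuth/elevation values in the usage example are arbitrary.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Encode a mono signal into first-order Ambisonics.

    Uses ACN channel order (W, Y, Z, X) with SN3D normalization.
    Returns an array of shape (4, num_samples).
    """
    az = np.deg2rad(azimuth_deg)    # azimuth: 0 = front, positive = counterclockwise (left)
    el = np.deg2rad(elevation_deg)  # elevation: 0 = horizon, positive = up
    w = mono                             # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)   # left-right axis
    z = mono * np.sin(el)                # up-down axis
    x = mono * np.cos(az) * np.cos(el)   # front-back axis
    return np.stack([w, y, z, x])

# Example: a 1 kHz tone placed 90 degrees to the listener's left, at ear level.
sr = 48_000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
foa = encode_foa(tone, azimuth_deg=90.0, elevation_deg=0.0)
print(foa.shape)  # (4, 48000)
```

Because the directional gains are encoded in the channel ratios, an FOA stream can later be decoded to headphones (binaural) or any loudspeaker layout, which is what makes it a convenient target format for 360V2SA.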
Lay Summary: Imagine watching a video where you can hear sounds coming from all around you, just like in real life. This is called spatial audio, and it makes videos feel more immersive and realistic. Our research focuses on creating this type of audio for 360-degree videos, which let you look in any direction.
We've developed a new method called OmniAudio that can generate spatial audio from 360-degree videos. This means that when you watch a 360-degree video with the audio our system creates, you'll hear sounds as if they're coming from their actual locations in the video scene.
To make this possible, we first created a large collection of 360-degree videos with matching spatial audio, called Sphere360. We then designed OmniAudio, a smart computer system that learns to understand both the full 360-degree view and a focused front view of the video. By combining these two perspectives, OmniAudio can accurately place sounds in the right locations.
Our tests show that OmniAudio performs better than existing methods, creating more realistic and accurate spatial audio for 360-degree videos. This technology could enhance various applications, from virtual reality experiences to more immersive video content, making viewers feel like they're truly part of the scene they're watching.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/liuhuadai/OmniAudio
Primary Area: Applications->Language, Speech and Dialog
Keywords: video-to-audio generation, 360-degree video-to-spatial audio generation, spatial audio generation
Submission Number: 2049