Abstract: Synthesizing binaural audio according to personalized requirements is crucial for building immersive artificial spaces. Previous methods employ visual modalities to guide audio spatialization because they provide spatial information about objects. However, this paradigm depends on object visibility and strict audiovisual correspondence, which makes it difficult to satisfy personalized requirements. In addition, the visual counterpart to the audio may be degraded or even non-existent, which greatly limits the development of the field. To this end, we advocate exploring a novel task, Text-guided Audio Spatialization (TAS), whose goal is to convert mono audio into spatial audio based on text prompts. This approach circumvents restrictive audiovisual conditions and allows for more flexible personalization. To facilitate this research, we construct the first TASBench dataset. The dataset provides dense frame-level descriptions of the spatial locations of sounding objects in audio, enabling fine-grained spatial control. Since text prompts contain multiple sounding objects and spatial locations, the core issue of TAS is to establish the mapping between text semantics and audio objects. To tackle this issue, we design a Semantic-Aware Fusion (SAF) module to capture text-aware audio features and propose a text-guided diffusion model that learns audio spatialization and generates spatial audio consistent with text prompts. Extensive experiments on TASBench compare the proposed method with several methods from related tasks, demonstrating that our method is a promising way to achieve personalized, text-controlled spatialization of audio.
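The abstract does not specify the internals of the Semantic-Aware Fusion (SAF) module. A common way to obtain "text-aware audio features" of this kind is cross-attention, where per-frame audio embeddings query token-level text embeddings. The NumPy sketch below is purely illustrative of that general technique; the function name, shapes, and residual connection are assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_aware_fusion(audio_feats, text_feats):
    """Illustrative cross-attention fusion (not the paper's exact SAF).

    audio_feats: (T, d) frame-level audio embeddings (queries)
    text_feats:  (N, d) token-level text embeddings (keys/values)
    Returns (T, d) text-aware audio features.
    """
    d = audio_feats.shape[-1]
    scores = audio_feats @ text_feats.T / np.sqrt(d)  # (T, N) frame-token affinities
    attn = softmax(scores, axis=-1)                   # each frame's weights over tokens
    fused = attn @ text_feats                         # (T, d) text context per frame
    return audio_feats + fused                        # residual fusion (assumed)

rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 64))  # e.g. 100 audio frames
text = rng.normal(size=(12, 64))    # e.g. 12 text tokens
out = semantic_aware_fusion(audio, text)
print(out.shape)  # (100, 64)
```

The key property for TAS is that the attention weights are frame-level, so each audio frame can bind to a different sounding object or location phrase in the prompt, matching the dataset's dense frame-level spatial descriptions.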
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: We introduce a new task and method to the multimodal domain, namely text-guided audio spatialization, a new multimodal interaction problem. Our experimental results show that converting mono audio into spatial audio by exploiting text prompts is a promising approach. The generated spatial audio has broad application prospects, such as augmented reality, virtual reality, video editing, and film.
Supplementary Material: zip
Submission Number: 513