MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos

Published: 01 Jan 2024 · Last Modified: 23 Oct 2024 · Creativity & Cognition 2024 · CC BY-SA 4.0
Abstract: Spatial audio offers viewers a more immersive video consumption experience; however, creating and editing spatial audio is often expensive and requires specialized hardware and skills, posing a high barrier for amateur video creators. We present Mimosa, a human-AI co-creation tool that enables amateur users to computationally generate and manipulate spatial audio effects. For a video with only monaural or stereo audio, Mimosa automatically grounds each sound source to the corresponding sounding object in the visual scene and enables users to validate and fix errors in the locations of the sounding objects. Users can also augment the spatial audio effect by flexibly manipulating the sound source positions and creatively customizing the audio effect. The design of Mimosa exemplifies a human-AI collaboration approach that, instead of relying on a state-of-the-art end-to-end “black-box” ML model, uses a multistep pipeline that aligns its interpretable intermediate results with the user’s workflow. A lab study with 15 participants demonstrates Mimosa’s usability, usefulness, expressiveness, and capability in creating immersive spatial audio effects in collaboration with users.
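The abstract does not specify how the spatialization itself is computed. As a minimal illustration of the general idea of driving an audio effect from a sounding object's on-screen position, the sketch below pans a mono track across a stereo field using constant-power panning keyed to the object's horizontal position per video frame. All function names and the panning approach are illustrative assumptions, not Mimosa's actual pipeline, which would plausibly use richer spatial rendering (e.g., binaural/HRTF-based) and smoothing.

```python
# Illustrative sketch only: render a mono track to stereo with
# constant-power panning driven by a sounding object's normalized
# horizontal position in each video frame. Hypothetical, not Mimosa's method.
import numpy as np

def pan_gains(x_norm: float) -> tuple[float, float]:
    """Constant-power pan gains for x_norm in [0, 1]
    (0 = far left of the frame, 1 = far right)."""
    theta = x_norm * (np.pi / 2)           # map position to [0, pi/2]
    return np.cos(theta), np.sin(theta)    # (left_gain, right_gain)

def spatialize(mono: np.ndarray, x_positions: np.ndarray,
               sample_rate: int, fps: float) -> np.ndarray:
    """Pan mono audio per video frame.

    mono:        shape (n_samples,), float samples in [-1, 1]
    x_positions: shape (n_frames,), object's normalized x per frame
    Returns a (n_samples, 2) stereo array.
    """
    samples_per_frame = int(round(sample_rate / fps))
    stereo = np.zeros((len(mono), 2))
    for i, x in enumerate(x_positions):
        lo = i * samples_per_frame
        hi = min(lo + samples_per_frame, len(mono))
        if lo >= len(mono):
            break
        gl, gr = pan_gains(float(x))
        stereo[lo:hi, 0] = mono[lo:hi] * gl
        stereo[lo:hi, 1] = mono[lo:hi] * gr
    return stereo

# Example: a 1 kHz tone that sweeps left to right over 2 s of 30 fps video.
sr, fps, dur = 44100, 30.0, 2.0
t = np.arange(int(sr * dur)) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
xs = np.linspace(0.0, 1.0, int(fps * dur))  # object crosses the frame
stereo = spatialize(tone, xs, sr, fps)
```

In practice, per-frame gain changes would need interpolation to avoid zipper noise, and a full system would account for depth and elevation as well; this sketch only shows the core mapping from visual position to audio parameters.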