SoundAct: Learning Spatial Sound Awareness for Egocentric Robot Manipulation with Stereo Audio

Published: 16 May 2026, Last Modified: 20 May 2026
Venue: ASAB 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Robot Manipulation, Policy Learning, Multi-modal, Sound
Abstract: Humans naturally use auditory and visual cues to interact with objects beyond sight. However, most robot manipulation frameworks rely solely on vision, limiting their ability to handle audio-driven tasks (e.g., ring-off) and out-of-view events. To address this, we propose SoundAct, a sound-aware egocentric robot manipulation framework that integrates stereo microphones with a wrist-mounted camera for beyond-sight spatial audio reasoning. We encode directional cues from stereo audio as magnitude spectrograms and fuse them with visual features via an attention mechanism, enabling the policy to adapt its reliance on auditory and visual inputs. We further introduce a spatial audio augmentation method to ensure robustness under audio distractors. We evaluate our method on a beyond-sight ring-off task and demonstrate effective manipulation of sound-source objects beyond sight. Video is available at https://vision3d-lab.github.io/soundact/.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 34
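
Illustrative sketch (not from the submission): the abstract says directional cues from stereo audio are encoded as magnitude spectrograms. The snippet below shows one way a two-channel magnitude spectrogram could be computed, and how a level difference between channels remains visible in it. Everything here is an assumption made for illustration: the function names, the Hann window, the frame size `n_fft=64`, the hop of 32 samples, the 8 kHz sample rate, and the toy 1 kHz tone. It uses only the Python standard library with a naive DFT for clarity; a practical pipeline would likely use an FFT library instead, and the authors' actual preprocessing may differ.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_fft) for i in range(n_fft)]


def magnitude_spectrogram(signal, n_fft=64, hop=32):
    """Return a list of frames, each holding n_fft // 2 + 1 magnitudes."""
    window = hann(n_fft)
    n_bins = n_fft // 2 + 1
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(n_fft)]
        frame = []
        for k in range(n_bins):
            # Naive DFT of one windowed frame (O(n_fft^2)), kept for readability.
            acc = sum(
                chunk[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                for i in range(n_fft)
            )
            frame.append(abs(acc))  # keep magnitude, discard phase
        frames.append(frame)
    return frames


def stereo_magnitude_spectrogram(left, right, n_fft=64, hop=32):
    """Stack per-channel spectrograms, indexed as [channel][frame][bin]."""
    return [
        magnitude_spectrogram(left, n_fft, hop),
        magnitude_spectrogram(right, n_fft, hop),
    ]


# Toy input (hypothetical): a 1 kHz tone that is twice as loud in the left channel.
SAMPLE_RATE = 8000
NUM_SAMPLES = 256
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(NUM_SAMPLES)]
left = tone
right = [0.5 * s for s in tone]

spec = stereo_magnitude_spectrogram(left, right)
peak_bin = max(range(len(spec[0][0])), key=lambda k: spec[0][0][k])
level_ratio = spec[0][0][peak_bin] / spec[1][0][peak_bin]
```

In this toy case the left and right spectrograms peak at the same frequency bin but differ in magnitude, which is the kind of inter-channel level cue a downstream fusion module could attend to. Because magnitudes discard phase, time-of-arrival differences between the microphones are not represented in this form.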