Keywords: Robot Manipulation, Multimodal Learning, Imitation Learning, Egocentric Perception
TL;DR: SoundAct uses stereo audio and egocentric vision to enable robust beyond-sight robot manipulation under partial observability.
Abstract: Humans naturally combine auditory and visual cues to interact with objects they cannot see. Most robot manipulation frameworks, however, rely solely on vision, which limits their ability to handle audio-driven tasks (e.g., turning off a ringing alarm clock) and events outside the camera's view. To address this, we propose SoundAct, a spatial sound-aware egocentric robot manipulation framework that pairs stereo microphones with an egocentric camera for beyond-sight spatial audio reasoning. We encode directional cues from stereo audio as magnitude spectrograms and fuse them with visual features via an attention mechanism, so that the policy reasons jointly over auditory and visual cues. We further introduce a spatial audio augmentation method that improves robustness to audio distractors. We evaluate our method on a beyond-sight ring-off task and show that it can effectively manipulate sound-source objects that lie outside the field of view.
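The encoding and fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the SoundAct implementation: the sample rate, FFT size, hop length, token dimension, random projection, and the helper names `stereo_magnitude_spectrogram` and `cross_attention` are all assumptions made for illustration, and the visual features are random placeholders standing in for a camera encoder.

```python
import numpy as np

def stereo_magnitude_spectrogram(wave, n_fft=512, hop=128):
    """wave: (2, T) float array -> (2, n_frames, n_fft // 2 + 1) magnitudes."""
    window = np.hanning(n_fft)
    n_frames = 1 + (wave.shape[1] - n_fft) // hop
    # Index matrix that slices overlapping frames for both channels at once.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[:, idx] * window            # (2, n_frames, n_fft)
    return np.abs(np.fft.rfft(frames, axis=-1))

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention, no learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values, weights

# Hypothetical input: a 1 kHz tone that is louder in the left channel,
# as if the sound source sat to the robot's left.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
wave = np.stack([1.0 * tone, 0.5 * tone])          # (2, T): left, right

spec = stereo_magnitude_spectrogram(wave)          # (2, n_frames, n_bins)

# One audio token per time frame: left and right magnitudes concatenated,
# then mapped to the token dimension by a random (untrained) projection.
rng = np.random.default_rng(0)
dim = 64
audio_frames = np.concatenate([spec[0], spec[1]], axis=-1)
projection = rng.normal(size=(audio_frames.shape[1], dim))
projection /= np.sqrt(audio_frames.shape[1])
audio_tokens = audio_frames @ projection

# Placeholder visual tokens (assumption: 16 patch features of size dim).
visual_tokens = rng.normal(size=(16, dim))

# Visual tokens attend over audio tokens; a residual sum keeps the visual content.
attended, weights = cross_attention(visual_tokens, audio_tokens)
fused = visual_tokens + attended
```

Keeping the two channels as separate magnitude spectrograms preserves the level difference between left and right, which is one directional cue a downstream policy could exploit. A trained system would replace the random projection with learned weights.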
Submission Number: 44