SING: Spatial Context in Large Language Model for Next-Gen Wearables

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
Abstract: Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create one from the LibriSpeech corpus. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with LoRA, a lightweight adaptation technique, to optimize for on-device processing. SING supports spatially aware automatic speech recognition (ASR), achieving a mean DoA error of 25.72°, a substantial improvement over the 88.52° median error reported in existing work, with a word error rate (WER) of 5.3%. SING also supports soundscaping, for example, inferring how many people are talking and from which directions, handling up to 5 speakers with a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
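The sketch below illustrates the pipeline the abstract describes: speech embeddings from a Whisper encoder and spatial features from the microstructure front-end are projected into the LLaMA-3.2 input space, and the LLM is adapted with LoRA. This is a minimal illustration, not the authors' implementation; the checkpoint names, the SpatialSpeechFusion module, the spatial feature dimension, and the LoRA hyperparameters are assumptions made for the example.

```python
# Hypothetical sketch of the fusion and LoRA adaptation described in the abstract.
# Checkpoints, dimensions, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Frozen speech front-end (assumed checkpoint) and the LLM (gated checkpoint on HF Hub).
whisper = WhisperModel.from_pretrained("openai/whisper-small").encoder.eval()
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")


class SpatialSpeechFusion(nn.Module):
    """Project Whisper embeddings and microstructure spatial features into the LLM space."""

    def __init__(self, speech_dim: int, spatial_dim: int, llm_dim: int):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, llm_dim)          # linguistic branch
        self.spatial_proj = nn.Sequential(                          # spatial branch
            nn.Linear(spatial_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, speech_emb: torch.Tensor, spatial_feat: torch.Tensor) -> torch.Tensor:
        # speech_emb: (B, T, speech_dim); spatial_feat: (B, spatial_dim) from the DoA front-end
        spatial_tok = self.spatial_proj(spatial_feat).unsqueeze(1)   # (B, 1, llm_dim)
        return torch.cat([spatial_tok, self.speech_proj(speech_emb)], dim=1)


# spatial_dim=64 is an arbitrary placeholder for the microstructure feature size.
fusion = SpatialSpeechFusion(whisper.config.d_model, 64, llm.config.hidden_size)

# Lightweight adaptation of the LLM with LoRA (rank and target modules are placeholders).
llm = get_peft_model(llm, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))


def forward_step(input_features, spatial_feat, text_ids):
    """One training step: fused audio/spatial embeddings prepended to the text tokens."""
    with torch.no_grad():
        speech_emb = whisper(input_features).last_hidden_state      # (B, T, d_model)
    fused = fusion(speech_emb, spatial_feat)                        # (B, T+1, hidden)
    text_emb = llm.get_input_embeddings()(text_ids)                 # (B, L, hidden)
    inputs_embeds = torch.cat([fused, text_emb], dim=1)
    # Mask the loss over the audio/spatial positions; supervise only the text targets.
    ignore = torch.full(fused.shape[:2], -100,
                        dtype=text_ids.dtype, device=text_ids.device)
    labels = torch.cat([ignore, text_ids], dim=1)
    return llm(inputs_embeds=inputs_embeds, labels=labels).loss
```

In this sketch, the projected spatial features are prepended as an extra embedding position and their labels are masked with -100, so the frozen speech front-end and the LoRA-adapted LLM can be trained on spatially annotated transcripts; the actual fusion and alignment layers in SING may differ.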
Lay Summary: Imagine wearing smart earbuds that could not only understand what people are saying but also know exactly where each voice is coming from, enabling revolutionary applications like automatically summarizing who said what in a meeting or helping visually impaired users navigate by identifying the direction of important sounds. Current wearable devices can't do this because traditional spatial audio systems require bulky microphone arrays that are too large and power-hungry for small wearables. We developed SING, a breakthrough system that achieves precise spatial speech understanding using a single microphone enhanced with a tiny microstructure. This microstructure creates spatial diversity in sound recordings without the need for multiple microphones, making it well suited to wearables. Our system combines this compact spatial sensing with OpenAI's Whisper speech recognition and integrates everything into a large language model (LLaMA-3.2) that can reason about both what was said and where it came from. SING dramatically improves spatial accuracy, reducing directional errors from 88.52° to just 25.72° while maintaining excellent speech recognition (a 5.3% word error rate). It can simultaneously track up to five speakers with 16° median directional accuracy, enabling applications such as spatially aware meeting transcription, sound-based navigation for accessibility, and immersive augmented reality experiences, all while running efficiently on small wearable devices and preserving privacy through on-device processing.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Language, Speech and Dialog
Keywords: Spatial Speech ASR, Direction of Arrival, Large Language Models
Submission Number: 7685