CoHear: Conversation Enhancement via Multi-earphone Collaboration
Abstract: In crowded social settings such as conferences, background noise, overlapping voices, and lively interactions often cause “cocktail party deafness,” which makes conversations hard to follow. Modern earphones are a promising platform for speech enhancement, but existing solutions are limited: they either operate on a single device, ignoring the multi-party nature of conversation, or rely on impractical assumptions such as fixed conversation areas and pre-recorded audio. We present CoHear, a collaborative system that leverages a network of earphones to model and enhance speech holistically at the conversation level. CoHear bridges acoustic sensor networks with deep learning for target speech extraction through two key contributions: 1) a novel, conversation-driven network that dynamically forms groups based on user interaction, using verbal and non-verbal cues (primarily head orientation) for robust, infrastructure-free coordination; and 2) a bandwidth-efficient, robust target speech extraction model that uses peer-relayed audio as conditioning signals, even under network constraints. We evaluate CoHear in both real-world experiments and simulations. Results show that the conversation network achieves more than 90% accuracy in group formation, and that CoHear improves speech quality by up to 8.8 dB over state-of-the-art baselines while running in real time on a mobile device. In a user study with 20 participants, CoHear scores substantially higher than the baseline and shows good usability.