Abstract: Real-time audio streaming and processing play a crucial role in time-sensitive applications such as food delivery services and ride-hailing platforms, where rapid response is essential. However, existing server-based audio streaming architectures struggle to handle high concurrency from massive numbers of mobile devices efficiently. Traditional compression methods such as MP3 and AAC offer limited compression ratios, while deep learning-based approaches often fail to meet the real-time transmission demands of edge computing environments. In this paper, we propose a novel edge-to-server audio streaming architecture that leverages Mel filter bank spectral features to achieve ultra-high compression efficiency. Our system integrates audio denoising, Mel feature extraction, and quantization-based compression at the edge, suppressing environmental and device-induced noise while compressing audio to 0.39% of its original uncompressed size. Compared to conventional methods such as MP3, our approach further reduces the file size by 96.1%. The decompressed Mel features remain task-independent, so the server can use them for a variety of general-purpose audio processing tasks. We evaluate our system on three key audio tasks: speech recognition, speech emotion recognition, and audio classification. Extensive experiments on five different mobile devices demonstrate a 93.10% reduction in transmission latency at 1 Mbps bandwidth compared to 64 kbps MP3 audio, while keeping task performance within a 5% deviation from state-of-the-art (SOTA) models across six mainstream audio datasets. These results highlight the efficiency, robustness, and scalability of our approach for real-time edge-to-server audio processing.
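As a rough illustration of the edge-side steps named in the abstract (Mel feature extraction followed by quantization-based compression), the following is a minimal NumPy sketch, not the paper's implementation. Every parameter here is an assumption made for the example: the 16 kHz sample rate, 25 ms window, 10 ms hop, 40 Mel bands, and plain uniform 8-bit quantization are not taken from the abstract, and the denoising stage is omitted entirely. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Hz-to-Mel mapping.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_features(signal, sample_rate=16000, n_fft=400, hop=160, n_mels=40):
    """Log-Mel features, shape (n_frames, n_mels). All defaults are assumed values."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)          # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum per frame
    mel = power @ mel_filter_bank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)

def quantize(features):
    """Uniform 8-bit quantization (an assumed scheme, for illustration only)."""
    lo, hi = float(features.min()), float(features.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((features - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Server-side reconstruction of the Mel features from the 8-bit codes."""
    return codes.astype(np.float64) * scale + lo
```

In this sketch the edge device would transmit `codes` together with the two scalars `lo` and `scale`, and the server would call `dequantize` before passing the features to a downstream model. With these illustrative settings the payload is far larger than the ratio reported in the abstract; reaching that figure depends on design details the abstract does not specify.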
External IDs:doi:10.1109/tmc.2025.3634628