TinyVocos: Neural Vocoders on MCUs

Published: 01 Jan 2024 · Last Modified: 04 Nov 2024 · IS2 2024 · CC BY-SA 4.0
Abstract: Neural vocoders convert time-frequency representations, such as mel-spectrograms, into the corresponding time-domain waveforms. Vocoders are essential for generative audio applications (e.g., text-to-speech and text-to-audio). This paper presents a scalable vocoder architecture for small-footprint edge devices, inspired by Vocos and adapted with XiNets and PhiNets. We evaluate the resulting models qualitatively and quantitatively on single-speaker and multi-speaker datasets, and benchmark inference speed and memory consumption on four microcontrollers. Additionally, we study power consumption on an ARM Cortex-M7-powered board. Our results demonstrate the feasibility of deploying neural vocoders on resource-constrained edge devices, potentially enabling new applications in Internet of Sounds (IoS) and embedded audio scenarios. Our best-performing model achieves a MOS of 3.95/5 while using 1.5 MiB of FLASH and 517 KiB of RAM, and consumes 252 mW when synthesizing a 1 s audio clip.
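As a rough illustration of the Vocos-style approach the abstract describes (a convolutional backbone predicts per-frame STFT magnitude and phase, and the waveform is recovered with a single inverse STFT rather than transposed-convolution upsampling), here is a minimal PyTorch sketch. The class name `TinyISTFTHead`, the two-layer Conv1d backbone, and the parameters (`n_fft=1024`, `hop=256`, `n_mels=80`) are illustrative assumptions, not taken from the paper; the original Vocos uses a ConvNeXt backbone, which this paper replaces with XiNet/PhiNet blocks.

```python
# Hypothetical, minimal Vocos-style ISTFT vocoder head (not the authors' code).
import torch
import torch.nn as nn

class TinyISTFTHead(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        n_bins = n_fft // 2 + 1
        # Stand-in backbone; the paper would use XiNet/PhiNet blocks here.
        self.backbone = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.GELU(),
        )
        # Project to per-bin log-magnitude and phase.
        self.proj = nn.Conv1d(hidden, 2 * n_bins, kernel_size=1)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, mel):                            # mel: (B, n_mels, T)
        h = self.backbone(mel)
        log_mag, phase = self.proj(h).chunk(2, dim=1)  # each: (B, n_bins, T)
        spec = torch.exp(log_mag) * torch.exp(1j * phase)  # complex STFT
        # A single inverse STFT maps frames back to samples (no upsampling stack).
        return torch.istft(spec, self.n_fft, hop_length=self.hop,
                           window=self.window)         # (B, samples)

vocoder = TinyISTFTHead()
audio = vocoder(torch.randn(2, 80, 64))  # 64 mel frames -> ~16k samples
print(audio.shape)
```

Avoiding transposed convolutions in favor of one ISTFT is what keeps the compute and peak RAM low enough for MCU deployment, since the network only ever operates at frame rate rather than sample rate.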