Abstract: Neural vocoders convert time-frequency representations, such as mel-spectrograms, into the corresponding time-domain waveforms. Vocoders are essential for generative audio applications (e.g., text-to-speech and text-to-audio). This paper presents a scalable vocoder architecture for small-footprint edge devices, inspired by Vocos and adapted with XiNets and PhiNets. We evaluate the developed models qualitatively and quantitatively on single-speaker and multi-speaker datasets and benchmark inference speed and memory consumption on four microcontrollers. Additionally, we study power consumption on an ARM Cortex-M7-powered board. Our results demonstrate the feasibility of deploying neural vocoders on resource-constrained edge devices, potentially enabling new applications in Internet of Sounds (IoS) and embedded audio scenarios. Our best-performing model achieves a MOS of 3.95/5 while using 1.5 MiB of FLASH and 517 KiB of RAM, and consumes 252 mW when performing inference on a 1 s audio clip.