Mapache: Masked Parallel Transformer for Advanced Speech Editing and Synthesis

Published: 13 Apr 2024 · Last Modified: 21 May 2025 · ICASSP 2024 · CC BY 4.0
Abstract: Recent advances in generative AI, such as scaled Transformer large language models (LLMs) and diffusion decoders, have revolutionized speech synthesis. Because speech combines the complexity of natural language with the high dimensionality of audio, many recent models rely on autoregressive modeling of quantized speech tokens. This approach restricts synthesis to left-to-right generation, making such models ill-suited to speech edits that must be free of audible discontinuities. We introduce Mapache, a novel architecture that combines a non-autoregressive masked speech language model with acoustic diffusion modeling, yielding a fully parallel pipeline. Mapache excels at precise speech editing that is indiscernible to human listeners, with inpainting and zero-shot synthesis capabilities that rival or surpass those of state-of-the-art models specializing in only one of these tasks. This paper also sheds light on optimizing the decoding process for such non-autoregressive models.
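The abstract does not spell out Mapache's decoding procedure, but non-autoregressive masked language models are commonly decoded with iterative confidence-based unmasking in the style of MaskGIT. The sketch below illustrates that general technique, not the authors' exact method: all names (`parallel_masked_decode`, `mask_id`, the toy model, the cosine schedule) are illustrative assumptions.

```python
import math
import torch

@torch.no_grad()
def parallel_masked_decode(model, tokens, mask_id, num_steps=8):
    """Fill all masked positions over `num_steps` parallel passes,
    committing the most confident predictions first (cosine schedule).
    A generic MaskGIT-style sketch, not Mapache's published algorithm."""
    tokens = tokens.clone()                       # (batch, seq_len)
    still_masked = tokens == mask_id
    n_masked0 = still_masked.sum(dim=-1)          # initial masked count
    for step in range(1, num_steps + 1):
        logits = model(tokens)                    # (batch, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)   # per-position confidence
        conf = conf.masked_fill(~still_masked, float("-inf"))
        # Cosine schedule: fraction of the original masks left after this step.
        frac_left = math.cos(0.5 * math.pi * step / num_steps)
        for b in range(tokens.size(0)):
            n_keep = int(n_masked0[b].item() * frac_left)
            n_reveal = int(still_masked[b].sum().item()) - n_keep
            if n_reveal <= 0:
                continue
            idx = conf[b].topk(n_reveal).indices  # most confident masked slots
            tokens[b, idx] = pred[b, idx]
            still_masked[b, idx] = False
    return tokens

# Toy usage: a randomly initialized stand-in for the masked speech LM.
vocab, mask_id, seq_len = 1024, 1024, 50
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab + 1, 64),            # +1 for the mask token
    torch.nn.Linear(64, vocab),
)
tokens = torch.randint(0, vocab, (1, seq_len))
tokens[0, 10:30] = mask_id                        # span to edit / inpaint
edited = parallel_masked_decode(model, tokens, mask_id)
```

Because every pass predicts all masked positions at once, editing a span in the middle of an utterance costs the same handful of forward passes as generating a suffix, which is what makes discontinuity-free edits feasible; an autoregressive decoder would instead have to regenerate everything to the right of the edit.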