Keywords: text-to-speech, audio generation, neural audio codecs, codec LMs, co-design
TL;DR: We propose multiple codec-LM co-design techniques that collectively improve both the efficiency and TTS performance of neural codec LMs.
Abstract: Neural codec language models (or _codec LMs_) are emerging as a powerful framework for text-to-speech (TTS) and other audio generation tasks. These models leverage advancements in language modeling and high-fidelity residual vector quantization (RVQ)-based audio codecs, which compress continuous waveforms into discrete codes for LMs to process. Despite the close interdependence of codecs and LMs in these systems, research on codecs and LMs has largely remained siloed. In this work, we bridge this gap by proposing several codec-LM co-design strategies and analyzing their effects on end-to-end TTS performance and efficiency. Specifically, we introduce three complementary techniques: (i) a _frame-wise codec encoder_ that improves both LM log-likelihood and end-to-end TTS metrics, (ii) _LM codebook level dropout_, a method to efficiently navigate a portion of the codec-LM design space by training a single LM, and (iii) _increased codec frame duration_, which we show can accelerate inference while maintaining end-to-end performance. Our experiments demonstrate that combining all three co-design techniques doubles inference speed and improves intelligibility, audio quality, and speaker control in TTS relative to a siloed baseline.
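Of the three techniques, codebook level dropout is the most self-contained to illustrate. The following is a minimal, hypothetical sketch (not the paper's implementation): an RVQ codec emits K codebook levels per frame, and at each training step only the first k ≤ K levels are kept, so a single LM learns to operate at every codebook depth from 1 to K. The function name and data layout here are illustrative assumptions.

```python
import random

def codebook_level_dropout(frames, max_levels, rng=random):
    """Truncate RVQ codes to a randomly sampled codebook depth.

    frames: list of per-frame code lists, each of length max_levels.
    Returns the truncated frames and the sampled depth k, so the LM
    is trained on targets covering every depth 1..max_levels.
    (Illustrative sketch; not the paper's actual implementation.)
    """
    k = rng.randint(1, max_levels)  # sample a depth for this training step
    return [codes[:k] for codes in frames], k

# Example: 3 frames, 4 RVQ levels each
frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
truncated, k = codebook_level_dropout(frames, max_levels=4)
assert all(len(codes) == k for codes in truncated)
```

Because each step trains the LM at a different depth, one trained model can later be evaluated at any codebook count, which is how a single training run can sweep that slice of the codec-LM design space.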
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5407