Sylber 2.0: A Universal Syllable Embedding

ICLR 2026 Conference Submission 20999 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Self-supervised learning, representational learning, speech processing, speech coding
TL;DR: Sylber 2.0 is a universal syllable embedding that compresses speech audio at ~5 Hz and reconstructs it with high fidelity across languages and speaking styles.
Abstract: Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this, we present Sylber 2.0, a universal framework for coding speech at the syllable level, enabling efficient temporal compression and high-fidelity reconstruction across multiple languages and expressive styles. Building on the original Sylber, Sylber 2.0 improves both linguistic coverage and reconstruction quality by training on diverse multilingual speech and by introducing a syllable-level acoustic encoder and vocoder. Sylber 2.0 achieves a very low token frequency of around 5 Hz while retaining both linguistic and acoustic detail. Experiments show that it performs on par with previous models operating at much higher token frequencies, and that it outperforms the original Sylber by a significant margin. We further demonstrate the efficacy of Sylber 2.0 by training a text-to-speech model, which achieves performance comparable to or better than current SOTA models using only 560 hours of data and 72M parameters. In sum, we establish an effective syllable-level abstraction for general spoken language.
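To put the ~5 Hz token rate from the abstract in perspective, a minimal sketch of the resulting token budget is shown below. The 50 Hz baseline rate is an assumption chosen only for contrast (a common frame rate for conventional frame-level speech tokenizers); it is not a figure from the paper.

```python
# Illustrative token-budget comparison (not from the paper's code).
# Sylber 2.0 emits syllable tokens at ~5 Hz (stated in the abstract);
# 50 Hz is an assumed frame rate for a conventional frame-level tokenizer.

def token_count(duration_s: float, rate_hz: float) -> int:
    """Number of tokens produced for an utterance of the given duration."""
    return round(duration_s * rate_hz)

duration = 10.0  # seconds of speech
sylber_tokens = token_count(duration, 5.0)     # ~5 Hz syllable tokens
baseline_tokens = token_count(duration, 50.0)  # assumed 50 Hz frame tokens

print(sylber_tokens, baseline_tokens, baseline_tokens // sylber_tokens)
# 50 500 10
```

Under these assumptions, syllable-level coding shortens the token sequence a spoken language model must process by roughly an order of magnitude.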
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20999