TL;DR: DiTAR substantially enhances the capability of continuous-valued autoregressive models and achieves SOTA performance in zero-shot speech generation.
Abstract: Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often suffer from excessive computational cost or suboptimal results.
In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands.
DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings, and the diffusion transformer subsequently generates the next patch based on the output of the language model.
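For intuition, the patch-based divide-and-conquer loop can be pictured roughly as follows. This is only a minimal sketch: `lm`, `dit`, and `aggregate` are hypothetical stand-ins for the language model, diffusion transformer, and patch-embedding aggregator described above, not the paper's actual implementation.

```python
import torch

def generate(lm, dit, aggregate, prompt_patches, num_patches, patch_len, dim):
    """Sketch of patch-based autoregressive generation of continuous tokens.

    prompt_patches: list of (patch_len, dim) tensors of continuous tokens.
    `lm`, `dit`, and `aggregate` are hypothetical modules, used for illustration only.
    """
    patches = list(prompt_patches)
    for _ in range(num_patches):
        # 1) Aggregate each patch of continuous tokens into a single embedding.
        patch_embs = torch.stack([aggregate(p) for p in patches])   # (T, dim)
        # 2) The language model runs over the patch-level sequence.
        lm_out = lm(patch_embs.unsqueeze(0))[:, -1]                 # (1, dim)
        # 3) The diffusion transformer denoises the next patch of continuous
        #    tokens, conditioned on the language model's last hidden state.
        noise = torch.randn(1, patch_len, dim)
        next_patch = dit.sample(noise, cond=lm_out)                 # (1, patch_len, dim)
        patches.append(next_patch.squeeze(0))
    return torch.stack(patches)
```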
For inference, we propose defining temperature as the time point at which noise is introduced during the reverse diffusion ODE, balancing diversity and determinism. We also present an extensive scaling analysis showing that DiTAR scales well. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
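To illustrate the temperature idea, here is a hedged sketch under a simple flow-matching convention (x_t = (1 - t) x_0 + t eps), where `velocity` is a hypothetical stand-in for the trained vector field; the paper's exact formulation may differ. Noise is injected once, at time tau, along the reverse ODE: tau = 1 recovers fully stochastic sampling, while tau = 0 yields a fully deterministic trajectory.

```python
import torch

def sample_with_temperature(velocity, cond, shape, tau, num_steps=32):
    """Solve the reverse ODE from t=1 to t=0, injecting noise at t = tau.

    tau = 1.0 -> noise enters at the start (standard stochastic sampling).
    tau = 0.0 -> no noise is ever injected (fully deterministic trajectory).
    `velocity` is a hypothetical denoiser/vector field, for illustration only.
    """
    x = torch.zeros(shape)                      # deterministic start at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    noise_added = False
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        if not noise_added and t_cur <= tau:
            # Introduce Gaussian noise once, scaled to the marginal level at t_cur.
            x = x + t_cur * torch.randn_like(x)
            noise_added = True
        # One explicit Euler step of the probability-flow ODE dx/dt = v(x, t).
        x = x + (t_next - t_cur) * velocity(x, t_cur, cond)
    return x
```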
Lay Summary: In this work, we propose DiTAR, a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive modeling for continuous tokens and reduces computational demands. For inference, we define temperature as the time point at which noise is introduced while solving the reverse diffusion ODE. Applied to zero-shot speech synthesis, DiTAR achieves SOTA robustness, speaker similarity, and naturalness with substantially lower computational requirements.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: continuous-valued language model; diffusion model; speech synthesis
Submission Number: 3728