Keywords: text-to-speech, flow matching, meanflow, efficiency, speed-quality tradeoff
TL;DR: We propose SplitMeanFlow, a framework that accelerates speech synthesis by learning average velocity fields, enabling one-step generation without sacrificing quality.
Abstract: Flow Matching has achieved strong performance in generative modeling, yet it is hampered by the high computational cost of iterative sampling. Recent approaches such as MeanFlow address this bottleneck by learning average velocity fields instead of instantaneous velocities. However, we demonstrate that MeanFlow's differential formulation is a special case of a more fundamental principle. In this work, we revisit the first principles of average velocity fields and derive a key algebraic identity: Interval Splitting Consistency. Building on this identity, we propose SplitMeanFlow, a novel framework that directly enforces this algebraic consistency as a core learning objective. Theoretically, we show that SplitMeanFlow recovers MeanFlow's differential identity in the limit, thereby establishing a more general and robust basis for learning average velocity fields.
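As a sketch of the identity (the notation here is assumed for illustration, not quoted from the paper): if $u(z_t, r, t) = \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau$ denotes the average velocity over $[r, t]$, then additivity of the integral at any intermediate time $s \in (r, t)$ gives

$$(t-r)\,u(z_t, r, t) = (s-r)\,u(z_s, r, s) + (t-s)\,u(z_t, s, t),$$

an algebraic relation between the model's own outputs that involves no derivatives, which is why no JVP computation is required.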
Practically, SplitMeanFlow simplifies training by eliminating the need for Jacobian-vector product (JVP) computations and enables one-step synthesis. Extensive experiments on large-scale speech synthesis tasks confirm its effectiveness: SplitMeanFlow achieves a 10$\times$ speedup and a 20$\times$ reduction in computational cost while preserving speech quality, delivering substantial efficiency gains without compromising generative performance.
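Below is a minimal PyTorch sketch of a splitting-consistency training objective under the assumptions above (linear interpolation path $z_t = (1-t)x + t\,\epsilon$, noise at $t=1$). The network interface `u_theta(z, r, t)` and the function name are hypothetical, and in practice a boundary term anchoring zero-length intervals to the instantaneous velocity would also be needed; this shows only the splitting-consistency term.

```python
import torch

def splitmeanflow_loss(u_theta, x, eps):
    """Sketch: enforce (t-r)*u(z_t,r,t) = (s-r)*u(z_s,r,s) + (t-s)*u(z_t,s,t)."""
    b = x.shape[0]
    # Sample three times per example and sort so that r < s < t in [0, 1].
    times = torch.rand(b, 3, device=x.device).sort(dim=1).values
    # Reshape for broadcasting against x (works for any trailing shape).
    r, s, t = (times[:, i].view(b, *([1] * (x.dim() - 1))) for i in range(3))
    # Linear interpolation path: data at t=0, noise at t=1 (assumed convention).
    z_t = (1 - t) * x + t * eps
    with torch.no_grad():
        # Right sub-interval [s, t]; stepping back along it gives z_s.
        u_right = u_theta(z_t, s, t)
        z_s = z_t - (t - s) * u_right
        # Left sub-interval [r, s].
        u_left = u_theta(z_s, r, s)
        # Length-weighted combination of the two halves is the target
        # (in practice one might clamp t - r away from zero).
        target = ((s - r) * u_left + (t - s) * u_right) / (t - r)
    # The full-interval prediction must match the composed, stop-gradient target.
    pred = u_theta(z_t, r, t)
    return ((pred - target) ** 2).mean()
```

Under this convention, one-step synthesis corresponds to stepping across the full interval: `x_hat = eps - u_theta(eps, r=0, t=1)`.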
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13368