TangoFlux: Text to Audio Generation with CLAP-Ranked Preference Optimization

ICLR 2026 Conference Submission16889 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: text to audio, flow matching, preference optimization
TL;DR: Text-to-audio generation with semi-online preference optimization performed on self-generated audio samples
Abstract: We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in 3.7 seconds on an A40 GPU. A key challenge in aligning TTA models lies in creating preference pairs, as TTA lacks structured mechanisms like the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We show that the audio preference dataset generated using CRPO outperforms the static alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. Model-generated audio samples for comparison are available at https://tangoflux56.github.io/TangoFlux/.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16889