TFMAudio: High-Fidelity Long-Form Text-to-Audio via Mamba-based Flow Matching

Submitted to ICLR 2026 on 15 Sept 2025 (modified: 11 Feb 2026). License: CC BY 4.0
Keywords: Text-to-Audio, Flow-Matching, Mamba, SSM, Diffusion
Abstract: Recent advances in audio generation have been dominated by transformer-based diffusion models, which struggle to extrapolate positional encodings and incur quadratic self-attention complexity, limiting their consistency and efficiency for long-form generation. To address these limitations, we propose TFMAudio, a novel latent audio generation model that combines Flow Matching with a custom-designed TFMamba backbone. TFMamba employs a dual-scan mechanism: TimeMamba captures long-range causal dependencies with linear complexity, while FrequencyMamba models spectral correlations such as harmonic structures. To enhance stability, we further introduce Energy-Aware Guidance (EAG), which mitigates state drift by adaptively regularizing classifier-free guidance. Experiments demonstrate that TFMAudio achieves state-of-the-art performance on text-to-audio benchmarks and extrapolates robustly to ultra-long sequences. Remarkably, our model generates 30-minute high-fidelity audio while preserving temporal consistency and semantic alignment, significantly advancing the scalability and usability of text-to-audio models. Demo: https://huggingface.co/spaces/tfmaudio/TFMAudio
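The abstract describes Energy-Aware Guidance as an adaptive regularization of classifier-free guidance (CFG) that counteracts state drift. The paper's exact formulation is not given here, but one plausible reading is an energy (norm) rescaling of the guided velocity toward the conditional prediction's energy. The sketch below is a minimal illustration under that assumption; the function name `energy_aware_guidance`, the blend weight `alpha`, and the rescaling rule are all hypothetical, not the authors' definition.

```python
import numpy as np


def energy_aware_guidance(v_cond, v_uncond, w=3.0, alpha=0.5, eps=1e-8):
    """Hypothetical sketch of Energy-Aware Guidance (EAG).

    Standard CFG extrapolates between the conditional and unconditional
    velocity predictions. Here the guided velocity is additionally
    rescaled so its energy (L2 norm) stays close to the conditional
    prediction's energy, then blended with the unrescaled result.
    This is an assumed formulation, not the paper's.
    """
    v_guided = v_uncond + w * (v_cond - v_uncond)  # standard CFG update
    # Energy ratio between conditional and guided predictions.
    scale = np.linalg.norm(v_cond) / (np.linalg.norm(v_guided) + eps)
    v_rescaled = v_guided * scale  # match the conditional energy
    # Interpolate between energy-matched and raw guided velocities.
    return alpha * v_rescaled + (1.0 - alpha) * v_guided


# Toy example on random velocity fields.
rng = np.random.default_rng(0)
v_c = rng.normal(size=128)
v_u = rng.normal(size=128)
out = energy_aware_guidance(v_c, v_u)
```

Because the rescaled and raw guided velocities are parallel, the output's norm lands between the conditional and guided energies, capping the amplitude growth that large guidance weights would otherwise cause.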
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5772