Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Abstract: Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g., chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and to pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare arguably the two most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch on identical datasets, with identical training configurations, and with similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on the distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Sampled audio examples are available at: https://huggingface.co/spaces/Unk-Uname/ARvsFM
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to express our gratitude to all the reviewers for their constructive feedback on our work.
This feedback has led to notable improvements in the presented work and yielded a deeper analysis that will benefit future research.
Following the suggestion made by reviewer eTk3, we revised the paper to better derive actionable insights from the reported experiments.
We have replaced the previous version of the paper with the revised one.
List of changes made in paper revisions:
- Added appendices:
- Dataset specifications (A) - specifications comparing our dataset with MusicCaps
- Latent representation performance (B) - demonstrates comparable reconstruction quality
- Fixed training setup evaluation over MusicCaps (F)
- Introduction:
- Rewrote the concise summary of conclusions to better reflect actionable insights. It now presents a summarized takeaway paragraph for each of the performed experiments (which hold the relevant conclusions).
- Included limitations of this study in the summary table
- Sec 3.3 (Conditional flow-matching background):
- Added a “Similarity to Diffusion Modeling” paragraph, noting that flow matching from Gaussian noise can be seen as equivalent to a stochastic diffusion process from a Gaussian prior.
- Sec 4 (Exp setup):
- 4.1 Clarifications on why MusicCaps was not used for evaluation; see appendix A.
- 4.2 Clarifications regarding the use of EnCodec as the chosen representation space; see appendix.
- 4.2 Backbone architecture - motivation for including skip connections in the FM case; specified the exact configuration used for the transformer backbone.
- Sec 5.1 (Fixed training setup):
- Added a clarification regarding the observed gap vs. the VAE-based representation in paragraph 1; refer to the MusicCaps evaluation set in the appendix.
- Added Table 4 + paragraph 2 - an ablation study to isolate the effect of the latent representation
- Sec 5.2 (Temporally aligned controls):
- Added a paragraph analyzing a possible hindrance introduced by the conditioning method (4th paragraph)
- Updated takeaway
- Sec 5.4 (Runtime analysis and model scaling):
- Added Figure 6 - an ablation over sequence extension and exclusion of cross-attention.
- Added a derivation of the expected runtime complexities, a comparison to practical performance, and an analysis of the reasons for differences between them.
- Added appendix K - increasing model size and its impact on inference.
- Modified takeaway.
- Sec 5.5 (Sensitivity to training configuration): slightly extended the takeaway.
- Sec 6 (Conclusions):
- Rewritten
- Added broader impact
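The equivalence noted in the Sec 3.3 change (flow matching from Gaussian noise vs. a stochastic diffusion process from a Gaussian prior) rests on the standard conditional flow-matching objective: interpolate linearly between a Gaussian prior sample and a data sample, and regress the model onto the constant velocity of that path. A minimal sketch of one training step, assuming linear interpolation paths and a placeholder model (the function names here are illustrative, not the paper's implementation):

```python
import numpy as np

def cfm_training_step(x1, model, rng):
    """One conditional flow-matching training step with a Gaussian prior.

    With linear paths x_t = (1 - t) * x0 + t * x1, the regression target
    is the constant path velocity v = x1 - x0; this Gaussian-prior setup
    is the one that can be viewed as a stochastic diffusion process.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample from the prior
    t = rng.uniform(size=(x1.shape[0], 1))    # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the probability path
    v_target = x1 - x0                        # velocity of the linear path
    v_pred = model(xt, t)                     # network's velocity estimate
    return np.mean((v_pred - v_target) ** 2)  # MSE flow-matching loss

# Toy stand-in "model" that predicts zero velocity everywhere,
# applied to a batch of 8 hypothetical 4-dimensional latents.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4))
loss = cfm_training_step(x1, lambda xt, t: np.zeros_like(xt), rng)
```

In practice the zero-velocity lambda would be a transformer backbone over the latent sequence, and the scalar loss would be backpropagated; the sketch only fixes the shape of the objective.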
Assigned Action Editor: ~Tatsuya_Harada1
Submission Number: 5061