TimbrePalette: A Controllable Multi-Style Generation Model for Timbre Enhancement

Fanzhe Fu; Yuxuan Cao; Renhong Huang; Yize Zhu; Guochen Xu; Yu Lu; Yang Yang

14 Sept 2025 (modified: 12 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Controllable Audio Generation, Timbre Enhancement, Neural Audio Effects, Wave-U-Net, Conditional Generation, Style Modeling

TL;DR: We introduce TimbrePalette, a conditional Wave-U-Net trained using "Style Anchors"—a novel paradigm where high-quality DSP chains define subjective aesthetics, allowing the model to controllably enhance audio timbre and smoothly blend between styles.

Abstract: The growing accessibility of music creation tools and the rise of AI music generation models have increased the demand for efficient, high-quality, and user-friendly tools for audio timbre enhancement. However, traditional Digital Signal Processing (DSP) effect chains often lack content-awareness, while naive deep learning approaches frequently suffer from training instability when directly imitating complex audio effects. To address these challenges, we propose TimbrePalette, a controllable multi-style timbre enhancement model based on a conditioned Wave-U-Net. Our research begins with a systematic investigation of the stability challenges inherent in waveform-to-waveform generation tasks, from which we establish a robust training framework with a stable loss function and an advanced model architecture. Building on this framework, we introduce a novel paradigm. First, we design and implement three high-quality DSP algorithms representing distinct perceptual dimensions ("Fullness", "Warmth", "Layeredness") to serve as "Style Anchors". Then, we train a single, unified TimbrePalette model to generate the corresponding enhanced audio from an explicit style command. Comprehensive objective evaluations demonstrate that our single model not only reproduces the target styles with high fidelity but also significantly outperforms both specialized single-style models and strong time-domain baselines, including Conv-TasNet. Furthermore, we quantitatively show the model's ability to smoothly "blend" between styles, which indicates that it has learned a meaningful and continuous latent space of timbre aesthetics. TimbrePalette offers musicians and creators working with AI-generated content a powerful, efficient, and creative tool for improving audio quality.
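
The abstract does not specify how the "explicit style command" is encoded or how blending is requested. As a minimal sketch, assuming the command is a normalized weight vector over the three Style Anchors and that blending is a linear interpolation between two pure styles, it might look like the following. The names `style_command` and `blend` are hypothetical and only the three style names come from the abstract.

```python
# Hypothetical sketch: the style names come from the abstract; the vector
# encoding and linear blending are assumptions, not the authors' confirmed method.
STYLES = ("Fullness", "Warmth", "Layeredness")


def style_command(weights):
    """Build a normalized conditioning vector from a {style_name: weight} dict."""
    unknown = set(weights) - set(STYLES)
    if unknown:
        raise ValueError(f"unknown styles: {sorted(unknown)}")
    raw = [float(weights.get(name, 0.0)) for name in STYLES]
    if any(w < 0 for w in raw):
        raise ValueError("weights must be non-negative")
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one style weight must be positive")
    # Normalize so the entries sum to 1 (a convex combination of anchors).
    return [w / total for w in raw]


def blend(style_a, style_b, alpha):
    """Interpolate between two styles: alpha=0 is style_a, alpha=1 is style_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    weights = {style_a: 0.0, style_b: 0.0}
    weights[style_a] += 1.0 - alpha
    weights[style_b] += alpha
    return style_command(weights)


if __name__ == "__main__":
    print(style_command({"Warmth": 1}))
    print(blend("Fullness", "Warmth", 0.25))
```

Under these assumptions, a pure style is a one-hot vector and sweeping `alpha` from 0 to 1 traces a straight line between two anchors. How the model actually consumes such a vector inside the conditioned Wave-U-Net is not stated in the abstract.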

Primary Area: generative models

Submission Number: 5048
