Neural Codec Language Model for Controllable Timbre Transfer in Music Synthesis

Sheldon Liu; Tianyu Liu; Deepak Dalakoti; Adithya Suresh; Yueying Teng; Xuefeng Liu; Atanu Roy; Randeep Bhatia; Daniel Hatadi; Prabhjeet Ghuman

Published: 14 Nov 2025, Last Modified: 16 Dec 2025 | EAIM Poster | CC BY 4.0

Keywords: Neural codec language models, Timbre transfer, Controllable music synthesis, Zero-shot generalization, Audio generation

TL;DR: We solve controllable music timbre transfer by adapting neural codec language models with implicit audio conditioning, achieving substantial improvements and releasing the first comprehensive benchmark dataset.

Abstract: Neural codec language models have revolutionized speech synthesis but face significant challenges when adapted to music generation, particularly in achieving precise timbre control while preserving melodic content. We introduce Neural Codec Language Model for Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot instrument cloning through direct audio conditioning without explicit timbre learning. Our approach combines a 385M-parameter transformer for coarse musical structure modeling with a specialized upsampler for fine timbral detail, achieving flexible control through 1-5 second reference audio segments. We establish the first comprehensive benchmark dataset for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across 50 synthesizer presets with ground truth targets. Extensive experiments demonstrate substantial improvements over the TokenSynth baseline: 27.1% in SI-SDR, 50.9% in Mel Distance, and 59.4% in STFT Distance, while maintaining strong melodic coherence (Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with performance on unseen instrument presets matching that of seen presets. Ablation studies confirm that extended reference audio duration (40.8% improvement), cross-attention mechanisms (11.9% improvement), and increased model capacity contribute meaningfully to overall performance. By separating melodic content from timbral characteristics and enabling implicit timbre control, NCLMCTT provides both immediate practical value for music creators and a methodological foundation for advancing controllable neural audio synthesis.
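The abstract reports SI-SDR without defining it. As a rough illustration, the sketch below computes the commonly used scale-invariant signal-to-distortion ratio between a generated waveform and its ground truth target. It assumes the standard definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual); whether the paper's evaluation uses exactly this variant is not stated in the abstract. The function name `si_sdr` and the `eps` parameter are hypothetical, chosen for this example.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (higher is better); a sketch, not the paper's code."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    mean_est = sum(estimate) / len(estimate)
    mean_ref = sum(reference) / len(reference)
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Scale the reference so that it best matches the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]
    # Whatever the scaled reference cannot explain counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    sr = 16000  # assumed sample rate for the toy signals
    ref = [math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]
    est = [r + 0.1 * math.cos(2 * math.pi * 1000 * n / sr) for n, r in enumerate(ref)]
    print(f"SI-SDR: {si_sdr(est, ref):.2f} dB")
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged, which is why the metric suits comparisons between systems that may differ in output loudness.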

Cameraready Material: zip

Submission Number: 19
