Keywords: Neural codec language models, Timbre transfer, Controllable music synthesis, Zero-shot generalization, Audio generation
TL;DR: We address controllable music timbre transfer by adapting neural codec language models with implicit audio conditioning, achieving substantial improvements over the TokenSynth baseline and releasing the first comprehensive benchmark dataset for the task.
Abstract: Neural codec language models have revolutionized speech synthesis but face significant
challenges when adapted to music generation, particularly in achieving precise timbre control
while preserving melodic content. We introduce Neural Code Language Model for
Controllable Timbre Transfer (NCLMCTT), a novel architecture that enables zero-shot
instrument cloning through direct audio conditioning without explicit timbre learning. Our
approach combines a 385M-parameter transformer for coarse musical structure modeling
with a specialized upsampler for fine timbral detail, achieving flexible control through 1-5
second reference audio segments. We establish the first comprehensive benchmark dataset
for controllable timbre transfer evaluation, comprising 62,500 high-fidelity samples across
50 synthesizer presets with ground-truth targets. Extensive experiments demonstrate substantial
improvements over the TokenSynth baseline: a 27.1% improvement in SI-SDR and reductions of
50.9% in Mel Distance and 59.4% in STFT Distance, while maintaining strong melodic coherence
(Chroma Similarity: 0.85). Our method achieves robust zero-shot generalization, with
performance on unseen instrument presets matching that of seen presets. Ablation studies
confirm that extended reference audio duration (40.8% improvement), cross-attention
mechanisms (11.9% improvement), and increased model capacity contribute meaningfully
to overall performance. By separating melodic content from timbral characteristics and
enabling implicit timbre control, NCLMCTT provides both immediate practical value for
music creators and a methodological foundation for advancing controllable neural audio
synthesis.
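
For readers unfamiliar with SI-SDR, the following is a minimal sketch of the commonly used definition (projection of the estimate onto the reference, then an energy ratio in dB). It is not taken from the paper: the function name `si_sdr` and the synthetic test signals are illustrative assumptions, and the authors' evaluation code may differ (for example in mean removal or framing). The sketch shows why a higher SI-SDR indicates a closer match to the ground-truth target, whereas Mel and STFT distances are lower-is-better.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better).

    Hypothetical helper for illustration; assumes equal-length,
    roughly zero-mean signals given as sequences of floats.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal is silent")

    # Project the estimate onto the reference to get the scaled target.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    # Whatever is left after removing the target counts as distortion.
    noise = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    if noise_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / noise_energy)


# Synthetic example: a sine "target" with a small or large perturbation.
n = 100
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
perturb = [math.cos(2 * math.pi * 13 * i / n) for i in range(n)]
close_estimate = [r + 0.01 * p for r, p in zip(reference, perturb)]
rough_estimate = [r + 0.10 * p for r, p in zip(reference, perturb)]

print(si_sdr(close_estimate, reference))
print(si_sdr(rough_estimate, reference))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate by a constant gain leaves the score unchanged, which is the property that makes the metric suitable for comparing systems with different output levels.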
Cameraready Material: zip
Submission Number: 19