- Abstract: Neural Style Transfer has become a popular technique for generating images of distinct artistic styles using convolutional neural networks. This recent success in image style transfer has raised the question of whether similar methods can be leveraged to alter the “style” of musical audio. In this work, we attempt long time-scale high-quality audio transfer and texture synthesis in the time-domain that captures harmonic, rhythmic, and timbral elements related to musical style, using examples that may have different lengths and musical keys. We demonstrate the ability to use randomly initialized convolutional neural networks to transfer these aspects of musical style from one piece onto another using 3 different representations of audio: the log-magnitude of the Short Time Fourier Transform (STFT), the Mel spectrogram, and the Constant-Q Transform spectrogram. We propose using these representations as a way of generating and modifying perceptually significant characteristics of musical audio content. We demonstrate each representation's shortcomings and advantages over others by carefully designing neural network structures that complement the nature of musical audio. Finally, we show that the most compelling “style” transfer examples make use of an ensemble of these representations to help capture the varying desired characteristics of audio signals.
- TL;DR: We present a long time-scale musical audio style transfer algorithm which synthesizes audio in the time-domain, but uses Time-Frequency representations of audio.
- Keywords: Musical audio, neural style transfer, Time-Frequency, Spectrogram