The Rhythm In Anything: Audio-Prompted Drums Generation With Masked Language Modeling

Published: 2025, Last Modified: 19 Dec 2025. Venue: ISMIR 2025. License: CC BY-SA 4.0
Abstract: Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. Often, these rhythmic gestures function as sketches of more complex patterns, voicing key elements while eliding or implying others. While these gestures provide an intuitive method for communicating musical ideas, realizing these ideas as fully produced drum recordings often requires significant time and skill. To bridge this gap, we present TRIA (The Rhythm In Anything), a conditional generative model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.