Generating ambient sounds and effects is a challenging task due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, at four times faster inference speed. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to generate captions for clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. When compared to state-of-the-art audio generators trained at similar size and data scale, GenAu obtains significant improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating higher quality of generated audio than previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, which is 100 times larger than existing ones. Our code, model checkpoints, and dataset will be made publicly available upon acceptance.
Left: An overview of the proposed architecture for our AutoCap model. Frozen CLAP and HTSAT audio encoders produce the audio representation.
To reduce the large number of tokens produced by the HTSAT encoder, we use a Q-Former, which cuts the number of input tokens by a factor of 4.
A pretrained BART encoder-decoder aggregates the tokens, producing the output caption.
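The token reduction attributed to the Q-Former above can be illustrated with a minimal sketch. It assumes a single cross-attention step in which a small set of query vectors attends over the encoder tokens, so the output has one vector per query. The names `cross_attention` and `compress_tokens` are hypothetical, the queries are random stand-ins for learned embeddings, and the learned projections, self-attention, and feed-forward layers of a real Q-Former are omitted. This is not the AutoCap implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Each query attends over all tokens; returns one vector per query."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def compress_tokens(tokens, reduction=4, seed=0):
    """Map N tokens to N // reduction tokens via query cross-attention."""
    rng = random.Random(seed)
    dim = len(tokens[0])
    num_queries = max(1, len(tokens) // reduction)
    # Assumption: random vectors stand in for trained query embeddings.
    queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]
    return cross_attention(queries, tokens)


# Toy input: 32 encoder tokens of dimension 8 (sizes chosen for illustration).
rng = random.Random(1)
encoder_tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
compressed = compress_tokens(encoder_tokens)
print(len(encoder_tokens), "->", len(compressed))  # 32 -> 8
```

The number of output tokens depends only on the number of queries, which is why the sequence passed to the decoder can be made four times shorter regardless of the encoder's output length.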
Right: Overview of our GenAu model, an FIT-based latent audio generator. A frozen 1D-VAE produces the latent audio representation.
Input patches are divided into groups and processed by 'local' attention layers. 'Read' and 'write' operations, implemented as cross-attention layers, transfer information between patches and latents.
Finally, 'global' attention layers process latent tokens with attention spanning all groups, enabling global communication.
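The order of operations in the caption (local attention, read, global attention, write) can be sketched as follows. This is a simplified illustration under stated assumptions, not the GenAu code: it assumes each patch group has its own small set of latent tokens, uses single-head attention with a residual connection, and leaves out learned projections, normalization, feed-forward layers, and text or timestep conditioning. The names `attend`, `fit_block`, and `random_vectors` are hypothetical.

```python
import math
import random


def attend(queries, context):
    """Single-head dot-product attention with a residual connection."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, c)) for c in context]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        mixed = [
            sum(w * c[j] for w, c in zip(weights, context)) / norm
            for j in range(dim)
        ]
        updated.append([a + m for a, m in zip(q, mixed)])
    return updated


def fit_block(patch_groups, latent_groups):
    """One local -> read -> global -> write pass over grouped patches."""
    # Local: patches attend only to patches in the same group.
    patch_groups = [attend(g, g) for g in patch_groups]
    # Read: each group's latents cross-attend to that group's patches.
    latent_groups = [attend(l, g) for l, g in zip(latent_groups, patch_groups)]
    # Global: latents from all groups attend to each other.
    per_group = len(latent_groups[0])
    flat = [v for l in latent_groups for v in l]
    flat = attend(flat, flat)
    latent_groups = [flat[i:i + per_group] for i in range(0, len(flat), per_group)]
    # Write: patches cross-attend to their group's updated latents.
    patch_groups = [attend(g, l) for g, l in zip(patch_groups, latent_groups)]
    return patch_groups, latent_groups


def random_vectors(rng, count, dim):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(count)]


# Toy sizes chosen for illustration: 3 groups, 6 patches and 2 latents each.
rng = random.Random(0)
patches = [random_vectors(rng, 6, 4) for _ in range(3)]
latents = [random_vectors(rng, 2, 4) for _ in range(3)]
new_patches, new_latents = fit_block(patches, latents)
print(len(new_patches), len(new_patches[0]), len(new_latents), len(new_latents[0]))
```

In this sketch, patches in different groups never attend to each other directly. Information crosses group boundaries only through the global step over the latents, which is far cheaper than full attention over every patch.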
Here we compare our captioning method with ENCLAP and CoNeTTE. The first column holds the original audio from the AudioCaps test set and the second its ground-truth caption. The third column shows the caption predicted by our method, and the fourth and fifth columns correspond to the baselines. Our method consistently generates more descriptive and accurate captions. For instance, it is the only method that captures the water splashing in the first example and identifies both birds chirping and insects buzzing in the second.
Input | Ground-truth Caption | Ours | ENCLAP | CoNeTTE |
---|---|---|---|---|
(audio) | A man talking as ocean waves trickle and splash while wind blows into a microphone | A man speaks as wind blows and water splashes | A man is speaking and wind is blowing | A man is speaking and wind is blowing |
(audio) | An adult male speaks, birds chirp in the background, and many insects are buzzing | Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby | Birds are chirping and a man speaks | A man speaking with birds chirping in the background. |
(audio) | A telephone dialing tone followed by a plastic switch flipping on and off | A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface | A telephone dialing followed by a series of electronic beeps | A telephone ringing followed by a beep. |
(audio) | A running train and then a train whistle | A train moves getting closer and a horn is triggered | A train running on railroad tracks followed by a train horn blowing as wind blows into a microphone | A train horn blows and a steam whistle is blowing |
(audio) | A female speaking with some rustling followed by another female speaking | Dishes are being moved and a woman laughs and speaks | A woman speaking followed by clanking | A woman is speaking and a child is laughing. |
(audio) | A child is speaking followed by a door moving | A child speaks followed by a loud crash and a scream | A young girl speaks followed by a loud bang | A woman speaking followed by a door opening and closing. |
(audio) | Water splashing as a baby is laughing and birds chirp in the background | A baby laughs and splashes, and an adult female speaks | A baby laughs and splashes in water | A baby is laughing and people are talking. |
(audio) | Leaves rustling in the wind with dogs barking and birds chirping | Birds chirp in the distance, and then a dog barks nearby | Birds chirp and a dog barks | A dog is barking and a person is walking. |
(audio) | Tapping followed by water spraying and more tapping | Some light rustling followed by a clank then water pouring | A faucet is turned on and runs | A toilet is flushed and water is running. |
We compare our method with state-of-the-art approaches on non-cherry-picked examples. The first row lists the method names, while the first column contains the input text used to generate the audio. Our method generates audio clips with overall better quality, realism, and prompt alignment. It is the only method that captures all events in the first example and produces the most realistic audio for the second. Additionally, it is the only method to generate the horse 'growling' in the third example.
Input | Ours | Make-an-audio | AudioLDM | AudioLDM2 | Stable Audio | Tango |
---|---|---|---|---|---|---|
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone | ||||||
A small child and woman speak with splashing water | ||||||
Horses growl and clop hooves. | ||||||
A woman speaks with chirping frogs and distant music playing | ||||||
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone | ||||||
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone | ||||||
A man speaks with a high frequency hum with some banging and clanking |
We compare different variants of our method. The top row shows the captions used for generation, and the second row shows a U-Net baseline. The next three rows present our small model trained on AudioCaps, on the full dataset without recaptioning, and on the fully recaptioned dataset. The final two rows show our large model trained on AudioCaps and on the fully recaptioned dataset.
Model | A person briefly talks followed quickly by toilet flushing and another voice from another person | A gun cocking then firing as metal clanks on a hard surface followed by a man talking during an electronic laser effect as gunshots and explosions go off in the distance | Motorcycle starting and taking off | A female laughs, snoring occurs, and an adult male speaks in the background |
---|---|---|---|---|
W/ U-Net | ||||
Small (AudioCaps) | ||||
Small (AutoReCap w/o Recaptioning) | ||||
Small (AutoReCap) | ||||
Large (AudioCaps) | ||||
Large (AutoReCap) |
High-resolution videos were excluded due to size limitations.
Here we provide a non-curated list of samples generated with GenAu-L using AudioCaps test captions.
Input | Sample |
---|---|
A man speaks followed by a toilet flush | |
A man making a horn sound and then speaking | |
A woman talks and a baby whispers | |
A group of people laughing followed by farting | |
A woman talking followed by a group of people laughing as plastic crinkles | |
Sustained industrial engine noise | |
A vehicle engine revving as a crowd of people talk | |
A speedboat is racing across water with loud wind noise | |
A kid crying as a man and a woman talk followed by a car door opening then closing | |
A cat meowing and young female speaking | |
Police car siren starts with two horn blasts then becomes a high pitched wail | |
Rustling pigeons coo | |
A man and woman laughing followed by a man shouting then a woman laughing as a child laughs | |
A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle | |
A motor is running, and metal clanging is present | |
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone | |
An adult male is speaking in a quiet environment | |
A small motor is buzzing and water is running, splashing and gurgling | |
A man talking followed by a brush scrapping then liquid spraying in the background | |
An idle vehicle engine running followed by a gear cranking then revving | |
A cat meows and hisses | |
Ocean waves crashing as a man talks in the distance and wind heavily blows into a microphone | |
A man speaking then a baby crying, duck quacking in background and finally a woman speaking | |
Some groaning followed by a woman speaking | |
A crowd murmurs as a siren blares and then stops at a distance | |
A large bell rings out multiple times | |
A cat meowing once with a thud | |
A drilling sound with humming in the background | |
A cat meowing twice | |
A rolling train blows its horn multiple times | |
A man speaks and a vehicle passes | |
A woman is speaking over a microphone | |
A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding | |
Instrumental music playing as a woman speaks followed by rain pouring then rain falling on a surface | |
A mid-size motor vehicle engine accelerates and is accompanied by hissing and spinning tires, then it decelerates and an adult male begins to speak | |
Humming of an accelerating engine with wind passing and rustling | |
Humming and vibrating of a power tool with some high frequency squealing | |
A car is passing by with leaves rustling | |
Train engine as it travels | |
A dog barks with distant birds chirping then people speak | |
Some child speaking in the distant and a toilet flushing | |
A man talking while bongos play followed by frogs croaking | |
A group of people talking in the background as compressed air sprays while a tin can rattles followed by a man talking | |
An adult female speaks in a quiet environment | |
A subway train signal plays followed by a bell chiming followed by a horn honking as a crowd of people talk in the background | |
Wind blowing and water splashing | |
Loud humming followed by hissing | |
Thuds on floor | |
Waves and wind rake a shore | |
Clicking and sputtering of a running engine with people speaking and wind blowing | |
Splashing water with children speaking and people screaming with a distant blow of a whistle | |
A man is speaking followed by a tap and motorcycle turning on | |
Metal shuffling followed by plastic clicking as wind blows into a microphone | |
A woman speaking | |
A stream of water trickling as plastic clanks against a metal surface followed by water pouring down a drain alongside a camera muffling | |
An adult male gives a speech | |
An aircraft engine running then slowing down after a plastic click | |
A sound of vibrating motor | |
Male speaking, laughter and shouting and clapping | |
A woman speaks briefly, and a muffled engine rumbles | |
Plastic clacking followed by as person breathing then liquid pouring into containers | |
A woman and a man talking as another man talks softly and papers shuffle in the background | |
Someone burps and then laughs | |
Chirping birds near and far | |
A dog barking as a man is talking while birds chirp and wind blows into a microphone | |
A sudden horn blare as a train passes | |
A man is speaking while typing | |
Motorcycle starting then driving away | |
A group of men speaking as cannons fire while rain falls and water splashes followed by thunder roaring | |
A man speaking followed by another man speaking with some rustling | |
Waves roll slowly and water swirls as the wind blows | |
A train sounds horn while traveling on train track | |
Water splashing and trickling as wind blows into a microphone while a man speaks over a radio | |
Some rustling then silence then traffic passing in the distance with a cat meowing | |
A vehicle motor running idle followed by a car horn honking then a group of men groaning | |
A car engine is revving while driving | |
Cats meowing and then wind | |
There is a mature male talking to some animals | |
A man speaking on a microphone as a crowd of people laugh followed by glass clinking | |
An engine and speech on a loudspeaker | |
An idle vehicle engine running normally before stuttering | |
An engine revving and then tires squealing | |
Chainsaw being run | |
A man speaks with low speech in the background | |
Race cars are racing followed by people talking | |
The sound of horn from a car approaching from a distance | |
A long burp ends in a sigh | |
A man speaking as vehicles drive by and leaves rustling | |
A chainsaw cutting as wood is cracking | |
A cat meows and a woman speaks | |
A gun cocking then firing as metal clanks on a hard surface followed by a man talking during an electronic laser effect as gunshots and explosions go off in the distance | |
A male voice and a machine buzzing | |
Speech followed by quietness and a man speaks and laughs | |
A man speaks with some clinking and clanking | |
A man speaking as rain lightly falls followed by thunder | |
A bell is ringing | |
A woman is speaking from a microphone | |
Wind is blowing and heavy rain is falling and splashing | |
Muffled sounds followed by metal being hit | |
A vehicle driving as a man and woman are talking and laughing |