Taming Data and Transformers for Audio Generation


Anonymous Authors

Abstract


Generating ambient sounds and effects is a challenging task due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. By using a compact audio representation and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over the best available captioning model, with four times faster inference. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. Using AutoCap to caption clips from existing audio datasets, we demonstrate the benefits of data scaling with synthetic captions as well as model size scaling. When compared to state-of-the-art audio generators trained at similar size and data scale, GenAu obtains significant improvements of 4.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating higher quality of generated audio than previous works. Moreover, we propose an efficient and scalable pipeline for collecting audio datasets, enabling us to compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset, which is 100 times larger than existing ones. Our code, model checkpoints, and dataset will be made publicly available upon acceptance.

Left: Overview of the proposed architecture for our AutoCap model. Frozen CLAP and HTSAT audio encoders produce the audio representation. Because the HTSAT encoder produces a large number of tokens, we use a Q-Former to reduce the number of input tokens by a factor of 4. A pretrained BART encoder-decoder aggregates the tokens and produces the output caption.
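The token reduction described in this caption can be illustrated with a minimal sketch. It assumes a Q-Former-style reduction in which a small set of query vectors cross-attends to the encoder tokens, so the output length equals the number of queries. The names `cross_attention` and `reduce_tokens`, the token count of 32, the feature size of 8, and the random stand-in values are illustrative assumptions. Learned projections, multiple heads, and the actual HTSAT and BART components are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head attention without learned projections (a simplification).

    Each query returns a weighted average of the context tokens.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def reduce_tokens(encoder_tokens, queries):
    """Hypothetical helper: one output token per query."""
    return cross_attention(queries, encoder_tokens)


rng = random.Random(0)
dim = 8                  # assumed feature size
num_encoder_tokens = 32  # assumed token count, for illustration only
reduction_factor = 4     # factor stated in the caption

# Random stand-ins for encoder outputs and for the (normally learned) queries.
encoder_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_encoder_tokens)]
queries = [
    [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(num_encoder_tokens // reduction_factor)
]

compact = reduce_tokens(encoder_tokens, queries)
print(len(encoder_tokens), "->", len(compact))  # prints: 32 -> 8
```

The number of queries alone fixes the output length, which is why the sequence passed to the caption decoder can be made four times shorter regardless of how the encoder tokens are produced.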
Right: Overview of our GenAu model, a latent audio generator based on the FIT architecture. A frozen 1D-VAE produces the latent audio representation. Input patches are divided into groups and processed by 'local' attention layers. 'Read' and 'write' operations, implemented as cross-attention layers, transfer information between patches and latents. Finally, 'global' attention layers process the latent tokens with attention spanning all groups, enabling global communication.
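The interplay of local, read, global, and write operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the GenAu implementation: each group is assumed to own its own latent tokens, the order local, read, global, write is assumed, and attention is single-head with a residual connection and no learned projections or feed-forward layers. The names `attend` and `fit_block` and all sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Residual single-head attention without learned projections (simplified)."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        mix = [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        updated.append([qd + md for qd, md in zip(q, mix)])
    return updated


def fit_block(patch_groups, latent_groups):
    """One hypothetical block: local -> read -> global -> write."""
    # Local: patches attend only to patches of the same group.
    patch_groups = [attend(group, group) for group in patch_groups]
    # Read: each group's latents cross-attend to that group's patches.
    latent_groups = [attend(lat, pat) for lat, pat in zip(latent_groups, patch_groups)]
    # Global: latents from all groups attend to each other.
    flat = [token for group in latent_groups for token in group]
    flat = attend(flat, flat)
    per_group = len(latent_groups[0])
    latent_groups = [flat[i:i + per_group] for i in range(0, len(flat), per_group)]
    # Write: each group's patches cross-attend to that group's latents.
    patch_groups = [attend(pat, lat) for pat, lat in zip(patch_groups, latent_groups)]
    return patch_groups, latent_groups


rng = random.Random(0)


def random_tokens(count, dim):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


# Assumed sizes, for illustration only.
num_groups, patches_per_group, latents_per_group, dim = 3, 4, 2, 8
patches = [random_tokens(patches_per_group, dim) for _ in range(num_groups)]
latents = [random_tokens(latents_per_group, dim) for _ in range(num_groups)]

new_patches, new_latents = fit_block(patches, latents)
print(len(new_patches), len(new_patches[0]), len(new_latents), len(new_latents[0]))
```

In this sketch, patches of different groups never attend to each other directly. Information crosses group boundaries only through the global step on the much smaller set of latent tokens.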


Comparison of AutoCap with other audio captioning methods:

Here we compare our captioning method with ENCLAP and CoNeTTE. The first column contains the original audio from the AudioCaps test set and the second the ground-truth caption. The third column shows the caption predicted by our method, and the fourth and fifth columns correspond to the baselines. Our method consistently generates more descriptive and accurate captions. For instance, it is the only method that captures the water splashing in the first example and identifies both birds chirping and insects buzzing in the second.

In the text-to-audio comparison in the following section, our method generates audio clips with overall better quality, realism, and prompt alignment. It is the only method that captures all events in the first example and produces the most realistic audio for the second. Additionally, it is the only method to generate the horse 'growling' in the third example.

Input | Ground-truth Caption | Ours | ENCLAP | CoNeTTE
(audio) | A man talking as ocean waves trickle and splash while wind blows into a microphone | A man speaks as wind blows and water splashes | A man is speaking and wind is blowing | A man is speaking and wind is blowing
(audio) | An adult male speaks, birds chirp in the background, and many insects are buzzing | Birds chirp in the distance, followed by a man speaking nearby, after which insects buzz nearby | Birds are chirping and a man speaks | A man speaking with birds chirping in the background.
(audio) | A telephone dialing tone followed by a plastic switch flipping on and off | A telephone dialing followed by a series of plastic clicking then plastic clanking before plastic thumps on a surface | A telephone dialing followed by a series of electronic beeps | A telephone ringing followed by a beep.
(audio) | A running train and then a train whistle | A train moves getting closer and a horn is triggered | A train running on railroad tracks followed by a train horn blowing as wind blows into a microphone | A train horn blows and a steam whistle is blowing
(audio) | A female speaking with some rustling followed by another female speaking | Dishes are being moved and a woman laughs and speaks | A woman speaking followed by clanking | A woman is speaking and a child is laughing.
(audio) | A child is speaking followed by a door moving | A child speaks followed by a loud crash and a scream | A young girl speaks followed by a loud bang | A woman speaking followed by a door opening and closing.
(audio) | Water splashing as a baby is laughing and birds chirp in the background | A baby laughs and splashes, and an adult female speaks | A baby laughs and splashes in water | A baby is laughing and people are talking.
(audio) | Leaves rustling in the wind with dogs barking and birds chirping | Birds chirp in the distance, and then a dog barks nearby | Birds chirp and a dog barks | A dog is barking and a person is walking.
(audio) | Tapping followed by water spraying and more tapping | Some light rustling followed by a clank then water pouring | A faucet is turned on and runs | A toilet is flushed and water is running.

Comparison of GenAu with text-to-audio generation methods:

We compare our method with state-of-the-art approaches on non-cherry-picked examples. The first row lists the method names, while the first column contains the input text used to generate the audio.

                  Input                        Ours Make-an-audio AudioLDM AudioLDM2 Stable Audio Tango
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone
A small child and woman speak with splashing water
Horses growl and clop hooves.
A woman speaks with chirping frogs and distant music playing
A vehicle driving by while splashing water as a stream of water trickles and flows followed by a thunder roaring in the distance while wind blows into a microphone
Large church bells ring as rain falls on a hard surface and wind blows lightly into a microphone
A man speaks with a high frequency hum with some banging and clanking

Ablating different variants of GenAu:

We compare different variants of our method. The top row shows the captions used for generation, and the second row shows a U-Net baseline. The next three rows present our small model trained on AudioCaps, on the full dataset without recaptioning, and on the fully recaptioned dataset. The final two rows show our large model trained on AudioCaps and on the fully recaptioned dataset.

Captions: A person briefly talks followed quickly by toilet flushing and another voice from another person | A gun cocking then firing as metal clanks on a hard surface followed by a man talking during an electronic laser effect as gunshots and explosions go off in the distance | Motorcycle starting and taking off | A female laughs, snoring occurs, and an adult male speaks in the background
W/ U-Net
Small (AudioCaps)
Small (AutoReCap w/o Recaptioning)
Small (AutoReCap)
Large (AudioCaps)
Large (AutoReCap)

Random samples from AutoReCap-XL:

High-resolution videos were excluded due to size limitations.

Non-curated list of samples generated by GenAu:

Here we provide a non-curated list of samples generated with GenAu-L using AudioCaps test captions.

                  Input                        Sample
A man speaks followed by a toilet flush
A man making a horn sound and then speaking
A woman talks and a baby whispers
A group of people laughing followed by farting
A woman talking followed by a group of people laughing as plastic crinkles
Sustained industrial engine noise
A vehicle engine revving as a crowd of people talk
A speedboat is racing across water with loud wind noise
A kid crying as a man and a woman talk followed by a car door opening then closing
A cat meowing and young female speaking
Police car siren starts with two horn blasts then becomes a high pitched wail
Rustling pigeons coo
A man and woman laughing followed by a man shouting then a woman laughing as a child laughs
A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle
A motor is running, and metal clanging is present
A muffled man talking as a goat baas before and after two goats baaing in the distance while wind blows into a microphone
An adult male is speaking in a quiet environment
A small motor is buzzing and water is running, splashing and gurgling
A man talking followed by a brush scrapping then liquid spraying in the background
An idle vehicle engine running followed by a gear cranking then revving
A cat meows and hisses
Ocean waves crashing as a man talks in the distance and wind heavily blows into a microphone
A man speaking then a baby crying, duck quacking in background and finally a woman speaking
Some groaning followed by a woman speaking
A crowd murmurs as a siren blares and then stops at a distance
A large bell rings out multiple times
A cat meowing once with a thud
A drilling sound with humming in the background
A cat meowing twice
A rolling train blows its horn multiple times
A man speaks and a vehicle passes
A woman is speaking over a microphone
A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding
Instrumental music playing as a woman speaks followed by rain pouring then rain falling on a surface
A mid-size motor vehicle engine accelerates and is accompanied by hissing and spinning tires, then it decelerates and an adult male begins to speak
Humming of an accelerating engine with wind passing and rustling
Humming and vibrating of a power tool with some high frequency squealing
A car is passing by with leaves rustling
Train engine as it travels
A dog barks with distant birds chirping then people speak
Some child speaking in the distant and a toilet flushing
A man talking while bongos play followed by frogs croaking
A group of people talking in the background as compressed air sprays while a tin can rattles followed by a man talking
An adult female speaks in a quiet environment
A subway train signal plays followed by a bell chiming followed by a horn honking as a crowd of people talk in the background
Wind blowing and water splashing
Loud humming followed by hissing
Thuds on floor
Waves and wind rake a shore
Clicking and sputtering of a running engine with people speaking and wind blowing
Splashing water with children speaking and people screaming with a distant blow of a whistle
A man is speaking followed by a tap and motorcycle turning on
Metal shuffling followed by plastic clicking as wind blows into a microphone
A woman speaking
A stream of water trickling as plastic clanks against a metal surface followed by water pouring down a drain alongside a camera muffling
An adult male gives a speech
An aircraft engine running then slowing down after a plastic click
A sound of vibrating motor
Male speaking, laughter and shouting and clapping
A woman speaks briefly, and a muffled engine rumbles
Plastic clacking followed by as person breathing then liquid pouring into containers
A woman and a man talking as another man talks softly and papers shuffle in the background
Someone burps and then laughs
Chirping birds near and far
A dog barking as a man is talking while birds chirp and wind blows into a microphone
A sudden horn blare as a train passes
A man is speaking while typing
Motorcycle starting then driving away
A group of men speaking as cannons fire while rain falls and water splashes followed by thunder roaring
A man speaking followed by another man speaking with some rustling
Waves roll slowly and water swirls as the wind blows
A train sounds horn while traveling on train track
Water splashing and trickling as wind blows into a microphone while a man speaks over a radio
Some rustling then silence then traffic passing in the distance with a cat meowing
A vehicle motor running idle followed by a car horn honking then a group of men groaning
A car engine is revving while driving
Cats meowing and then wind
There is a mature male talking to some animals
A man speaking on a microphone as a crowd of people laugh followed by glass clinking
An engine and speech on a loudspeaker
An idle vehicle engine running normally before stuttering
An engine revving and then tires squealing
Chainsaw being run
A man speaks with low speech in the background
Race cars are racing followed by people talking
The sound of horn from a car approaching from a distance
A long burp ends in a sigh
A man speaking as vehicles drive by and leaves rustling
A chainsaw cutting as wood is cracking
A cat meows and a woman speaks
A gun cocking then firing as metal clanks on a hard surface followed by a man talking during an electronic laser effect as gunshots and explosions go off in the distance
A male voice and a machine buzzing
Speech followed by quietness and a man speaks and laughs
A man speaks with some clinking and clanking
A man speaking as rain lightly falls followed by thunder
A bell is ringing
A woman is speaking from a microphone
Wind is blowing and heavy rain is falling and splashing
Muffled sounds followed by metal being hit
A vehicle driving as a man and woman are talking and laughing