Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition


Abstract

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches to text-to-audio (TTA) tasks often make a single-pass inference from the text description. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we use a pre-trained TTA diffusion network as the audio generation agent, working in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. Consequently, Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. We propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Our framework thus provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.



Figure 2: Overview of the TTA part. We use GPT-4 to convert a complex audio generation process into multiple generation steps and combine inference results.
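
The decompose-then-combine idea in this caption can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the `Step` plan format, the `fake_tta` stand-in (a sine tone in place of the diffusion model), the 16 kHz sample rate, and the hand-written plan, which only imitates what an LLM might return for the caption "A man enters his car and drives away".

```python
from __future__ import annotations

import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch only


@dataclass
class Step:
    """One atomic instruction (hypothetical plan format): caption plus timeline slot."""
    caption: str
    start_s: float
    duration_s: float


def fake_tta(caption: str, duration_s: float) -> list[float]:
    """Stand-in for the TTA agent: a sine tone whose pitch depends on the caption length."""
    n = int(round(duration_s * SAMPLE_RATE))
    freq = 220.0 + 20.0 * (len(caption) % 20)
    return [0.5 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def compose(steps: list[Step], generate=fake_tta) -> list[float]:
    """Generate one clip per step and mix the clips onto a shared timeline."""
    total_s = max(s.start_s + s.duration_s for s in steps)
    mix = [0.0] * int(round(total_s * SAMPLE_RATE))
    for s in steps:
        clip = generate(s.caption, s.duration_s)
        offset = int(round(s.start_s * SAMPLE_RATE))
        for i, x in enumerate(clip):
            if offset + i < len(mix):
                mix[offset + i] += x  # overlapping steps are summed
    # Rescale only if the summed clips would clip beyond [-1, 1].
    peak = max((abs(x) for x in mix), default=0.0)
    if peak > 1.0:
        mix = [x / peak for x in mix]
    return mix


# Hand-written plan imitating an LLM decomposition of a complex caption.
plan = [
    Step("a car door opens and closes", 0.0, 2.0),
    Step("a car engine starts", 2.0, 3.0),
    Step("a car drives away", 5.0, 5.0),
]
audio = compose(plan)
```

Because the output length is set by the plan rather than by a fixed model window, the same loop also suggests how variable-length output could be assembled; in a real system `generate` would wrap the pre-trained TTA model.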



Figure 3: Overview of the generation backbone. We build on top of the pre-trained Auffusion model for both TTA and VTA generation.




Text-to-Audio generation

Single Caption:

Young children are whistling and laughing
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Plastic clanking as a horse trots and a woman talks in the background
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A child laughs, a man speaks, and people laugh
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

People speaking with loud bangs followed by a slow motion rumble
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A man speaks followed by loud snoring
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A man whistling followed by a man yelling as plastic rustles and clanks in the background
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A clock chimes and ticktocks
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Two Captions:

Repetitive faint snoring followed by two men speaking
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A croaking frog with brief bird chirps followed by a man talking as birds chirp in the background followed by a loud popping
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Pigeons cooing and bird wings flapping as footsteps shuffle on paper followed by motor sounds with male speaking
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A vehicle accelerating in the distance then driving by followed by multiple gunfire sounds, and men speak
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Whistling and then a female singing followed by woman speaking in a quiet environment
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Distant thumping with some lights wind followed by water splashing occurs while a person quacks to imitate a duck and an adult female laughs
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A female voice and a duck quacking followed by wind noise on microphone with waves splashing in the background
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

Complex Captions:

A man enters his car and drives away
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A couple decorates a room, hangs pictures, and admires their work
AudioGen:
AudioLDM2:
Auffusion:
Our Method:

A woman packs a suitcase, locks her house, and walks to the bus station
AudioGen:
AudioLDM2:
Auffusion:
Our Method:


Video-to-Audio generation

Six video examples, each comparing audio from FoleyCrafter with audio from our method (Ours).


Conversation example

We provide audio output for Figure 4.

A man enters his car and drives away.
Add "a man talks"
Edit "driving away" by "playing loud music"
Combination with reasoning about the missing part


Long audio example

We provide audio output for Figure 5.

A river stream of water flowing followed by typing on a computer keyboard
Auffusion:
Our Method:

A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding followed by a door shutting and a female speaking
Auffusion:
Our Method:

A woman delivering a speech followed by a male speech and static
Auffusion:
Our Method:

Continuous white noise followed by a vehicle driving as a man and woman are talking and laughing
Auffusion:
Our Method: