Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Abstract
We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches to text-to-audio (TTA) tasks often make a single-pass inference from the text description. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, a pre-trained TTA diffusion network serves as the audio generation agent and works in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. Consequently, Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. We propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Our framework thus provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.
Figure 2: Overview of the TTA part. We use GPT-4 to convert a complex audio generation process into multiple generation steps and combine inference results.
Figure 3: Overview of the generation backbone. We build on top of the pre-trained Auffusion model for both TTA and VTA generation.
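The decompose-then-generate idea from the abstract and Figure 2 can be illustrated with a small sketch. This is not the authors' implementation: the `AtomicStep` schema, the hand-written plan, the 16 kHz sample rate, and the `fake_tta_agent` stand-in (which returns a sine tone instead of calling a diffusion model) are all assumptions made for illustration. In the real system, GPT-4 would produce the plan and the pre-trained TTA model would produce each clip.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch


@dataclass
class AtomicStep:
    """One atomic instruction, as a planner LLM might emit it (hypothetical schema)."""
    prompt: str
    start_s: float
    duration_s: float


def fake_tta_agent(prompt: str, duration_s: float) -> list[float]:
    """Stand-in for the TTA diffusion model: returns a quiet sine tone.

    The frequency depends on the prompt length only so that different prompts
    give different signals; it carries no meaning.
    """
    n = int(duration_s * SAMPLE_RATE)
    freq = 200.0 + 20.0 * (len(prompt) % 10)
    return [0.1 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def compose(steps: list[AtomicStep]) -> list[float]:
    """Generate each step separately and mix the clips onto one timeline."""
    # Timeline length is computed in samples so every clip is guaranteed to fit.
    total = max(
        int(s.start_s * SAMPLE_RATE) + int(s.duration_s * SAMPLE_RATE) for s in steps
    )
    timeline = [0.0] * total
    for step in steps:
        clip = fake_tta_agent(step.prompt, step.duration_s)
        offset = int(step.start_s * SAMPLE_RATE)
        for i, sample in enumerate(clip):
            timeline[offset + i] += sample  # overlapping clips are summed
    return timeline


# Hand-written plan for "A man enters his car and drives away" (illustrative only).
plan = [
    AtomicStep("a car door opens and closes", 0.0, 3.0),
    AtomicStep("a car engine starts", 3.0, 2.0),
    AtomicStep("a car drives away", 5.0, 5.0),
]
audio = compose(plan)
```

Because the total length follows from the plan rather than from a fixed model output size, a scheme like this naturally yields variable-length audio.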
Text-to-Audio generation
Single Caption:
Young children are whistling and laughing | Plastic clanking as a horse trots and a woman talks in the background | A child laughs, a man speaks, and people laugh |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
People speaking with loud bangs followed by a slow motion rumble | A man speaks followed by loud snoring | A man whistling followed by a man yelling as plastic rustles and clanks in the background |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
A clock chimes and ticktocks |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
Two Captions:
Repetitive faint snoring followed by two men speaking | A croaking frog with brief bird chirps followed by a man talking as birds chirp in the background followed by a loud popping | Pigeons cooing and bird wings flapping as footsteps shuffle on paper followed by motor sounds with male speaking |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
A vehicle accelerating in the distance then driving by followed by multiple gunfire sounds, and men speak | Whistling and then a female singing followed by woman speaking in a quiet environment | Distant thumping with some light wind followed by water splashing occurs while a person quacks to imitate a duck and an adult female laughs |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
A female voice and a duck quacking followed by wind noise on microphone with waves splashing in the background |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
Complex Captions:
A man enters his car and drives away | A couple decorates a room, hangs pictures, and admires their work | A woman packs a suitcase, locks her house, and walks to the bus station |
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
AudioGen:
AudioLDM2:
Auffusion:
Our Method:
|
Video-to-Audio generation
Conversation example
We provide the audio output for Figure 4.
A man enters his car and drives away. | Add "a man talks" | Edit "driving away" by "playing loud music" |
Combination with reasoning about the missing part |
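One way to picture the add and edit turns above is as operations on a stored plan of audio events. The sketch below is an assumption, not the method used by Audio-Agent: the `Event` schema, the helper names `add_event` and `edit_event`, and the timings are all hypothetical, and this page does not state how the system tracks conversation state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    """One entry of a hypothetical audio plan kept across conversation turns."""
    prompt: str
    start_s: float
    duration_s: float


def add_event(plan: list[Event], event: Event) -> list[Event]:
    """Return a new plan with `event` added, ordered by start time."""
    return sorted(plan + [event], key=lambda e: e.start_s)


def edit_event(plan: list[Event], old_prompt: str, new_prompt: str) -> list[Event]:
    """Return a new plan where the event matching `old_prompt` gets `new_prompt`.

    Timing is kept, so only the edited clip would need to be regenerated.
    """
    if not any(e.prompt == old_prompt for e in plan):
        raise ValueError(f"no event with prompt {old_prompt!r}")
    return [replace(e, prompt=new_prompt) if e.prompt == old_prompt else e for e in plan]


# Turn 1: a hand-written plan for "A man enters his car and drives away."
plan = [
    Event("a car door opens and closes", 0.0, 3.0),
    Event("driving away", 3.0, 7.0),
]
# Turn 2: Add "a man talks"
plan = add_event(plan, Event("a man talks", 1.0, 2.0))
# Turn 3: Edit "driving away" by "playing loud music"
plan = edit_event(plan, "driving away", "playing loud music")
```

Keeping the plan immutable and returning a new list per turn makes it easy to compare consecutive turns and regenerate only the events that changed.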
Long audio example
We provide the audio output for Figure 5.
A river stream of water flowing followed by typing on a computer keyboard | A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding followed by a door shutting and a female speaking | A woman delivering a speech followed by a male speech and static |
Auffusion:
Our Method:
|
Auffusion:
Our Method:
|
Auffusion:
Our Method:
|
Continuous white noise followed by a vehicle driving as a man and woman are talking and laughing |
Auffusion:
Our Method:
|