<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <meta name="generator" content="Hugo 0.66.0" />
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,600" rel="stylesheet" type="text/css">
  <link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.0.0/dist/css/bootstrap.min.css" integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
  <title>Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition</title>
</head>

<body rightmargin="150" leftmargin="150" topmargin="100" bottommargin="100" style="line-height:160%">
  <font size="5">

    <div class="container">

      <header role="banner">

      </header>
      <main role="main">
        <article itemscope itemtype="https://schema.org/BlogPosting">
          <br></br>
          <h1 itemprop="headline" align="center">
            <font color="000093" size="7">Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition<br>
            </font>
          </h1>

          <section itemprop="entry-text">
            <br>
            <h2 id="abstract">
              <font color="000093">Abstract</font>
            </h2>
            <p style="text-align: justify;">
              <font color="061E61"> We introduce Audio-Agent, a multimodal framework for audio generation, editing and 
                composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often 
                make single-pass inferences from text descriptions. While straightforward, this design struggles to 
                produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained 
                TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the 
                text condition into atomic, specific instructions, and calls the agent for audio generation. Consequently,
                Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while 
                also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require
                training a timestamp detector to synchronize video events with generated audio, a process that can be tedious 
                and time-consuming. We propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), 
                e.g., Gemma2-2B-it, to obtain both the semantic and temporal conditions that bridge the video and audio 
                modalities. Thus, our framework provides a comprehensive solution for both TTA and VTA tasks without 
                substantial computational overhead in training. </font>
            </p>

            <br></br>
            <figure>
              <p align="center"><img src="assets/arch.png" width="100%" class="center" /></p>
              <figcaption>
                <p style="text-align: justify;">
                  <font color="061E61"><b>Figure 2:</b> Overview of the TTA part. We use GPT-4 to convert a complex 
                    audio generation process into multiple generation steps and combine inference results.</font>
                </p>
              </figcaption>
            </figure>

            <br></br>
            
            <figure>
              <p align="center"><img src="assets/backbone.png" width="100%" class="center" /></p>
              <figcaption>
                <p style="text-align: justify;">
                  <font color="061E61"><b>Figure 3:</b> Overview of the generation backbone. We build on top of 
                    the pre-trained Auffusion model for both TTA and VTA generation.</font>
                </p>
              </figcaption>
            </figure>

            <br>
            <div class="container">
              <h2 id="model-overview" style="text-align: left;">Table of Contents</h2>

              <ul style="list-style: outside none none !important;">
                <li><a href="#TTA">Text-to-Audio generation</a></li>
                <li><a href="#VTA">Video-to-Audio generation</a></li>
                <li><a href="#CONV">Conversation example</a></li>
                <li><a href="#LONG">Long audio example</a></li>
              </ul>
            </div>


            <br></br>
            <h2 id="TTA">
              <font color="000093">Text-to-Audio generation</font>
            </h2>
            <p><b>
              <font color="061E61">Single Caption:</font>
            </b></p>
            
            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">Young children are whistling and laughing</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Plastic clanking as a horse trots and a woman talks in the background</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A child laughs, a man speaks, and people laugh</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/1.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/1.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/1.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/1.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/2.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/2.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/2.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/2.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/3.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/3.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/3.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/3.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>
            


            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">People speaking with loud bangs followed by a slow motion rumble</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A man speaks followed by loud snoring</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A man whistling followed by a man yelling as plastic rustles and clanks in the background</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/4.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/4.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/4.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/4.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/5.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/5.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/5.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/5.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/6.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/6.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/6.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/6.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>


            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">A clock chimes and ticktocks</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/7.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/7.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/7.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/7.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>
            
            <p><b>
              <font color="061E61">Two Captions:</font>
            </b></p>

            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">Repetitive faint snoring followed by two men speaking</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A croaking frog with brief bird chirps followed by a man talking as birds chirp in the background followed by a loud popping</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Pigeons cooing and bird wings flapping as footsteps shuffle on paper followed by motor sounds with male speaking</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/8.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/8.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/8.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/8.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/9.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/9.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/9.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/9.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/10.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/10.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/10.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/10.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>


            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">A vehicle accelerating in the distance then driving by followed by multiple gunfire sounds, and men speak</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Whistling and then a female singing followed by woman speaking in a quiet environment</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Distant thumping with some lights wind followed by water splashing occurs while a person quacks to imitate a duck and an adult female laughs</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/11.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/11.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/11.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/11.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/12.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/12.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/12.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/12.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/13.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/13.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/13.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/13.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>

            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">A female voice and a duck quacking followed by wind noise on microphone with waves splashing in the background</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/14.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/14.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/14.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/14.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>

            <p><b>
              <font color="061E61">Complex Captions:</font>
            </b></p>
            <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">A man enters his car and drives away</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A couple decorates a room, hangs pictures, and admires their work</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">A woman packs a suitcase, locks her house, and walks to the bus station</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/15.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/15.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/15.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/15.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/16.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/16.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/16.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/16.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                  <td>
                    <div>AudioGen:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioGen/17.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>AudioLDM2:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/AudioLDM2/17.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Auffusion:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Auffusion/17.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                    <div>Our Method:</div>
                    <audio controls="controls">
                      <source src="assets/TTA/Ours/17.wav" autoplay />Your browser does not support the audio element.
                    </audio>
                  </td>
                </tr>
              </tbody>
            </table>

            <br></br>
            <h2 id="VTA">
              <font color="000093">Video-to-Audio generation</font>
            </h2>

            <table class="table" align="center" style="table-layout: fixed; word-break: break-word"> 
              <tbody>
                <tr>
                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/1.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/1.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>

                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/2.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/2.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>

                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/3.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/3.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>
                </tr>
              </tbody>
            </table>

            <table class="table" align="center" style="table-layout: fixed; word-break: break-word"> 
              <tbody>
                <tr>
                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/4.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/4.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>

                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/5.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/5.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>

                  <td scope="row" style="text-align: center;">
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/FoleyCrafter/6.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>FoleyCrafter</figcaption>
                    </figure>
                    <br>
                    <figure>
                      <video controls autoplay width="300">
                        <source src="assets/VTA/Ours/6.mp4" type="video/mp4">
                        Your browser does not support the video element.
                      </video>
                      <figcaption>Ours</figcaption>
                    </figure>
                  </td>

                </tr>
              </tbody>
            </table>
            
            <br><br>
            <h2 id="CONV">
              <font color="000093">Conversation example</font>
            </h2>

            <figure>
              <p align="center"><img src="assets/conversation_example.png" width="100%" class="center" /></p>
              <figcaption>
                <p style="text-align: justify;">
                  <font color="061E61">We provide the audio outputs for the conversation shown in <b>Figure 4</b>.</font>
                </p>
              </figcaption>
            </figure>

            <table class="table" align="center" style="table-layout: fixed;word-break:break-word">
              <thead>
                <tr>
                  <td scope="col" width="33%">
                    <font color="061E61">A man enters his car and drives away.</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Add "a man talks"</font>
                  </td>
                  <td scope="col" width="33%">
                    <font color="061E61">Edit "driving away" by "playing loud music"</font>
                  </td>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td scope="row"><audio controls="controls">
                      <source src="assets/TTA/conversation_demo/1.wav" type="audio/wav" />Your browser does not support the audio element.
                    </audio></td>
                  <td><audio controls="controls">
                      <source src="assets/TTA/conversation_demo/2.wav" type="audio/wav" />Your browser does not support the audio element.
                    </audio></td>
                  <td><audio controls="controls">
                      <source src="assets/TTA/conversation_demo/3.wav" type="audio/wav" />Your browser does not support the audio element.
                    </audio></td>
                </tr>
              </tbody>
            </table>

            <table class="table" align="center" style="table-layout: fixed;word-break:break-word">
            <thead>
              <tr>
                <td scope="col" width="33%">
                  <font color="061E61">Combination, with reasoning to fill in the missing part</font>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td scope="row"><audio controls="controls">
                    <source src="assets/TTA/conversation_demo/4.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio></td>
              </tr>
            </tbody>
          </table>

          <br><br>
          <h2 id="LONG">
            <font color="000093">Long audio example</font>
          </h2>
          <p style="text-align: justify;">
            <font color="061E61">We provide the audio outputs for the long-audio examples shown in <b>Figure 5</b>.</font>
          </p>

          <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
            <thead>
              <tr>
                <td scope="col" width="33%">
                  <font color="061E61">A river stream of water flowing followed by typing on a computer keyboard</font>
                </td>
                <td scope="col" width="33%">
                  <font color="061E61">A vehicle engine revving then accelerating at a high rate as a metal surface is whipped followed by tires skidding followed by a door shutting and a female speaking</font>
                </td>
                <td scope="col" width="33%">
                  <font color="061E61">A woman delivering a speech followed by a male speech and static</font>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>
                  <div>Auffusion:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Auffusion_20s/1.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                  <div>Our Method:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Ours_20s/1.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                </td>
                <td>
                  <div>Auffusion:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Auffusion_20s/2.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                  <div>Our Method:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Ours_20s/2.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                </td>
                <td>
                  <div>Auffusion:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Auffusion_20s/3.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                  <div>Our Method:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Ours_20s/3.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                </td>
              </tr>
            </tbody>
          </table>

          <table class="table" align="center" style="table-layout: fixed; word-break: break-word">
            <thead>
              <tr>
                <td scope="col" width="33%">
                  <font color="061E61">Continuous white noise followed by a vehicle driving as a man and woman are talking and laughing</font>
                </td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>
                  <div>Auffusion:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Auffusion_20s/4.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                  <div>Our Method:</div>
                  <audio controls="controls">
                    <source src="assets/TTA/Ours_20s/4.wav" type="audio/wav" />Your browser does not support the audio element.
                  </audio>
                </td>
              </tr>
            </tbody>
          </table>

          </section>
        </article>
      </main>

    </div>

    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
      ga('create', 'UA-139981676-1', 'auto');
      ga('send', 'pageview');
    </script>

    <script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/highlight.min.js"></script>
    <script>hljs.initHighlightingOnLoad();</script>



    <script type="text/x-mathjax-config">
     MathJax.Hub.Config({
         HTML: ["input/TeX","output/HTML-CSS"],
         TeX: {
                Macros: {
                         bm: ["\\boldsymbol{#1}", 1],
                         argmax: ["\\mathop{\\rm arg\\,max}\\limits"],
                         argmin: ["\\mathop{\\rm arg\\,min}\\limits"]},
                extensions: ["AMSmath.js","AMSsymbols.js"],
                equationNumbers: { autoNumber: "AMS" } },
         extensions: ["tex2jax.js"],
         jax: ["input/TeX","output/HTML-CSS"],
         tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
                    displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
                    processEscapes: true },
         "HTML-CSS": { availableFonts: ["TeX"],
                       linebreaks: { automatic: true } }
     });
 </script>

    <script type="text/x-mathjax-config">
     MathJax.Hub.Config({
       tex2jax: {
         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
       }
     });
 </script>

    <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML">
    </script>




</body>

</html>