<!doctype html>
<html lang="en">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>
	@import url('https://fonts.cdnfonts.com/css/chalkduster');
</style>
	<!-- Required meta tags -->
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

	<!-- Bootstrap CSS -->
	<link href="./html_pages/resources/bootstrap.min.css" rel="stylesheet">
	<link href="./html_pages/resources/stylesheet.css" rel="stylesheet">

	<title>Supplementary Website</title>
</head>
<!-- <style>
	body {
    	margin-left: 20px; /* Adjust the value as desired */
  	}
</style> -->
<!-- <body> -->

<body>

		<section class="jumbotron text-center pb-2">
			<div class="container">
				<h1 class="jumbotron-heading">IMAC: Implicit Motion-Audio Coupling <br> for Co-Speech Gesture Video Generation</h1>

				<h4 class="font-italic pt-2" style="font-weight: normal">ICLR 2025 [Submission ID: 6198]</h4>

			</div>
		</section>

		<div class="container">

			<div class="row pt-1 justify-content-sm-center">
				
				
				<a class="sm-1 mx-1 btn btn-primary text-center" href="./index.html" role="button">Videos for Figure 5 and Figure 6</a>

			</div>			
			<div class="row pt-3 text-center">
				  
				<!-- <h6 class="col-sm-12">Action Module</h6> -->


			</div>		
			<div class="row justify-content-sm-center">
				  

				<a class="sm-1 mx-1 btn btn-primary" href="./html_pages/comparisons.html" role="button">More Videos for Comparisons</a>
				<a class="sm-1 mx-1 btn btn-primary" href="./html_pages/ablations.html" role="button">More Videos for Ablation Studies</a>
				<a class="sm-1 mx-1 btn btn-primary" href="./html_pages/indentities.html" role="button">Videos for Other Identities</a>

			</div>		




			<hr class="mt-5">


			
			<h2 class="pt-4 text-center">Videos for Figures 5 and 6</h2>

			<p class="lead">This page presents the videos corresponding to Figures 5 and 6. In the video for Figure 5, our method produces high-quality results without blurry hands or distorted fingers and keeps the background consistent. In contrast, S2G and MYA produce inconsistent backgrounds, blurry hands, and distorted fingers. MYA also often memorizes appearance features during training, so its generated videos replicate the memorized appearance instead of using the reference image and are therefore inconsistent with it. More comparison videos are provided on the "More Videos for Comparisons" page.
				<br><br>
				In the video for Figure 6, the ablated variants of our model suffer from low visual quality, backgrounds that are inconsistent with the reference image, distorted hands, extra fingers, and hands that appear detached from the body. Their motion is also inconsistent, with severe shaking. Additional videos for the ablation studies are available on the "More Videos for Ablation Studies" page.
				<br><br>
				The videos autoplay muted; please unmute each one to hear the input speech.
			</p>
			<!-- <p><strong>Note:</strong> The background of each actor in the training sequence is not constant (it moves because the camera is not still across the entire training video). This leads to a slight change in background in the generated results.</p> -->

			<!-- Our results in table from plug-and-play -->
			<table width="1200" style="margin-left: -55px;" align="center">
				<tbody>
					<!-- <th>A</th>
					<th>B</th>
					<th>C</th> -->
						<tr>
	

							<th style="text-align: center; padding: 10px;">
								<div style="width: 1000px; margin: auto; border-bottom: 1px solid #000;">
									<div class="row justify-content-sm-center" style="display: flex; width: 1000px; height: 40px; margin-left: 0px">
										<div style="flex: 0 0 25%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">GT</p>
										</div>
										<div style="flex: 0 0 25%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">S2G</p>
										</div>
										<div style="flex: 0 0 25%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">MYA</p>
										</div>
										<div style="flex: 0 0 25%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">Ours</p>
										</div>
									</div>
									<video autoplay="autoplay" controls="controls" loop="loop" muted="muted" style="width:1000px; border: 5px solid #000;"><source src="fig5-sota.mp4" type="video/mp4"></video>
									<p style="font-family: Chalkduster; font-size: 20px; margin-top: 5px;">Figure 5</p>
								</div>
							</th>
						</tr>
						
						<tr> <td><br></td> </tr>
								
						<tr>
							<th style="text-align: center; padding: 10px;">
								<div style="width: 1000px; margin: auto; border-bottom: 1px solid #000;">
									<div class="row justify-content-sm-center" style="display: flex; width: 1000px; height: 40px; margin-left: 0px">
										<div style="flex: 0 0 20%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">W/o Ref</p>
										</div>
										<div style="flex: 0 0 20%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">W/o Motion</p>
										</div>
										<div style="flex: 0 0 20%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">W/o First Stage</p>
										</div>
										<div style="flex: 0 0 20%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">W/o Slow-Fast</p>
										</div>
										<div style="flex: 0 0 20%; text-align: center;">
											<p style="font-family: Chalkduster; font-size: 16px; margin: 0;">Ours</p>
										</div>
									</div>
									<video autoplay="autoplay" controls="controls" loop="loop" muted="muted" style="width:1000px; border: 5px solid #000;"><source src="fig6-ablation.mp4" type="video/mp4"></video>
									<p style="font-family: Chalkduster; font-size: 20px; margin-top: 5px;">Figure 6</p>
								</div>
							</th>
							</tr>
						
						
						<tr> <td><br></td> </tr>


		
						
				</tbody>
			</table>
			<!------------------ END SECTION ------------------>


		</div>

		<!-- Optional JavaScript -->
		<!-- jQuery first, then Popper.js, then Bootstrap JS -->
		<script src="./html_pages/resources/jquery-3.4.1.slim.min.js"></script>
		<script src="./html_pages/resources/popper.min.js"></script>
		<script src="./html_pages/resources/bootstrap.min.js"></script>

	</body>


</html>

	</html>
