<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, shrink-to-fit=no">
    <title>AutoLink</title>
    <link rel="stylesheet" href="css/general.css">
    <link rel="stylesheet" href="css/citation.css">
    <link rel="stylesheet" href="css/title.css">
    <meta name="google-site-verification" content="QZBN3L69Q-c1oJGtwBx3eOj6ugnkjS7Q2tZUGm-VTkA"/>
</head>

<body>
    <div class="header" id="home" style="padding-bottom: 90px;"></div>

    <section class="title">
        AutoLink: Self-supervised Learning of Human Skeletons and Object Outlines by Linking Keypoints
    </section>
    <section class="title">
        Paper ID: 1177
    </section>

    <div class="container">
        Please open this webpage in one of the following browsers, which support embedded videos:
        <ul>
            <li>Google Chrome >= 65.0.3325</li>
            <li>Mozilla Firefox >= 59.0.1</li>
            <li>Apple Safari >= 11.0.3</li>
        </ul>
    </div>

    <div class="container">
        <p>
            We visualize the learned graph representation on videos of faces and humans. Although the graph is learned only from a collection of single images, it remains stable and consistent on videos. Note the subtle eye and mouth shapes on faces and the fine leg motion on humans.
        </p>
        <video src="asset/detection.mp4" type="video/mp4" controls muted autoplay loop>
            Your browser does not support the video tag.
        </video>
    </div>

    <!-- <div class="header" id="abstract">Abstract</div>
    <div class="line"></div>

    <div class="container">
        <p>
            Structured representations such as keypoints are widely used in pose transfer, conditional image generation, animation, and 3D reconstruction. However, their supervised learning requires expensive annotation for each target domain. We propose a self-supervised method that learns to disentangle object structure from appearance with a graph of 2D keypoints linked by straight edges. Both the keypoint locations and their pairwise edge weights are learned, given only a collection of images depicting the same object class. The graph is interpretable; for example, AutoLink recovers the human skeleton topology when applied to images showing people. Our key ingredients are i) an encoder that predicts keypoint locations in an input image, ii) a shared graph as a latent variable that links the same pairs of keypoints in every image, iii) an intermediate edge map that combines the latent graph edge weights and keypoint locations in a soft, differentiable manner, and iv) an inpainting objective on randomly masked images. Although simpler, AutoLink outperforms existing self-supervised methods on the established keypoint and pose estimation benchmarks and paves the way for structure-conditioned generative models on more diverse datasets.
        </p>
    </div>

    <h1 class="header" id="results">Detected Keypoints and Visualized Graph Representation</h1>
    <div class="line"></div>

    <div class="container">
        <img src="asset/teaser_wo_app.png">
    </div> -->

    <h1 class="header" id="results">Application: Pose Transfer</h1>
    <div class="line"></div>

    <div class="container">
        <p>
            The learned graph representation can be used to train a pose transfer network, which we apply to videos here. Note that the translation models are also trained on single images rather than videos (image translation, not video translation). Nevertheless, the output is stable even when the pose change is large (first row), and subtle details such as mouth motion (third row) and eye blinking (fourth row) are also captured.
        </p>
        <!-- <ul>
            <li>First row: talking facing right; large pose changes</li>
            <li>Second row: talking facing left</li>
            <li>Third row: mouth shape is captured very well</li>
            <li>Fourth row: the subtle eye blinking is also captured well</li>
        </ul> -->
        <video src="asset/vox_transfer.mp4" type="video/mp4" controls muted autoplay loop>
            Your browser does not support the video tag.
        </video>
    </div>

    <h1 class="header" id="results">Application: Conditional GAN</h1>
    <div class="line"></div>
    <div class="container">
        <p>
            The learned graph representation can be used to train a conditional GAN. In this experiment, we trained a single detector and a single GAN on AFHQ, with multiple animal categories trained at the same time. This demonstrates robustness to shape variations and the capability to learn a shared animal head model.
        </p>
        <img src="asset/con_gan.png">
    </div>

</body>

</html>
