# **AnyExpress: One Adapter Enabling Highly Flexible Audio-Driven Portrait Animation**

---

## Introduction

Portrait animation, and audio-driven portrait animation in particular, requires flexibility in facial expressions, head movement, and dynamic context. Existing diffusion-based methods, however, rely heavily on a ReferenceNet design. This increases training complexity, makes them incompatible with other custom base models and adapters, and restricts face position, view changes, and animated context generation. To address these challenges, we propose ***AnyExpress***, a lightweight, modular framework that eliminates the need for ReferenceNet and reduces the number of trainable parameters by a factor of **7**. By training a single plug-and-play *audio-motion adapter*, it enables freeform, expressive audio-driven portrait animation with any face pose and any animated context, while also supporting text-driven modifications.

Character attributes can be controlled in two ways. If a specific identity is required, it can be assigned through ID controls (*e.g.*, IP-Adapter-Face). Alternatively, the character's attributes can be specified through textual descriptions.

Through comprehensive qualitative and quantitative analyses, ***AnyExpress*** demonstrates unprecedented freedom in generating videos with dynamic backgrounds, lower training cost, and seamless integration with evolving custom models and control adapters, providing a flexible solution for diverse generation needs. The demo is available at https://anyexpress-alpha.github.io/Any, and we will release our code to encourage further improvement.

<img width="1000" alt="Teaser" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/pics/Teaser_v2.jpg">

<img width="1000" alt="global_framework" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/pics/pipeline_modular_v2.jpg">


## Installation

```shell
pip install -r requirements.txt
```

Configure `accelerate`:
```shell
pip install accelerate
accelerate config default
vim ~/.cache/huggingface/accelerate/default_config.yaml
```
Then set `num_processes` to the number of processes to launch on your machine (typically one per GPU):
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
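
If you prefer to script this edit instead of opening the file in an editor, the following is a minimal sketch in Python. The helper name `set_num_processes` is hypothetical and is not part of this repository or of `accelerate`; it only rewrites the `num_processes` line of the config text and leaves every other line untouched.

```python
import re


def set_num_processes(config_text: str, num_processes: int) -> str:
    """Return config_text with the top-level `num_processes` value replaced.

    Hypothetical helper for illustration only.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # Match only an unindented `num_processes: <int>` line.
    pattern = re.compile(r"^num_processes:[ \t]*\d+[ \t]*$", flags=re.MULTILINE)
    updated, count = pattern.subn(f"num_processes: {num_processes}", config_text)
    if count != 1:
        raise ValueError("expected exactly one top-level num_processes entry")
    return updated


# A fragment of the config shown above, switched from 6 processes to 2.
example = "mixed_precision: fp16\nnum_machines: 1\nnum_processes: 6\nrdzv_backend: static\n"
print(set_num_processes(example, 2))
```

To apply it, read `~/.cache/huggingface/accelerate/default_config.yaml`, pass its contents through the helper, and write the result back.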


## How to Use

### Run the demo (step1, _optional_)

If you have a target talking video, use the script below to extract its audio and face V-kps sequence. You can also skip this step and run the script in step 2 directly to try the provided example.

```shell
python scripts/extract_kps_sequence_and_audio.py \
    --video_path "./samples/short_case/10/gt.mp4" \
    --kps_sequence_save_path "./test_samples/short_case/10/kps.pth" \
    --audio_save_path "./test_samples/short_case/10/aud.mp3"
```
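
To process several videos, the same command can be assembled programmatically. The sketch below is an assumption-laden illustration, not part of the repository: the helper names `build_extract_command` and `run_extraction` are hypothetical, and the output file names `kps.pth` and `aud.mp3` simply follow the example above. Only the three flags shown in the command above are used.

```python
import posixpath
import shlex
import subprocess
from typing import List


def build_extract_command(video_path: str, output_dir: str) -> List[str]:
    """Build the argument list for the extraction script (hypothetical helper)."""
    return [
        "python", "scripts/extract_kps_sequence_and_audio.py",
        "--video_path", video_path,
        "--kps_sequence_save_path", posixpath.join(output_dir, "kps.pth"),
        "--audio_save_path", posixpath.join(output_dir, "aud.mp3"),
    ]


def run_extraction(video_path: str, output_dir: str) -> None:
    """Run the extraction; raises CalledProcessError if the script fails."""
    subprocess.run(build_extract_command(video_path, output_dir), check=True)


# Print the command for the example case without executing it.
cmd = build_extract_command(
    "./samples/short_case/10/gt.mp4", "./test_samples/short_case/10"
)
print(shlex.join(cmd))
```

Calling `run_extraction` in a loop over your own video paths runs the script once per video; it must be invoked from the repository root so that the relative script path resolves.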

We recommend cropping a clear, square face image, as in the example below, with a resolution of at least 512x512. The boxes from green to red in the image below mark the recommended cropping range.

<img width="500" alt="crop_example" src="https://github.com/tencent-ailab/V-Express/assets/19601425/7c1d8df4-7267-46c7-a848-5130476467ef">
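
The cropping recommendation can be sketched as a small geometric helper. This is a minimal illustration under stated assumptions: the function names are hypothetical, the face bounding box is assumed to come from a face detector of your choice, and the margin factor `scale=1.6` is an arbitrary illustrative value, not one taken from this project. Adjust it until the crop falls within the recommended range shown in the image.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def square_crop_box(face_box: Box, image_size: Tuple[int, int],
                    scale: float = 1.6) -> Box:
    """Expand a face box into a square crop that stays inside the image.

    `scale` is an assumed margin factor, chosen only for illustration.
    """
    left, top, right, bottom = face_box
    width, height = image_size
    # Side length: larger face dimension times the margin factor,
    # capped by the shorter image side so the square always fits.
    side = int(round(max(right - left, bottom - top) * scale))
    side = min(side, width, height)
    # Center the square on the face, then shift it back inside the image.
    cx = (left + right) // 2
    cy = (top + bottom) // 2
    x0 = min(max(cx - side // 2, 0), width - side)
    y0 = min(max(cy - side // 2, 0), height - side)
    return (x0, y0, x0 + side, y0 + side)


def meets_min_resolution(box: Box, min_side: int = 512) -> bool:
    """Check that the crop is at least min_side x min_side pixels."""
    return (box[2] - box[0]) >= min_side and (box[3] - box[1]) >= min_side


crop = square_crop_box((400, 300, 800, 700), (1920, 1080))
print(crop, meets_min_resolution(crop))
```

If `meets_min_resolution` returns `False`, the source image is likely too small for the face region, and a higher-resolution image is preferable to upscaling the crop.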

### Training

1. Stage 1 training
Stage 1 training is configured with `configs/train/stage_2-motion_audio.yaml` and can be launched via:
```shell
accelerate launch train.py --config ./configs/train/stage_2-motion_audio.yaml
```

2. Stage 2 training
Stage 2 training is configured with `configs/train/stage_2-all.yaml` and can be launched via:
```shell
accelerate launch train.py --config ./configs/train/stage_2-all.yaml
```
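
To run both stages back to back, a small driver can chain the two launch commands above. The sketch below is illustrative only: `training_commands` and `run_training` are hypothetical names, and it assumes the stage 2 config already points at the checkpoint produced by stage 1, which you should verify in the YAML files before relying on it.

```python
import subprocess
from typing import List, Sequence

# Config paths copied from the two launch commands above, in execution order.
STAGE_CONFIGS = (
    "./configs/train/stage_2-motion_audio.yaml",
    "./configs/train/stage_2-all.yaml",
)


def training_commands(configs: Sequence[str] = STAGE_CONFIGS) -> List[List[str]]:
    """Return one `accelerate launch` argument list per config."""
    return [["accelerate", "launch", "train.py", "--config", c] for c in configs]


def run_training(configs: Sequence[str] = STAGE_CONFIGS) -> None:
    """Run the stages sequentially, stopping at the first failure."""
    for cmd in training_commands(configs):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later stage never starts after an earlier one fails.
        subprocess.run(cmd, check=True)


for cmd in training_commands():
    print(" ".join(cmd))
```

Calling `run_training()` from the repository root executes the stages in order.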


### Run the demo (step2, _core_)


```shell
sh scripts/bash/infer/infer-stage2-t2i-all-meanvar-face.sh
```

## Experimental Results

### Any Face Pose
<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyFacePose/same_kps-0A8lj.mp4" style="width: 100%; height: auto;"></video>
    </td>
</tr>

<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyFacePose/same_kps-W5Md.mp4" style="width: 100%; height: auto;"></video>
    </td>
</tr>


<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyFacePose/same_kps-qbAvR.mp4" style="width: 100%; height: auto;"></video>
    </td>
</tr>

### Any Animated Background

<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/animated_bg/Adriana/6eH2B-breaking_waves.mp4" style="width: 23%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/animated_bg/Adriana/joker-dancing_fire.mp4" style="width: 23%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/animated_bg/Adriana/kara-bllowing_sails.mp4" style="width: 23%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/animated_bg/Adriana/mola-erupting_volcano.mp4" style="width: 23%; height: auto;"></video>
    </td>
</tr>

### Any Personalized T2I Models

Left to right: Realistic V6, chunckycat, lustermix v15, and toonyou beta6.
<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/personalize/tys.mp4" style="width: 100%; height: auto;"></video>
    </td>
</tr>

### Any Text Control

<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/mechgirl-talk_emotion-swirling_mist.mp4" style="width: 30%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/biden-talk_emo-sparkling_fireworks.mp4" style="width: 30%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/bride-tys-people.mp4" style="width: 30%; height: auto;"></video>
    </td>
</tr>

<tr>
    <td colspan="4" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/kara-10-dancing_fire.mp4" style="width: 30%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/tys-tys-leaves.mp4" style="width: 30%; height: auto;"></video>
    </td>
    <td colspan="3" style="text-align:center;">
      <video muted="" autoplay="autoplay" loop="loop" src="https://anyexpress.oss-cn-beijing.aliyuncs.com/AnyText/tys-tys-volcano.mp4" style="width: 30%; height: auto;"></video>
    </td>
</tr>
