# 🎥 Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy


<p align="center">
  <img src="assets/teaser.png" alt="teaser" width="680">
</p>

<p align="center">
  <a href='https://directanimator.github.io//'><img src='https://img.shields.io/badge/Project-Page-Green'></a> 
  <a href="#"><img src="https://img.shields.io/badge/python-3.10%2B-blue" alt="Python 3.10+"></a>
  <a href="#"><img src="https://img.shields.io/badge/pytorch-required-orange" alt="PyTorch required"></a>
  <a href="#"><img src="https://img.shields.io/badge/cuda-GPU_required-green" alt="CUDA GPU required"></a>
  <!-- <a href="#"><img src="https://img.shields.io/badge/license-TBD-lightgrey" alt="License"></a> -->
</p>

> **TL;DR.** DirectAnimator is a diffusion transformer (DiT) framework for human image animation that avoids explicit skeletons. It conditions denoising on a **driving cue triplet** (Pose, Face, and Location), fused in a **CueFusion DiT (CF-DiT)** block for stable and controllable motion transfer. Robust cross-identity animation is achieved via the **Same2X training strategy**: Same-ID pretraining followed by Cross-ID training with pseudo driving cues and a **Same2X Alignment Loss** that aligns cross-ID features to the same-ID model.

<p align="center">
  <img src="assets/pipeline.png" alt="pipeline" width="760">
</p>

---

## 📑 Table of Contents

* [Highlights](#highlights)
* [Environment](#environment)
* [Project Layout](#project-layout)
* [Pretrained Models](#pretrained-models)
* [Data Preparation](#data-preparation)
* [Training](#training)
* [Inference](#inference)
* [Checkpoints](#checkpoints)
* [Acknowledgments](#acknowledgments)

---

## 🌟 Highlights

* **Driving cue triplet (Pose / Face / Location)**

    We propose DirectAnimator, a novel framework that directly animates reference images using driving videos. By introducing a structured driving cue triplet and integrating it via a CueFusion DiT block, our method eliminates the need for explicit pose estimation.
* **Same2X training strategy**
    
    We design the Same2X training strategy, a new learning paradigm that regularizes cross-ID supervision by leveraging internal representations learned from same-ID training, thereby enhancing generalization across identities.

---

## 🛠️ Environment

* Python ≥ 3.10, a CUDA-capable GPU, and PyTorch. Install the Python dependencies with:

```bash
pip install -r requirements.txt
```
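
To confirm the requirements above before training, a small check like the following can help. It is a minimal sketch: the helper names `python_ok` and `environment_problems` are hypothetical and not part of this repository, and the only PyTorch call used is `torch.cuda.is_available()`.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirement above


def python_ok(version=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def environment_problems():
    """Collect human-readable problems; an empty list means the basics look fine."""
    problems = []
    if not python_ok():
        found = sys.version.split()[0]
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch  # imported only when the package is present

        if not torch.cuda.is_available():
            problems.append("PyTorch cannot see a CUDA GPU")
    return problems
```

Calling `environment_problems()` after installation returns an empty list when Python, PyTorch, and a visible GPU are all in place.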

---

## 🗂️ Project Layout

```bash
DirectAnimator/
├── examples/
│   └── cogvideox_fun/
│       └── predict_HIA.py        # Inference script
├── scripts/
│   └── cogvideox_fun/
│       ├── train_stage1.sh       # Training phase 1 (Same-ID)
│       └── train_stage2.sh       # Training phase 2 (Cross-ID)
├── models/                       # Checkpoints
│   └── Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP   # Pretrained base model
├── datasets/
│   └── scripts/                  # Data utilities
└── README.md
```

---

## 📦 Pretrained Models

DirectAnimator builds on the pretrained CogVideoX-Fun-V1.5-5b-InP model, which can be downloaded from either of the following links:

- [CogVideoX-Fun-V1.5-5b-InP (huggingface)](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP)
- [CogVideoX-Fun-V1.5-5b-InP (modelscope)](https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.5-5b-InP)

Place the downloaded models in the `models/Diffusion_Transformer/` directory as shown in the project layout.
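
If you prefer to script the Hugging Face download, the sketch below uses `snapshot_download` from the `huggingface_hub` package (an extra dependency, assumed to be installed separately). The helper names `target_dir` and `download_base_model` are illustrative and not part of this repository.

```python
from pathlib import Path

REPO_ID = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"


def target_dir(root="models"):
    """Directory expected by the project layout for the base model."""
    return Path(root) / "Diffusion_Transformer" / REPO_ID.split("/")[-1]


def download_base_model(root="models"):
    """Download the base model into the layout's model directory."""
    # Imported lazily so that computing the path does not need the package.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    dest = target_dir(root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(dest))
    return dest
```

Running `download_base_model()` from the repository root should leave the weights under `models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP`.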

---

## 📂 Data Preparation

DirectAnimator trains in two stages: Same-ID (reference and driving cues come from the same clip) and Cross-ID (reference and driving identities differ). Cross-ID training requires pseudo driving cues (Pose / Face / Location), which must be constructed beforehand.

**Recommended Folder Structure:**

```bash
data/
├── same_id/                                  # Stage 1: same-ID pairs from the same clip
│   ├── driving_pose/
│   │   ├── 000001.mp4
│   │   ├── 000002.mp4
│   │   └── ...
│   ├── driving_mask/
│   │   ├── 000001.mp4
│   │   ├── 000002.mp4
│   │   └── ...
│   ├── driving_face/
│   │   ├── 000001.mp4
│   │   ├── 000002.mp4
│   │   └── ...
│   ├── driving_face_mask/
│   │   ├── 000001.mp4
│   │   ├── 000002.mp4
│   │   └── ...
│   ├── reference/                        # reference frames from the clip
│   │   ├── ref_000001.jpg
│   │   ├── ref_000002.jpg
│   │   └── ...
│   └── stage1_metadata.json
│
├── cross_id/                                 # Stage 2: cross-ID with pseudo cues
│   ├── driving_pose/
│   │   ├── pseudo_000001.mp4
│   │   ├── pseudo_000002.mp4
│   │   └── ...
│   ├── driving_mask/
│   │   ├── pseudo_000001.mp4
│   │   ├── pseudo_000002.mp4
│   │   └── ...
│   ├── driving_face/
│   │   ├── pseudo_000001.mp4
│   │   ├── pseudo_000002.mp4
│   │   └── ...
│   ├── driving_face_mask/
│   │   ├── pseudo_000001.mp4
│   │   ├── pseudo_000002.mp4
│   │   └── ...
│   ├── reference/                        # reference images paired with the pseudo cues
│   │   ├── ref_000001.jpg
│   │   ├── ref_000002.jpg
│   │   └── ...
│   └── stage2_metadata.json
```

**Metadata Structure**

Each stage's metadata file (`stage1_metadata.json`, `stage2_metadata.json`) organizes the data for training and inference. Below is an example of its structure:

```json
{
    "0": {
        "file_path": "your_path",
        "text": "a young woman with long brown hair, ...",
        "type": "video",
        "driving_path": "your_path",
        "driving_mask_path": "your_path",
        "driving_face_path": "your_path",
        "driving_face_mask_path": "your_path",
        "reference_file_path": "your_path"
    },
    "1": {"..."},
    "..."
}
```

**Key Fields:**

- `file_path`: Path to the target video file.
- `text`: Descriptive text of the scene or subject in the video.
- `type`: Type of the file (e.g., "video").
- `driving_path`: Path to the driving video used for animation.
- `driving_mask_path`: Path to the driving mask video.
- `driving_face_path`: Path to the driving face video.
- `driving_face_mask_path`: Path to the driving face mask video.
- `reference_file_path`: Path to the reference image used for animation.

This structure ensures that all necessary components for training and inference are properly linked and described.
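
Before launching training, it can be worth checking that every entry carries the fields listed above and that the referenced files exist. The following is a minimal sketch, not a utility shipped with the repository: the function names are hypothetical, and it assumes relative paths in the metadata are resolved against a `root` directory you pass in.

```python
import json
from pathlib import Path

# Fields listed under "Key Fields" above.
REQUIRED_KEYS = (
    "file_path", "text", "type", "driving_path", "driving_mask_path",
    "driving_face_path", "driving_face_mask_path", "reference_file_path",
)


def validate_metadata(metadata, root="."):
    """Return a list of problems found in a metadata dict (empty if none)."""
    problems = []
    root = Path(root)
    for key, entry in metadata.items():
        if not isinstance(entry, dict):
            problems.append(f"{key}: entry is not an object")
            continue
        for field in REQUIRED_KEYS:
            if field not in entry:
                problems.append(f"{key}: missing field '{field}'")
            elif field.endswith("_path") and not (root / entry[field]).exists():
                problems.append(f"{key}: {field} not found: {entry[field]}")
    return problems


def validate_metadata_file(json_path, root="."):
    """Load a metadata JSON file and validate every entry."""
    with open(json_path, encoding="utf-8") as f:
        return validate_metadata(json.load(f), root)
```

For example, `validate_metadata_file("data/same_id/stage1_metadata.json", root="data/same_id")` returns an empty list when all entries are complete.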

**Notes**

* **Same-ID stage**: train on `(reference, driving)` pairs extracted from the same clip.
* **Cross-ID stage**: for each reference identity, synthesize **0–3** pseudo cues (pose / face / location).
* Example dataset scale (paper): \~**4,000** internet clips plus TikTok videos **1–334** for Same-ID; **3,300** high-quality `[reference, pseudo driving cue]` pairs are retained for Cross-ID.
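
The folder layout above implies that each clip appears under the same file name in all four cue folders. Assuming that naming convention holds for your data, a quick consistency check might look like the sketch below (the helper `missing_cue_files` is hypothetical, not part of the repository).

```python
from pathlib import Path

# Cue folders from the recommended layout.
CUE_DIRS = ("driving_pose", "driving_mask", "driving_face", "driving_face_mask")


def missing_cue_files(stage_dir):
    """Map each cue folder to the clip names it lacks.

    A clip name counts as expected if it appears in at least one cue folder.
    """
    stage_dir = Path(stage_dir)
    names = {d: {p.name for p in (stage_dir / d).glob("*.mp4")} for d in CUE_DIRS}
    all_names = set().union(*names.values())
    return {d: sorted(all_names - found) for d, found in names.items() if all_names - found}
```

`missing_cue_files("data/cross_id")` returns an empty dict when the four folders are in sync, and otherwise lists the clips each folder is missing.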

---

## 🏋️ Training

### Phase 1: Same-ID

```bash
cd scripts/cogvideox_fun
sh train_stage1.sh
```

### Phase 2: Cross-ID

```bash
cd scripts/cogvideox_fun
sh train_stage2.sh
```

This phase adds the **Same2X Alignment Loss** (S2X Loss) to the denoising loss, aligning cross-ID features with those of the same-ID model.
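
Written out, the Phase 2 objective combines the two terms. The form below is only a sketch and may differ from the paper's exact definition: the squared-error distance, the weight $\lambda$, the layer set $\mathcal{S}$, the stop-gradient $\operatorname{sg}$, and the symbols $h^{\text{cross}}_{l}$ and $h^{\text{same}}_{l}$ (intermediate DiT features of the model being trained and of the Same-ID model, assumed frozen) are all assumptions made for illustration.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{denoise}} + \lambda \, \mathcal{L}_{\text{S2X}},
\qquad
\mathcal{L}_{\text{S2X}} = \frac{1}{|\mathcal{S}|} \sum_{l \in \mathcal{S}} \left\lVert h^{\text{cross}}_{l} - \operatorname{sg}\left(h^{\text{same}}_{l}\right) \right\rVert_2^2
$$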

**Default training recipe (paper):**

* Backbone: **CogVideoX-1.5**; text encoder and VAE frozen; DiT updated.
* Iterations: **10K** (Same-ID) → **30K** (Cross-ID); learning rate **2e-5**; **4× H20** GPUs; a bucket sampler handles variable input sizes.
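
The idea behind a bucket sampler is to group samples of similar shape so that every batch can be resized to a single resolution. The sketch below illustrates that idea only; it is not the sampler used by the training scripts, and the bucket sizes and function names are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical (width, height) buckets; the repository's actual buckets may differ.
BUCKETS = [(480, 720), (512, 512), (720, 480)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the sample's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def bucket_batches(sizes, batch_size, seed=0, drop_last=True):
    """Yield batches of sample indices so that every batch shares one bucket.

    `sizes` is a list of (width, height) pairs, one per sample.
    """
    grouped = defaultdict(list)
    for index, (w, h) in enumerate(sizes):
        grouped[nearest_bucket(w, h)].append(index)

    rng = random.Random(seed)
    batches = []
    for indices in grouped.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            if len(batch) == batch_size or not drop_last:
                batches.append(batch)
    rng.shuffle(batches)  # mix buckets across the epoch
    yield from batches
```

Each yielded batch can then be resized to its bucket's resolution before being stacked into a tensor; dropping incomplete batches keeps the batch size constant across steps.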

---

## 🎬 Inference

```bash
# Run from the directory that contains DirectAnimator/
python DirectAnimator/examples/cogvideox_fun/predict_HIA.py
```

---

## 💾 Checkpoints

By default, checkpoints are read from and saved to `models/`. Adjust paths in scripts/configs if needed.

---

## 🙏 Acknowledgments

This project builds on [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun), and uses [StableAnimator](https://francis-rings.github.io/StableAnimator) and [Face-Adapter](https://github.com/FaceAdapter/Face-Adapter) to construct pseudo driving cues for cross-ID training.
