<div align="center">

## MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation

> Official code for **MTVCraft**, a novel framework for general and high-quality character image animation using raw 3D motion sequences.

</div>

## 🔍 Abstract

Character image animation has rapidly advanced with the rise of digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 4D information for open-world animation. To address this, we propose MTVCraft (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for character image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose images and the character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for character image animation in the complex 4D world. We implement MTVCraft on both CogVideoX-5B (small scale) and Wan-2.1-14B (large scale), demonstrating that our framework is easily scalable and can be applied to models of varying sizes. Experiments on the TikTok and Fashion benchmarks demonstrate our state-of-the-art performance. Moreover, powered by robust motion tokens, MTVCraft showcases unparalleled zero-shot generalization. It can animate arbitrary characters in both single and multiple settings, in full-body and half-body forms, and even non-human objects across diverse styles and scenarios. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided video generation.

---

## 🛠️ Installation

We recommend using a clean Python environment (Python 3.10+).

```bash
# Create virtual environment
conda create -n mtvcraft python=3.11
conda activate mtvcraft

# Install dependencies
pip install -r requirements.txt
```
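
If you are unsure whether the active environment meets the recommended Python version, a small check like the one below can help. It is only a sketch: the `(3, 10)` minimum comes from the recommendation above, and the helper name `python_is_supported` is hypothetical, not part of this repository.

```python
import sys

# Minimum version recommended in this README (Python 3.10+).
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the recommended minimum.

    `python_is_supported` is a hypothetical helper, not part of the repository.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


# Report the result for the interpreter that is currently running.
print("Python", sys.version.split()[0], "supported:", python_is_supported())
```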

---

## 🚀 Usage

To animate a character image with a given 3D motion sequence, first prepare SMPL motion-video pairs by extracting SMPL motion sequences from your own driving videos:

```bash
python process_nlf.py "your_video_directory"
```

This will generate a motion-video `.pkl` file under `"your_video_directory"`.
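
The internal layout of the generated `.pkl` file is not documented here, so it can be useful to inspect it before running inference. The sketch below makes no assumption about the keys it contains: it loads the file with the standard `pickle` module and summarizes whatever nested structure it finds. The helper names `describe` and `describe_pickle` are hypothetical. Note that unpickling can execute arbitrary code, so only load files you produced yourself or otherwise trust, and run it in the `mtvcraft` environment so that any pickled array types can be imported.

```python
import pickle
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Return lines summarizing the type (and shape, if any) of a nested object."""
    indent = "  " * depth
    shape = getattr(obj, "shape", None)  # arrays and tensors expose .shape
    if shape is not None:
        return [f"{indent}{type(obj).__name__} shape={tuple(shape)}"]
    if isinstance(obj, dict):
        lines = [f"{indent}dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{indent}  [{key!r}]")
                lines.extend(describe(value, depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{indent}{type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            # Only the first element is summarized, as a representative sample.
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(obj).__name__}"]


def describe_pickle(path):
    """Load a pickle file (trusted input only) and summarize its structure."""
    with Path(path).open("rb") as f:
        data = pickle.load(f)
    return describe(data)
```

For example, `print("\n".join(describe_pickle("data/sampled_data.pkl")))` would print a summary of the sample file used in the inference commands.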

---

#### ▶️ Inference of MV-DiT-7B
```bash
python infer_7b.py \
    --ref_image_path "ref_images/human.png" \
    --motion_data_path "data/sampled_data.pkl" \
    --output_dir "inference_output"
```

#### ▶️ Inference of MV-DiT-17B (with text control)
```bash
python infer_17b.py \
    --ref_image_path "ref_images/woman.png" \
    --motion_data_path "data/sampled_data.pkl" \
    --output_dir "inference_output" \
    --prompt "The woman is dancing on the beach, waves, sunset."
```

**Arguments:**

- `--ref_image_path`: Path to the reference character image.
- `--motion_data_path`: Path to the SMPL motion sequence (.pkl format).
- `--output_dir`: Directory to save the generated video.
- `--prompt` (optional): Text prompt describing the scene or style, as used in the MV-DiT-17B example above.
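
To animate several reference images with the same motion file, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the script names and flags shown in this README. The helpers `build_infer_command` and `run_batch` are hypothetical, and writing each result to its own subdirectory is an assumption made here to avoid possible filename collisions, not documented behavior of the scripts. Whether `infer_7b.py` accepts `--prompt` is not stated, so the prompt is only added when one is passed explicitly.

```python
import shlex
import subprocess
from pathlib import Path


def build_infer_command(script, ref_image_path, motion_data_path, output_dir, prompt=None):
    """Build the argument list for one inference run (hypothetical helper)."""
    cmd = [
        "python", script,
        "--ref_image_path", str(ref_image_path),
        "--motion_data_path", str(motion_data_path),
        "--output_dir", str(output_dir),
    ]
    if prompt is not None:
        # --prompt is shown in this README only for infer_17b.py.
        cmd += ["--prompt", prompt]
    return cmd


def run_batch(script, ref_images, motion_data_path, output_root, prompt=None, dry_run=True):
    """Run (or, with dry_run=True, just print) one command per reference image."""
    commands = []
    for ref in ref_images:
        # Assumption: one subdirectory per image, named after the image file stem.
        out_dir = Path(output_root) / Path(ref).stem
        commands.append(build_infer_command(script, ref, motion_data_path, out_dir, prompt))
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Calling `run_batch("infer_7b.py", ["ref_images/human.png"], "data/sampled_data.pkl", "inference_output")` prints the command without executing it; pass `dry_run=False` to launch the runs one after another.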

---

### 🏋️‍♂️ Training 4DMoT

To train the 4DMoT tokenizer on your own dataset:

```bash
accelerate launch train_vqvae.py
```

### 🎬 Visualizations

See the `generated_samples` folder for our animated videos. We provide 30 cases in this folder.