# Data Preparation

1. **Place datasets**

   Put your datasets under the `data/` directory, for example:
   - `data/MEAD/`
   - `data/HDTF/`
   - `data/vfhq/`
   - `data/RAVDESS/`
   - etc.

2. **Create annotation file**

   Prepare an annotation JSON file describing your dataset (paths, labels, splits, etc.), e.g.:
   - `data/annotations/demo.json`

   Follow the same structure as the provided demo annotation files in `data/annotations/`.
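
As an illustration only, the sketch below collects video files under a dataset directory and writes them to a JSON file. The field names (`video_path`, `label`, `split`), the `*.mp4` pattern, and the helpers `build_annotations` and `write_annotations` are assumptions made for this example; the actual schema is whatever the demo files in `data/annotations/` use, so adapt the keys to match them.

```python
import json
from pathlib import Path


def build_annotations(dataset_dir, split="train", pattern="*.mp4"):
    """Return one record per video found under dataset_dir (hypothetical schema)."""
    root = Path(dataset_dir)
    records = []
    for video in sorted(root.rglob(pattern)):
        records.append({
            "video_path": video.as_posix(),  # assumed key name
            "label": video.parent.name,      # assumed: the parent folder names the class
            "split": split,                  # assumed key name
        })
    return records


def write_annotations(records, out_path):
    """Write the records as indented JSON, creating parent folders as needed."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records, indent=2), encoding="utf-8")
```

For example, `write_annotations(build_annotations("data/MEAD"), "data/annotations/my_dataset.json")` would write one record per `.mp4` file found under `data/MEAD/`.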

---

# Training

All training scripts are launched via:

    bash tools/dist_train.sh <CONFIG_PATH> <NUM_GPUS>

Below we list typical commands for each task.

## 1. Pretrain SyncLipMAE

Self-supervised pretraining on talking-face videos:

    bash tools/dist_train.sh configs/train/slipmae/slipmae_pretrain.py 8

- Uses the config at `configs/train/slipmae/slipmae_pretrain.py`.
- Launches distributed training on 8 GPUs.
- Trains the SyncLipMAE backbone with masked visual modeling and audio–visual contrastive learning.

## 2. Downstream: Classification (Action / Emotion)

### (a) Action recognition

    bash tools/dist_train.sh configs/train/slipmae/slipmae_action.py 8

- Fine-tunes SyncLipMAE for head/face action or facial behavior classification.
- Uses the pretrained backbone and adds a classification head on top.

### (b) Emotion recognition

    bash tools/dist_train.sh configs/train/slipmae/slipmae_emotion.py 8

- Fine-tunes SyncLipMAE for facial emotion recognition.
- Uses the pretrained backbone with an emotion classification head.

## 3. Downstream: Visual Speech Recognition (VSR)

    bash tools/dist_train.sh configs/train/espnet_slipmae/espnet_slipmae.py 8

- Trains a VSR model by combining SyncLipMAE visual features with an ESPnet-based speech decoder.
- The config controls the dataset, tokenizer, and decoder architecture.

## 4. Downstream: Visual Dubbing

    bash tools/dist_train.sh configs/train/wanvace_slipmae/wanvacev5_slipmae_1.3b_lowerface.py 8

- Fine-tunes a WanVACE-based talking-head generator using SyncLipMAE features.
- Supports both audio-driven and video-driven visual dubbing.
- The config focuses on lower-face control and uses a generator backbone with roughly 1.3B parameters.
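
Because every task goes through the same launcher, the commands above can be generated from a task-to-config table. The sketch below does that. The `TRAIN_CONFIGS` mapping, its keys, and the helpers `build_train_command` and `run_training` are names invented for this example; the config paths are copied from the commands listed above.

```python
import subprocess

# Config paths copied from the commands above; the keys are illustrative labels.
TRAIN_CONFIGS = {
    "pretrain": "configs/train/slipmae/slipmae_pretrain.py",
    "action": "configs/train/slipmae/slipmae_action.py",
    "emotion": "configs/train/slipmae/slipmae_emotion.py",
    "vsr": "configs/train/espnet_slipmae/espnet_slipmae.py",
    "dubbing": "configs/train/wanvace_slipmae/wanvacev5_slipmae_1.3b_lowerface.py",
}


def build_train_command(task, num_gpus=8):
    """Return the argv list for `bash tools/dist_train.sh <CONFIG_PATH> <NUM_GPUS>`."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["bash", "tools/dist_train.sh", TRAIN_CONFIGS[task], str(num_gpus)]


def run_training(task, num_gpus=8):
    """Launch one training job from the repository root and wait for it to finish."""
    subprocess.run(build_train_command(task, num_gpus), check=True)
```

Calling `run_training("pretrain")` followed by `run_training("emotion")` would run pretraining and then emotion fine-tuning back to back, stopping at the first job that exits with an error.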

---

# Inference

All inference entry points are Python scripts whose main function is exposed as `infer` and launched via:

    import fire
    fire.Fire(infer)

Each parameter of `infer` therefore becomes a command-line flag, which you can pass as `--key=value` or `--key value` (the examples below use the latter form).
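
To make the calling convention concrete, here is a minimal sketch of such an entry point. The body of `infer` is a placeholder, and its parameter list is assumed from the arguments documented in the sections below; the real pipelines load a model and run inference instead of echoing their inputs.

```python
def infer(cfg, checkpoint, video_path, audio_path=None, dtype="bfloat16"):
    """Placeholder entry point: returns the parsed arguments instead of running a model."""
    return {
        "cfg": cfg,
        "checkpoint": checkpoint,
        "video_path": video_path,
        "audio_path": audio_path,
        "dtype": dtype,
    }


def main():
    # Imported lazily so `infer` stays importable without python-fire installed.
    import fire

    # Fire maps each `--name=value` (or `--name value`) flag to the
    # keyword argument of `infer` with the same name.
    fire.Fire(infer)
```

If this were saved as a script that calls `main()` under `if __name__ == "__main__":`, it could be run as `python <SCRIPT> --cfg=<CFG> --checkpoint=<CHECKPOINT> --video_path=<VIDEO_PATH>`. Parameters with defaults, such as `audio_path` here, can be omitted on the command line.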

Below we summarize the usage and key arguments for each pipeline.  
(All example commands use placeholders like `<CFG>`, `<CHECKPOINT>`, `<VIDEO_PATH>` instead of specific data paths.)

## 1. AV Sync Inference (Audio–Visual Stream Synchronization)

Use the SyncLipMAE pretraining checkpoint to evaluate audio–visual synchronization, e.g.:

    python mmhug/pipelines/pipeline_slipmae_avsync.py \
        --cfg <CFG_PRETRAIN> \
        --checkpoint <CHECKPOINT_PRETRAIN> \
        --video_path <VIDEO_PATH> \
        --keypoint_path <KEYPOINT_PATH> \
        --audio_path <AUDIO_PATH_OR_NONE> \
        --dtype bfloat16

**Arguments**

- `--cfg`  
  Path to the SyncLipMAE pretraining config file  
  (e.g., `configs/train/slipmae/slipmae_pretrain.py`).

- `--checkpoint`  
  Path to the pretraining checkpoint  
  (e.g., `work_dirs/slipmae_pretrain/iter_xxx.pth`).

- `--video_path`  
  Path to the input talking-face video.

- `--audio_path` (optional)  
  Path to an external audio track.  
  If set to `None`, the audio track is read from `video_path` instead, provided the video contains one.

- `--keypoint_path`  
  Path to precomputed facial keypoints (e.g., 308-point landmarks).

- `--dtype`  
  Computation precision, e.g. `bfloat16` or `float32`.

The script computes SyncLipMAE features and evaluates AV sync metrics such as `Acc_{±K}`, `Offset`, and R-precision.
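
For orientation, the sketch below shows one common way offset-based sync metrics are defined: `Acc_{±K}` as the fraction of clips whose predicted audio–visual offset lies within K frames of the ground truth, and an offset error as the mean absolute difference. These definitions and the function names are assumptions for illustration; the script's own metric definitions may differ.

```python
def accuracy_within_k(pred_offsets, true_offsets, k):
    """Fraction of samples with |predicted - true| <= k frames (assumed definition)."""
    if len(pred_offsets) != len(true_offsets):
        raise ValueError("pred_offsets and true_offsets must have the same length")
    if not pred_offsets:
        raise ValueError("at least one sample is required")
    hits = sum(abs(p - t) <= k for p, t in zip(pred_offsets, true_offsets))
    return hits / len(pred_offsets)


def mean_abs_offset_error(pred_offsets, true_offsets):
    """Mean absolute difference between predicted and true offsets, in frames."""
    if len(pred_offsets) != len(true_offsets) or not pred_offsets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(pred_offsets, true_offsets))
    return total / len(pred_offsets)
```

With predicted offsets `[0, 1, -3, 5]` and an all-zero ground truth, `accuracy_within_k(..., k=1)` gives 0.5 and `mean_abs_offset_error` gives 2.25.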

## 2. Classification Inference (Emotion / Action)

### (a) Emotion classification (face-only)

Run the emotion classifier fine-tuned from SyncLipMAE, e.g.:

    python mmhug/pipelines/pipeline_slipmae_emotion.py \
        --cfg <CFG_EMOTION> \
        --checkpoint <CHECKPOINT_EMOTION> \
        --video_path <VIDEO_PATH> \
        --face_only True

**Arguments**

- `--cfg`  
  Config file for emotion classification  
  (e.g., `configs/train/slipmae/slipmae_emotion_faceonly.py`).

- `--checkpoint`  
  Path to the fine-tuned emotion model checkpoint  
  (e.g., `work_dirs/slipmae_emotion_faceonly/iter_xxx.pth`).

- `--video_path`  
  Path to the input video.

- `--face_only`  
  If `True`, only the face region is used for classification.

*(Action classification is analogous, using the corresponding action config and checkpoint.)*

## 3. VSR Inference (Visual Speech Recognition)

VSR pipeline built on SyncLipMAE + ESPnet, e.g.:

    python mmhug/pipelines/pipeline_slipmae_vsr.py \
        --cfg <CFG_VSR> \
        --checkpoint <CHECKPOINT_VSR> \
        --video_path <VIDEO_PATH> \
        --dtype bfloat16

**Arguments**

- `--cfg`  
  Config file for the ESPnet–SyncLipMAE VSR model  
  (e.g., `configs/train/espnet_slipmae/espnet_slipmae.py`).

- `--checkpoint`  
  Path to the VSR checkpoint  
  (e.g., `work_dirs/espnet_slipmae/iter_xxx.pth`).

- `--video_path`  
  Path to the input talking-face video.

- `--dtype`  
  Precision for inference (`bfloat16` / `float32`).

The script extracts visual features using SyncLipMAE and decodes them into text with the ESPnet-based decoder.
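
To transcribe a folder of videos, the command above can be generated once per file. The sketch below only builds the argument lists and does not run anything. `build_vsr_commands` and the `*.mp4` pattern are assumptions for this example; the script path and flag names are the ones documented above.

```python
from pathlib import Path

VSR_SCRIPT = "mmhug/pipelines/pipeline_slipmae_vsr.py"  # path from the command above


def build_vsr_commands(video_dir, cfg, checkpoint, dtype="bfloat16", pattern="*.mp4"):
    """Return one argv list per matching video in video_dir, in sorted order."""
    commands = []
    for video in sorted(Path(video_dir).glob(pattern)):
        commands.append([
            "python", VSR_SCRIPT,
            f"--cfg={cfg}",
            f"--checkpoint={checkpoint}",
            f"--video_path={video.as_posix()}",
            f"--dtype={dtype}",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Every invocation is a separate process, so the model is loaded again for each video.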

## 4. Visual Dubbing Inference (Audio- / Video-Driven)

WanVACE + SyncLipMAE visual dubbing pipeline, e.g.:

    python mmhug/pipelines/pipeline_slipmae_wanvace_v5.py \
        --video_path <SOURCE_VIDEO_PATH> \
        --audio_path <DRIVING_AUDIO_PATH_OR_NONE> \
        --driven_video_path <DRIVING_VIDEO_PATH> \
        --prompt "Adjust the speaker's mouth shapes based on the input audio." \
        --cfg <CFG_WANVACE> \
        --checkpoint <CHECKPOINT_WANVACE> \
        --output_path <OUTPUT_DIR> \
        --guidance_scale 1.0 \
        --num_ref_img 1 \
        --num_inference_steps 50

**Arguments**

- `--video_path`  
  Source video whose identity, pose, and background are to be preserved.

- `--audio_path` (optional)  
  Driving audio.  
  - If provided, the model performs **audio-driven** dubbing.  
  - If set to `None`, pass `--driven_video_path` instead to perform **video-driven** dubbing.

- `--driven_video_path` (optional)  
  Driving video whose mouth motion / lip dynamics are to be transferred.

- `--prompt`  
  High-level textual instruction for the dubbing behavior  
  (e.g., how strictly to follow the audio, or which region to focus on).

- `--cfg`  
  Config for the WanVACE + SyncLipMAE dubbing model  
  (e.g., `configs/train/wanvace_slipmae/wanvacev5_slipmae_1.3b_lowerface.py`).

- `--checkpoint`  
  Path to the visual dubbing checkpoint  
  (e.g., `work_dirs/wanvacev5_slipmae_1.3b_lowerface/iter_xxx.pth`).

- `--output_path`  
  Directory to save the generated videos.

- `--guidance_scale`  
  Guidance strength controlling how strongly the generation follows the conditioning (audio / video / prompt).

- `--num_ref_img`  
  Number of reference frames used to establish appearance.

- `--num_inference_steps`  
  Number of diffusion inference steps (quality vs. speed trade-off).
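
The interplay between `--audio_path` and `--driven_video_path` described above amounts to a simple precedence rule. The sketch below spells it out. `select_driving_mode` is a hypothetical helper, not a function from the pipeline, and raising an error when neither input is given is an assumption about reasonable behavior.

```python
def select_driving_mode(audio_path=None, driven_video_path=None):
    """Return "audio" or "video" depending on which driving input is set."""
    if audio_path is not None:
        # Driving audio takes precedence: audio-driven dubbing.
        return "audio"
    if driven_video_path is not None:
        # No audio given: use the driving video's lip motion instead.
        return "video"
    raise ValueError("provide --audio_path or --driven_video_path")
```

A call such as `select_driving_mode(audio_path=None, driven_video_path="<DRIVING_VIDEO_PATH>")` returns `"video"`, matching the video-driven case.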
