# MEMO

**MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation**

<div align="center">
    <img src="assets/demo.gif" alt="Demo GIF" width="100%">
</div>

## Installation

```bash
conda create -n memo python=3.10 -y
conda activate memo
conda install -c conda-forge ffmpeg -y
pip install -e .
```

## Inference

```bash
python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>
```

For example:

```bash
python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs
```
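
To process several image/audio pairs, the command above can be scripted. The following is a minimal sketch and not part of the repository: `build_inference_command` and `run_batch` are hypothetical helpers that only assemble the flags documented above (`--config`, `--input_image`, `--input_audio`, `--output_dir`), and they assume the script is run from the repository root.

```python
import subprocess
import sys


def build_inference_command(image, audio, output_dir, config="configs/inference.yaml"):
    """Assemble the inference.py invocation using only the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--config", str(config),
        "--input_image", str(image),
        "--input_audio", str(audio),
        "--output_dir", str(output_dir),
    ]


def run_batch(pairs, output_dir="outputs"):
    """Run inference once per (image, audio) pair, stopping at the first failure."""
    for image, audio in pairs:
        # check=True raises CalledProcessError if inference.py exits with an error.
        subprocess.run(build_inference_command(image, audio, output_dir), check=True)
```

For instance, `run_batch([("assets/examples/dicaprio.jpg", "assets/examples/speech.wav")])` would launch the same run as the example command. Each run is a separate process, so the model is reloaded for every pair.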

> We tested the code on H100 and RTX 4090 GPUs using CUDA 12. Under the default settings (fps=30, inference_steps=20), the inference time is around 1 second per frame on H100 and 2 seconds per frame on RTX 4090. We welcome community contributions to improve the inference speed or add more features.
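
The per-frame figures above translate into a rough wall-clock estimate for a given audio clip. The sketch below is only back-of-the-envelope arithmetic: `estimate_inference_seconds` is a hypothetical helper, it assumes the generated video is as long as the input audio, and it ignores model loading and video encoding time.

```python
import math


def estimate_inference_seconds(audio_seconds, fps=30, seconds_per_frame=1.0):
    """Estimate generation time from the per-frame figures quoted above."""
    # Assumption: one frame per 1/fps seconds of audio, rounded up.
    num_frames = math.ceil(audio_seconds * fps)
    return num_frames * seconds_per_frame


# A 10-second clip at the default fps=30 is 300 frames.
h100 = estimate_inference_seconds(10, seconds_per_frame=1.0)     # 300 s
rtx4090 = estimate_inference_seconds(10, seconds_per_frame=2.0)  # 600 s
print(f"H100: ~{h100 / 60:.0f} min, RTX 4090: ~{rtx4090 / 60:.0f} min")
```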

## Finetuning Our Model

We provide a straightforward finetuning script for users to continue training on their own datasets.

### Step 1: Data Preparation

Install the dependencies for data preprocessing and finetuning:

```bash
pip install deepspeed decord wandb
```

Your training data should be in the form of video clips. The data should be organized as follows:

```plaintext
data
└── video
    ├── *.mp4
    └── ...
```
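
Before preprocessing, it can help to confirm that the clips are where the scripts expect them. The following is a minimal sketch, assuming the `data/video` layout shown above; `list_training_clips` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def list_training_clips(data_root="data"):
    """Return the sorted .mp4 files found directly under <data_root>/video."""
    video_dir = Path(data_root) / "video"
    if not video_dir.is_dir():
        raise FileNotFoundError(f"Expected a directory at {video_dir}")
    # Only files with the .mp4 extension count as training clips here.
    clips = sorted(video_dir.glob("*.mp4"))
    if not clips:
        raise ValueError(f"No .mp4 files found in {video_dir}")
    return clips
```

Calling `list_training_clips("data")` either returns the clip paths or fails early with a message naming the missing directory.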

We also provide an efficient script for calculating video durations:

```bash
python scripts/calculate_durations.py data/video
```

We preprocess all audio embeddings, face embeddings, and emotion labels in advance to accelerate the training process. To preprocess the data, run the following command:

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/prepare_data.py --input_dir data/video --output_dir data/embedding --misc_model_dir checkpoints
```

The preprocessed embeddings are saved in the `data/embedding` directory:

```plaintext
data
├── video
│   ├── *.mp4
│   └── ...
└── embedding
    ├── audio_emb
    ├── audio_emotion
    ├── face_emb
    ├── vocals
    └── metadata.jsonl
```
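
To inspect the preprocessing output, `metadata.jsonl` can be loaded line by line. This is a minimal sketch that assumes only what the file extension suggests, namely one JSON object per line; the fields inside each record are not documented here, so the code makes no assumptions about them. `read_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, one per non-empty line."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `records = read_jsonl("data/embedding/metadata.jsonl")` followed by `print(len(records), records[0])` shows how many entries were written and what a single entry looks like.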

### Step 2: Finetuning

Run the finetuning script:

```bash
accelerate launch --config_file configs/accelerate.yaml finetune.py --config configs/finetune.yaml --exp_name finetune 2>&1 | tee outputs_finetune.log
```

To run inference with the finetuned model, replace the `model_name_or_path` value in `configs/inference.yaml` with the path to the finetuned checkpoint (e.g., `outputs/finetune/checkpoint-10000`), then run:

```bash
python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs
```
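
The config edit described above can also be scripted so that the original file stays untouched. The sketch below is an assumption-laden illustration: it assumes `model_name_or_path` appears on a single line of the form `model_name_or_path: <value>`, and `set_model_path`, `write_finetuned_config`, and the output filename are hypothetical. It uses a plain text substitution rather than a YAML parser, so any trailing comment on that line is dropped.

```python
import re
from pathlib import Path

# Matches "<indent>model_name_or_path: <anything up to end of line>".
_KEY_LINE = re.compile(r"^([ \t]*model_name_or_path[ \t]*:[ \t]*).*$", flags=re.MULTILINE)


def set_model_path(config_text, new_path):
    """Return config_text with the value of model_name_or_path replaced."""
    updated, count = _KEY_LINE.subn(lambda m: m.group(1) + new_path, config_text)
    if count == 0:
        raise KeyError("model_name_or_path not found in config")
    return updated


def write_finetuned_config(src, dst, checkpoint):
    """Copy the config at src to dst, pointing it at the given checkpoint."""
    Path(dst).write_text(set_model_path(Path(src).read_text(), checkpoint))
```

For example, `write_finetuned_config("configs/inference.yaml", "configs/inference_finetuned.yaml", "outputs/finetune/checkpoint-10000")` writes a second config that can then be passed to `inference.py` through `--config`.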

## Ethics Statement

We acknowledge the potential of AI in generating talking videos, with applications spanning education, virtual assistants, and entertainment. However, we are equally aware of the ethical, legal, and societal challenges that misuse of this technology could pose. To reduce potential risks, we have only open-sourced a preview model for research purposes. Demos on our website use publicly available materials. If you have copyright concerns, please contact us and we will address them promptly. Users are required to ensure that their actions align with legal regulations, cultural norms, and ethical standards. It is strictly prohibited to use the model for creating malicious, misleading, defamatory, or privacy-infringing content, such as deepfake videos for political misinformation, impersonation, harassment, or fraud. We strongly encourage users to review generated content carefully, ensuring it meets ethical guidelines and respects the rights of all parties involved. Users must also ensure that their inputs (e.g., audio and reference images) and outputs are used with proper authorization. Unauthorized use of third-party intellectual property is strictly forbidden. While users may claim ownership of content generated by the model, they must ensure compliance with copyright laws, particularly when the content involves a public figure's likeness, voice, or other aspects protected under personality rights.
