# LUMA – Low-Dimension Unified Motion Alignment  
**Dual-Path Anchoring for High-Fidelity Text-to-Motion Diffusion**

LUMA turns a plain-English prompt into smooth, realistic 3-D human motion, and it does so **faster and more accurately** than previous diffusion models.

## ✨ Why LUMA?

| Innovation | What it means for you |
|------------|-----------------------|
| **Dual-Path Anchors** | Combines a *temporal* anchor (lightweight MoCLIP) with a *frequency* anchor (low-freq DCT) to chase both semantics *and* kinematics in one shot. |
| **Timestep-Aware FiLM Modulation** | Injects these anchors adaptively through the denoising steps, revitalising deep-layer gradients and curing semantic drift. |
| **State-of-the-Art Quality** | Delivers record-low FID scores (0.035 on HumanML-3D, 0.123 on KIT-ML). |
| **1.4 × Faster Convergence** | Reaches target quality in 35k steps vs. 50k for the baseline, saving hours of GPU time. |
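
To make the *frequency* anchor more concrete, the sketch below computes an orthonormal DCT-II along the time axis of a single motion channel and keeps only the lowest coefficients. It is an illustration, not LUMA's implementation: `dct_ii`, `low_freq_anchor`, the toy trajectory, and the choice of `keep=4` are all assumptions made for this example.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        # Orthonormal scaling: the DC term uses sqrt(1/n), all others sqrt(2/n).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def low_freq_anchor(trajectory, keep):
    """Keep only the first `keep` (lowest-frequency) DCT coefficients."""
    return dct_ii(trajectory)[:keep]


# Toy channel: one joint coordinate over 16 frames = slow sway + fast jitter.
frames = 16
trajectory = [math.sin(2 * math.pi * t / frames) + 0.05 * (-1) ** t
              for t in range(frames)]

anchor = low_freq_anchor(trajectory, keep=4)
print(len(anchor))  # 4
```

The retained coefficients mainly describe the slow sway, while the frame-to-frame jitter is concentrated in the discarded high-frequency coefficients.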

## 🚀 What can I use it for?

* Animate characters in games, VR/AR, or film from a single sentence.  
* Prototype robot motions directly from natural-language instructions.  
* Research new control or editing techniques on a strong, open backbone.

## 🔧 Environment Setup

This code was tested on an `NVIDIA A100` GPU and requires:

* conda3 or miniconda3
* python 3.8+
* pytorch 1.10+

### Step 1: Create conda environment

```bash
conda create -n luma python=3.8 -y
conda activate luma
```

### Step 2: Install PyTorch

```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge 
```

**Important:** Make sure that your compilation CUDA version and runtime CUDA version match.
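
A quick way to compare the two versions is sketched below. It assumes the versions in question are the toolkit release reported by `nvcc --version` and the CUDA version of the installed PyTorch build (`torch.version.cuda`); the helper names are hypothetical and not part of this repository.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Return the 'major.minor' release found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(first, second):
    """True only if both versions are known and agree on major.minor."""
    if first is None or second is None:
        return False
    return first.split(".")[:2] == second.split(".")[:2]


def detect_versions():
    """Best-effort lookup of (nvcc release, CUDA version of the PyTorch build)."""
    nvcc_version = None
    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        nvcc_version = parse_cuda_release(result.stdout)
    try:
        import torch
        torch_cuda_version = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda_version = None
    return nvcc_version, torch_cuda_version


nvcc_version, torch_cuda_version = detect_versions()
print("nvcc:", nvcc_version, "| torch:", torch_cuda_version)
print("match:", cuda_versions_match(nvcc_version, torch_cuda_version))
```

If either value is reported as `None`, the corresponding tool is missing from the active environment and the comparison is reported as a mismatch.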

### Step 3: Install other requirements

```bash
pip install -r requirements.txt
```

### Step 4: Install ffmpeg for visualization

```bash
conda install ffmpeg x264=20131218 -c conda-forge
```

### Step 5: Modify LayerNorm for fp16 inference

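```python
# Edit miniconda3/envs/luma/lib/python3.8/site-packages/clip/model.py
# and replace its LayerNorm class with the version below.
# `torch` and `nn` are already imported at the top of that file.
class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        if self.weight.dtype == torch.float32:
            # fp32 weights: compute in fp32, then cast back to the input dtype.
            orig_type = x.dtype
            ret = super().forward(x.type(torch.float32))
            return ret.type(orig_type)
        else:
            # Weights are not fp32 (e.g. fp16): run the layer directly.
            return super().forward(x)
```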

## 📊 Dataset Preparation

### HumanML3D Dataset

Follow the instructions in the [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git) repository, then copy the resulting dataset into this repository:

```bash
cp -r ../HumanML3D/HumanML3D ./data/HumanML3D
```

### KIT-ML Dataset

Download the dataset from the [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git) repository (no processing needed) and place it in `./data/KIT-ML`.

Expected data structure:
```text
LUMA
└── data
    ├── HumanML3D
    │   ├── new_joint_vecs
    │   ├── new_joints
    │   ├── texts
    │   ├── Mean.npy
    │   ├── Std.npy
    │   ├── test.txt
    │   ├── train_val.txt
    │   ├── train.txt
    │   └── val.txt
    └── KIT-ML
        ├── new_joint_vecs
        ├── new_joints
        ├── texts
        ├── Mean.npy
        ├── Std.npy
        ├── test.txt
        ├── train_val.txt
        ├── train.txt
        └── val.txt
```
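
Before training, it can help to confirm that both dataset folders match the tree above. The following is a minimal sketch; `missing_entries` is a hypothetical helper (not shipped with the repository) and it only checks that the listed names exist, not that their contents are valid.

```python
from pathlib import Path

# Names taken from the expected data structure shown above.
EXPECTED_DIRS = ("new_joint_vecs", "new_joints", "texts")
EXPECTED_FILES = ("Mean.npy", "Std.npy", "test.txt",
                  "train_val.txt", "train.txt", "val.txt")


def missing_entries(dataset_root):
    """Return the expected sub-directories and files absent from `dataset_root`."""
    root = Path(dataset_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


# Run from the repository root.
for name in ("HumanML3D", "KIT-ML"):
    gaps = missing_entries(Path("data") / name)
    status = "OK" if not gaps else "missing " + ", ".join(gaps)
    print(f"{name}: {status}")
```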

## 🎯 Training

### Train on HumanML3D

```bash
python -m scripts.train --name luma_humanml3d --model-ema --dataset_name t2m
```

### Train on KIT-ML

```bash
python -m scripts.train --name luma_kit --model-ema --dataset_name kit
```

To train on multiple GPUs, you can also specify `--config_file`.

### Training Arguments

* `--name`: Experiment name for saving checkpoints
* `--model-ema`: Enable exponential moving average for model weights
* `--dataset_name`: Choose dataset (`t2m` for HumanML3D, `kit` for KIT-ML)
* `--lambda_fre`: Weight for frequency semantic anchor loss (default: 0.2)
* `--lambda_tem`: Weight for temporal semantic anchor loss (default: 0.3)
* `--decay_threshold`: Steps for cosine annealing schedule (default: 50000)
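
The list above does not say how the three loss-related arguments interact. One plausible reading, sketched below purely as an assumption, is that the two anchor losses are added to the diffusion loss with weights `lambda_fre` and `lambda_tem`, and that a cosine factor anneals their contribution to zero over `decay_threshold` steps. The function names and the exact schedule are hypothetical; check `scripts/train.py` for the actual behaviour.

```python
import math


def cosine_decay(step, decay_threshold=50_000):
    """Cosine factor: 1.0 at step 0, 0.0 at `decay_threshold`, held at 0.0 after."""
    progress = min(max(step, 0), decay_threshold) / decay_threshold
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def total_loss(diffusion_loss, freq_loss, temp_loss, step,
               lambda_fre=0.2, lambda_tem=0.3, decay_threshold=50_000):
    """ASSUMED combination: diffusion loss plus annealed, weighted anchor losses."""
    scale = cosine_decay(step, decay_threshold)
    return diffusion_loss + scale * (lambda_fre * freq_loss + lambda_tem * temp_loss)


# Anchor influence at the start, midpoint, and end of the assumed schedule.
for step in (0, 25_000, 50_000):
    print(step, round(cosine_decay(step), 3))
```

Under this reading, the anchors guide early training most strongly and the objective reduces to the plain diffusion loss once `decay_threshold` is reached.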

## 🧪 Testing & Evaluation

### Generate motions from text prompts

```bash
# Generate from a single prompt
python -m scripts.generate --text_prompt "a man walks in a circle" --motion_length 4 --opt_path ./checkpoints/t2m/luma_humanml3d/opt.txt

# Generate from text file
python -m scripts.generate --input_text ./assets/prompts.txt --motion_length 4 --opt_path ./checkpoints/t2m/luma_humanml3d/opt.txt

# Generate from test set prompts
python -m scripts.generate --num_samples 10 --opt_path ./checkpoints/t2m/luma_humanml3d/opt.txt
```

### Evaluation

```bash
# Evaluate on HumanML3D
python -m scripts.evaluation --opt_path ./checkpoints/t2m/luma_humanml3d/opt.txt

# Evaluate on KIT-ML
python -m scripts.evaluation --opt_path ./checkpoints/kit/luma_kit/opt.txt
```
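
For reference, FID (the metric quoted in the feature table at the top) is the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated motions. Assuming the evaluation script follows the standard definition, which this README does not spell out, with feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ for real and generated motions:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Lower is better, and two identical feature distributions give an FID of 0.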

### Generation Arguments

* `--text_prompt`: Single text prompt for motion generation
* `--motion_length`: Motion length in seconds
* `--input_text`: Path to text file with multiple prompts
* `--num_samples`: Number of samples to generate from test set
* `--device`: GPU device ID
* `--diffuser_name`: Sampler type (`ddpm`, `ddim`, `dpmsolver`)
* `--num_inference_steps`: Number of denoising steps during inference
* `--seed`: Random seed for reproducible results
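
`--motion_length` is given in seconds, while the datasets store motions as frame sequences. The conversion below uses the published dataset frame rates (20 fps for HumanML3D, 12.5 fps for KIT-ML). Whether `scripts.generate` performs exactly this conversion is an assumption, and `seconds_to_frames` is a hypothetical helper, not part of the repository.

```python
# Published frame rates, keyed by the `--dataset_name` values used in this README.
DATASET_FPS = {"t2m": 20.0, "kit": 12.5}


def seconds_to_frames(motion_length, dataset_name="t2m"):
    """Convert a duration in seconds to a whole number of frames."""
    return int(round(motion_length * DATASET_FPS[dataset_name]))


print(seconds_to_frames(4, "t2m"))  # 80
print(seconds_to_frames(4, "kit"))  # 50
```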

### Output

Generated motions will be saved as:
* `output_dir/joints_npy/xx.npy` - xyz pose sequence
* `output_dir/xx.mp4` - visual animation

The output directory is located inside the checkpoint folder, e.g. `checkpoints/t2m/luma_humanml3d/samples_*/`.
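
To collect the results of one or more runs, a small script can walk the `samples_*` folders. The sketch below assumes each `.npy` file and its `.mp4` share the same file stem (the `xx` above); `list_generated` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def list_generated(checkpoint_dir):
    """Return (npy_path, mp4_path or None) pairs found under samples_*/ folders."""
    pairs = []
    for npy in sorted(Path(checkpoint_dir).glob("samples_*/joints_npy/*.npy")):
        # Assumed naming: the animation sits one level up with the same stem.
        mp4 = npy.parent.parent / f"{npy.stem}.mp4"
        pairs.append((npy, mp4 if mp4.is_file() else None))
    return pairs


for npy_path, mp4_path in list_generated("checkpoints/t2m/luma_humanml3d"):
    print(npy_path, "->", mp4_path)
```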

## 🧠 MoCLIP Training (Optional)

LUMA uses MoCLIP (Motion-aware CLIP) for better text-motion alignment. You can optionally train your own MoCLIP model or use our pre-trained ones.

### Train MoCLIP on HumanML3D

```bash
python train_moclip.py --dataset_name t2m --exp_name moclip_humanml3d --batch_size 32 --num_epochs 30
```

### Train MoCLIP on KIT-ML

```bash
python train_kit_moclip.py --exp_name moclip_kit --batch_size 32 --num_epochs 50
```

### MoCLIP Training Arguments

* `--exp_name`: Experiment name for saving checkpoints
* `--dataset_name`: Dataset name (`t2m` for HumanML3D, `kit` for KIT-ML)
* `--batch_size`: Training batch size
* `--learning_rate`: Learning rate (default: 1e-4)
* `--num_epochs`: Number of training epochs
* `--freeze_clip`: Freeze CLIP parameters for faster training
* `--input_dim`: Motion input dimension (263 for T2M, 251 for KIT)
* `--embed_dim`: Embedding dimension (default: 768)
* `--temperature`: Contrastive learning temperature (default: 0.07)
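
To show what `--temperature` controls, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss over paired text and motion embeddings. That MoCLIP uses this exact loss is an assumption (the README only mentions a contrastive temperature), and `clip_style_loss` and its helpers are illustrative names.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    losses = []
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_sum - row[i])
    return sum(losses) / len(losses)


def clip_style_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE: text-to-motion and motion-to-text, averaged."""
    text = [normalize(t) for t in text_emb]
    motion = [normalize(m) for m in motion_emb]
    # Cosine similarities divided by the temperature; lower values sharpen the softmax.
    logits = [[sum(a * b for a, b in zip(t, m)) / temperature for m in motion]
              for t in text]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(clip_style_loss(aligned, aligned) < clip_style_loss(aligned, shuffled))  # True
```

Matching pairs sit on the diagonal of the similarity matrix, so the loss is small when each text embedding is closest to its own motion embedding and large when the pairs are mismatched.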

### Pre-trained MoCLIP Models

If you don't want to train from scratch, you can use our pre-trained MoCLIP models:

* **HumanML3D**: `./checkpoints/moclip_training/best_model.pt`
* **KIT-ML**: `./checkpoints/moclip_kit_training/best_model.pt`

These models are automatically loaded when training LUMA with the `--moclip_model_path auto` argument.
