# TWNM · The World is Not Mono

TWNM is an open-source research project that unifies speech, spatial audio, and music understanding under a single generative model. The system couples a frozen Whisper encoder, a spatial-aware encoder, and a Qwen2-family decoder with mixture-of-experts routing. The repository is laid out so that the community can explore, reproduce, and extend the work.

## Repository Layout

```
TWNM/
├── README.md
├── docs/                     # Project documentation
│   └── architecture.md       # Detailed architecture notes
├── configs/                  # Training & inference YAML configs
│   └── deepspeed/            # DeepSpeed launch profiles
├── src/                      # Python package (install with `pip install -e .` or add to PYTHONPATH)
│   └── twnm/
│       ├── models/           # Core model definitions (TWNM, configs, spatial encoder…)
│       ├── data/             # Dataset helpers (jsonl readers, collate functions…)
│       ├── rl/               # GRPO trainer & reward utilities
│       ├── utils/            # Generic metrics & evaluation utilities
│       └── tools.py          # Legacy helper functions kept for compatibility
├── scripts/                  # CLI entrypoints grouped by use case
│   ├── train/                # Supervised, LoRA and GRPO training drivers
│   ├── infer/                # One-off and batch inference scripts
│   ├── eval/                 # Evaluation pipelines for MMAU / SpatialQA
│   ├── debug/                # Smoke tests and debugging helpers
│   ├── tools/                # Auxiliary export / analysis utilities
│   └── launch/               # Example shell launchers
├── datasets/                 # Sample datasets for quick-start experimentation
│   ├── raw_data/             # Example QA jsonl files & audio clips
│   └── mmau/                 # MMAU mini sets used in evaluation
├── assets/                   # Large binaries kept out of the Python package
│   ├── checkpoints/          # Spatial encoder, merged models, etc.
│   └── audio/                # Demo audio lists
├── outputs/                  # Logs and result artifacts (ignored during packaging)
│   ├── logs/
│   └── results/
├── tools/benchmark/          # Benchmark generation utilities (LLM prompts, scripts)
└── notebooks/                # Interactive exploration notebooks
```

## Usage Guide

### Environment preparation

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH=$(pwd)/src:$PYTHONPATH
```

- The repository ships a TorchScript spatial encoder at `assets/checkpoints/spatial_encoder/spatial_encoder.ts`, which is loaded by default; no proprietary code is required for inference.
- If you have a private spatial encoder implementation, set `TWNM_SPATIAL_ENCODER_MODULE=<dotted.module.path>` before training to override the default TorchScript encoder.
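
The choice between the two encoder sources described above can be sketched as follows. This is a minimal illustration, not the repository's actual loading code: the function names `resolve_spatial_encoder` and `load_torchscript` are hypothetical, and only the environment variable name and the default path come from this README.

```python
import importlib
import os
from pathlib import Path

# Both values are taken from this README.
DEFAULT_TS_PATH = Path("assets/checkpoints/spatial_encoder/spatial_encoder.ts")
ENV_VAR = "TWNM_SPATIAL_ENCODER_MODULE"


def resolve_spatial_encoder(env=None):
    """Decide which spatial encoder source to use (hypothetical helper).

    Returns ("module", imported_module) when the environment variable is set,
    otherwise ("torchscript", path_to_default_file).
    """
    env = os.environ if env is None else env
    dotted = env.get(ENV_VAR, "").strip()
    if dotted:
        # A private implementation is addressed by its dotted module path.
        return "module", importlib.import_module(dotted)
    return "torchscript", DEFAULT_TS_PATH


def load_torchscript(path):
    """Load a TorchScript file on CPU; torch is imported lazily on purpose."""
    import torch

    return torch.jit.load(str(path), map_location="cpu")
```

Importing `torch` lazily keeps the resolution step cheap, so a misconfigured module path fails before any checkpoint is read.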

### Inference

```bash
python scripts/infer/run_inference.py \
  --config_path configs/inference.yaml \
  --checkpoint_path assets/checkpoints/sft2_checkpoint-2502/pytorch_model.bin \
  --wav_path datasets/raw_data/qa_data/audio/scene_000000.wav \
  --prompt "Describe the soundscape in rich detail."
```

- GRPO policy inference (LoRA adapter):

```bash
python scripts/infer/run_inference_grpo.py \
  --model_path assets/checkpoints/sft2_checkpoint-2502/pytorch_model.bin \
  --policy_adapter_path assets/checkpoints/grpo_checkpoint-1/policy \
  --wav_path datasets/raw_data/qa_data/audio/scene_000001.wav \
  --prompt "Please answer the question in the audio."
```

- Batch inference will traverse all WAV files inside a folder:

```bash
python scripts/infer/run_inference_batch.py \
  --input_list datasets/raw_data/qa_data/audio \
  --output_file outputs/results/batch_inference.txt
```
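
As a rough picture of what "traverse all WAV files inside a folder" means, the sketch below collects WAV paths with the standard library. It is an assumption-laden illustration rather than the code in `run_inference_batch.py`: the helper name `list_wav_files` is hypothetical, and whether the real script recurses into subfolders or matches extensions case-insensitively is not stated in this README.

```python
from pathlib import Path


def list_wav_files(folder):
    """Return all WAV files under `folder`, sorted for reproducible ordering.

    Assumptions (not confirmed by the repository): the search is recursive
    and the extension match ignores case.
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".wav"
    )
```

Sorting the result keeps the line order of the output file stable across runs, which makes diffs between batch runs easier to read.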

### Training pipeline

1. **Stage 1 – SFT with MoE (Alignment I)**  
   Trains the router/experts while Whisper & spatial encoder stay frozen.

   ```bash
   python -m torch.distributed.run --nproc_per_node=1 scripts/train/main_train.py \
     --config_path configs/train.yaml \
     --data_dir datasets/raw_data/qa_data \
     --out_dir outputs/experiments/sft_stage1
   ```

   Each sample in the dataset must carry a `router_label` field, which supervises expert routing (see `datasets/raw_data/qa_data/train.jsonl` for an example).

2. **Stage 2 – SFT refinement (Alignment II)**  
   Fine-tunes LoRA adapters on instruction data.

   ```bash
   python -m torch.distributed.run --nproc_per_node=1 scripts/train/train_sft2.py \
     --config_path configs/train_sft2.yaml \
     --data_dir datasets/raw_data/qa_data \
     --out_dir outputs/experiments/sft_stage2 \
     --init_model_path assets/checkpoints/sft_checkpoint-71139/pytorch_model.bin
   ```

3. **Stage 3 – GRPO**  
   Reinforcement learning with custom rewards (`src/twnm/rl/rewards.py`):

   ```bash
   python scripts/train/train_grpo.py \
     --data_file datasets/raw_data/qa_data/train.jsonl \
     --output_dir outputs/experiments/grpo_stage \
     --learning_rate 5e-6 \
     --num_generations 4 \
     --max_prompt_length 512 \
     --max_completion_length 128
   ```

   After training, the LoRA policy adapter is saved to `<output_dir>/lora_policy_adapter/`.
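
For readers new to GRPO, the role of `--num_generations` can be illustrated with the group-relative advantage that the method is generally described as using: several completions are sampled per prompt, and each completion's reward is normalized against the others in its group. The sketch below is a generic illustration and an assumption about the general technique, not the code in `scripts/train/train_grpo.py`; the function name is hypothetical, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's completions within their group.

    `rewards` holds one scalar reward per sampled completion (for example
    4 values when --num_generations is 4). Population standard deviation
    is an assumption made for this sketch.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two good and two bad completions for the same prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason the reward design in `src/twnm/rl/rewards.py` matters.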

See `scripts/launch/` for multi-GPU examples (DeepSpeed configs live under `configs/deepspeed/`).
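
Because Stage 1 requires a `router_label` on every sample, a quick pre-flight check of the JSONL file can save a failed run. The sketch below assumes one JSON object per line; the helper name `find_missing_router_labels` is hypothetical and not part of the repository.

```python
import json


def find_missing_router_labels(jsonl_path, field="router_label"):
    """Return the 1-based line numbers of samples that lack `field`.

    Assumes each non-empty line of the file is a single JSON object.
    """
    missing = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            sample = json.loads(line)
            if field not in sample:
                missing.append(lineno)
    return missing


if __name__ == "__main__":
    bad = find_missing_router_labels("datasets/raw_data/qa_data/train.jsonl")
    print("samples without router_label:", bad or "none")
```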

### Evaluation

```bash
python scripts/eval/evaluate_mmau.py \
  --sft_checkpoint_path assets/checkpoints/sft2_checkpoint-2502/pytorch_model.bin \
  --mmau_json_path datasets/mmau/mmau-test-mini.json \
  --output_file outputs/results/mmau_eval.jsonl
```

Spatial QA / GRPO evaluation scripts in `scripts/eval/` follow a similar interface.

### Available checkpoints

- `assets/checkpoints/spatial_encoder/`
  - `loss=0.4612.ckpt`: original training weights of the spatial encoder
  - `spatial_encoder.ts`: TorchScript inference model (loaded by default)
- `assets/checkpoints/sft_checkpoint-71139/`: Stage 1 SFT weights
- `assets/checkpoints/sft2_checkpoint-2502/`: Stage 2 SFT/LoRA merged weights (recommended for inference)
- `assets/checkpoints/grpo_checkpoint-1/`: GRPO policy adapters and the corresponding tokenizer resources

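### Updating the spatial encoder TorchScript

To re-export the TorchScript version of the proprietary spatial encoder (for example, in an internal environment where the source implementation is available), run: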

```bash
python scripts/tools/export_spatial_encoder_torchscript.py \
  --ckpt assets/checkpoints/spatial_encoder/loss=0.4612.ckpt \
  --output assets/checkpoints/spatial_encoder/spatial_encoder.ts
```

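This command depends on the private implementation `private_impl/spatial_encoder_impl` and is intended for internal use only. External users do not need to run this step.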


## Project Status

- ✅ Repository flattened into a modern Python package (`src/twnm`).
- ✅ Legacy `USAM` naming removed; TWNM now stands for **The World is Not Mono**.
- ✅ Unused modules (CED, LPS, contrastive loss) pruned.
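- ⚠️ The spatial encoder ships by default as a TorchScript module (`assets/checkpoints/spatial_encoder/spatial_encoder.ts`) that contains only what inference needs; to swap in your own implementation, set `TWNM_SPATIAL_ENCODER_MODULE` or bring in a private implementation.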
- ✅ Artifacts, configuration, and scripts grouped for clarity.
- 🔄 Upcoming tasks: consolidate duplicated GRPO trainers, add automated tests, and release environment requirements.

## Contributing

Contributions are welcome! Please open an issue or pull request with proposed changes. For major updates, start a discussion so we can align on design choices (e.g. additional tasks, dataset formats, or new routing strategies).

## Citation

If you build on TWNM for academic work, please cite the repository and note that the **The World is Not Mono** architecture couples Whisper, a spatial encoder, and Qwen2-based MoE decoding.
