
# SkillMoE Demo (Whitened Q + optional parameter-level adapters + optional residual)

This package contains a minimal, *correct* implementation of the **whitened decoder** SkillMoE with:
- toggles for **parameter-level MoE** (adapters in the reverse kernels for the coefficient diffusion),
- a toggle for an **orthogonal-residual** term `R r`,
- a **standard diffusion baseline** in action space,
- and a **multitask, multi-dimensional synthetic test with shared subsequences**.

## What you get

- **ops.py** — QR-retracted orthonormal decoder `Q` and orthonormal complement `R`.
- **models.py** — coefficient denoiser with **LoRA-style adapters** mixed by the gate; residual denoiser; gating networks; baseline action-space denoiser.
- **losses.py** — cosine diffusion scheduler; **Dirichlet–Dirichlet KL** in closed form.
- **data.py** — multi-task toy dataset with **shared subsequences** (reused phase templates) and **sticky gates**.
- **train_eval.py** — trains the **correct ELBO surrogate**: diffusion losses on `z` (and `r` if toggled) + Dirichlet KLs for sticky gates + tiny orthogonality regularizer.
- **run_ablation.py** — runs the ablations:
  1) **baseline** (action diffusion),
  2) **Q_only** (no adapters, no residual),
  3) **Q_residual** (no adapters, with residual),
  4) **Q_adapters** (adapters, no residual),
  5) **Q_adapters_res** (adapters with residual).
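
As a rough illustration of the two ingredients listed for `losses.py`, the sketch below computes a cosine noise schedule and the closed-form KL divergence between two Dirichlet distributions using only the standard library. The function names (`cosine_alpha_bar`, `digamma`, `dirichlet_kl`), the offset `s=0.008`, and the choice of the widely used cosine `alpha_bar` formula are assumptions made for illustration; the package's own code may differ.

```python
import math

def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (cosine schedule).

    Assumption: the common form cos^2(((t/T) + s) / (1 + s) * pi / 2),
    normalised so that alpha_bar[0] == 1.
    """
    f = [math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
         for t in range(num_steps + 1)]
    return [v / f[0] for v in f]

def digamma(x):
    """Digamma function for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def dirichlet_kl(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) ) for positive concentration lists."""
    a0, b0 = sum(alpha), sum(beta)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in beta))
    psi_a0 = digamma(a0)
    return log_norm + sum((a - b) * (digamma(a) - psi_a0)
                          for a, b in zip(alpha, beta))

alpha_bar = cosine_alpha_bar(1000)
kl = dirichlet_kl([1.0, 1.0], [2.0, 2.0])
```

A sticky gate prior could then be expressed by choosing `beta` to concentrate on the previous step's gate, but how the package builds those concentrations is not specified here.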

## Run

```bash
python run_ablation.py
```

The script uses train/val/test splits and reports test-set metrics for each method:
- baseline: `mse_action_denoised`
- SkillMoE variants: `mse_action_proj`, `mse_action_denoised`, `gate_kl_true`, `switch_rate`
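
One plausible reading of `switch_rate` is the fraction of consecutive timesteps at which the most active skill changes. The sketch below implements that reading only; the function name, the argmax-based definition, and the input layout (one list of gate weights per timestep) are assumptions, and the metric computed in `train_eval.py` may be defined differently.

```python
def switch_rate(gates):
    """Fraction of consecutive timesteps whose argmax gate differs.

    gates: list of per-timestep gate weight lists, e.g. [[0.9, 0.1], [0.2, 0.8]].
    Assumed definition, not necessarily the one used by the package.
    """
    # Index of the largest gate weight at each timestep.
    active = [max(range(len(g)), key=g.__getitem__) for g in gates]
    if len(active) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(active, active[1:]))
    return switches / (len(active) - 1)

example_gates = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
rate = switch_rate(example_gates)  # active skills: 0, 0, 1, 1
```

Under this reading, sticky gates should yield a low value, since the active skill persists over long stretches.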

## Tips

- Tune training through the config in `run_ablation.py`: `epochs`, `lr`, `weight_decay`, `lambda_gate_align`, etc.
- Progress bar: set `progress_bar` to `True` or `False` in the config.
- Sanity checks: set `sanity_every` to an integer N to print validation metrics every N epochs (0 disables).
- On Apple Silicon, MPS is used automatically if available.
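
To make the options above concrete, here is a minimal sketch of what such a config and the associated checks might look like. Only the key names come from this README; the values, the helper names `should_run_sanity` and `pick_device`, the 1-based epoch counting, and the CUDA-before-MPS ordering are assumptions for illustration.

```python
# Hypothetical values; only the key names are taken from the README.
config = {
    "epochs": 50,
    "lr": 1e-3,
    "weight_decay": 1e-4,
    "lambda_gate_align": 0.1,
    "progress_bar": True,
    "sanity_every": 5,  # 0 disables the periodic validation printout
}

def should_run_sanity(epoch, sanity_every):
    """True on every sanity_every-th epoch (1-based); never when sanity_every is 0."""
    return sanity_every > 0 and epoch % sanity_every == 0

def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; fall back for illustration
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

sanity_epochs = [e for e in range(1, config["epochs"] + 1)
                 if should_run_sanity(e, config["sanity_every"])]
device = pick_device()
```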

## Notes

- The code uses a **projection from actions to coefficients**, `z = (Q^T a) / (g + ε)`, to form supervised targets for coefficient diffusion, where `g` is the gate weights and `ε` is a small constant for numerical stability. This matches the *coefficient-space diffusion* formulation with whitened `Q`.
- The **adapters** make the reverse kernel depend on `g`; when `use_adapters=False`, the kernels are shared.
- The **Dirichlet KLs** implement the sticky gate prior and an anti-degeneration term on global skill usage within the ELBO.
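
The projection in the first note can be sketched with NumPy as follows. The example builds an orthonormal `Q` and its orthonormal complement `R` from a complete QR factorization, then maps an action to coefficients `z` and a residual `r` and back. The helper names (`orthonormal_pair`, `encode`, `decode`), the sign convention, and the decoder form `a = Q ((g + ε) ⊙ z) + R r` are assumptions chosen so the round trip is exact; the package's `ops.py` may construct these differently.

```python
import numpy as np

def orthonormal_pair(W):
    """QR retraction: Q spans col(W), R spans its orthogonal complement."""
    d, k = W.shape
    # Complete QR gives a d x d orthogonal matrix; split it into two blocks.
    Q_full, T = np.linalg.qr(W, mode="complete")
    # Fix column signs so the triangular factor has a positive diagonal.
    signs = np.sign(np.diag(T[:k, :k]))
    signs[signs == 0] = 1.0
    Q = Q_full[:, :k] * signs
    R = Q_full[:, k:]
    return Q, R

def encode(a, Q, R, g, eps=1e-6):
    """Project an action onto gate-scaled coefficients z and a residual r."""
    z = (Q.T @ a) / (g + eps)
    r = R.T @ a
    return z, r

def decode(z, r, Q, R, g, eps=1e-6):
    """Assumed inverse of encode: rescale z by the gate and add the residual."""
    return Q @ ((g + eps) * z) + R @ r

rng = np.random.default_rng(0)
d, k = 6, 3
Q, R = orthonormal_pair(rng.normal(size=(d, k)))
g = np.array([0.7, 0.2, 0.1])   # hypothetical gate weights
a = rng.normal(size=d)          # hypothetical action vector
z, r = encode(a, Q, R, g)
a_hat = decode(z, r, Q, R, g)
```

Dropping the `R @ r` term corresponds to the variants without a residual, in which case `a_hat` only recovers the part of `a` that lies in the span of `Q`.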
