## UOT-WFM — README

This repository is a heavily modified fork/derivative of TorchCFM. The previous README is for reference only; this document is the authoritative guide for the updated structure and usage.

Authors: Anonymous

## What’s New

- Examples
  - Added 1D examples: `examples/1D_examples`
  - Updated some 2D tutorials: `examples/2D_tutorials`
  - Major revisions to image/CIFAR workflows: `examples/images/cifar10`
- Library core changes
  - Large edits in `torchcfm/conditional_flow_matching.py` (training/sampling logic)
  - Large edits in `torchcfm/optimal_transport.py` (Sinkhorn/OT options)
  - New `torchcfm/logging_utils.py` (logging/plotting utilities)
  - Large edits in `torchcfm/utils.py` (data, normalization, experiment naming, etc.)


## Project Structure (Detailed)

```
uotwfm/
├── examples/
│   ├── 1D_examples/
│   │   ├── FM-visualization.ipynb
│   │   └── FM-visualization_refact.ipynb
│   ├── 2D_tutorials/
│   │   ├── tutorial_training_8_gaussians_to_moons.ipynb
│   │   ├── The_unreasonable_performance_of_minibatch_OT.ipynb
│   │   ├── Maximum_likelihood_CNF_tutorial.ipynb
│   │   ├── Flow_matching_tutorial.ipynb
│   │   ├── SF2M_tutorial.ipynb
│   │   ├── Majority_test.ipynb
│   │   ├── preprocessing/
│   │   ├── majority_test.py
│   │   ├── run_majority_test.sh
│   │   └── run_majority_test2.sh
│   └── images/
│       ├── mnist_example.ipynb
│       ├── conditional_mnist.ipynb
│       └── cifar10/
│           ├── train_cifar10.py               # CIFAR training entrypoint
│           ├── train_cifar10_ddp.py           # DDP training entrypoint
│           ├── compute_eval.py                # Evaluation (FID/PR/PCA) computation
│           ├── utils_cifar.py                 # CIFAR utilities (EMA, sampling, plotting,...)
│           ├── visualize_distribution.py      # data distribution helpers
│           ├── data/                          # CIFAR data root (default ./data)
│           ├── results/                       # checkpoints/logs/samples
│           ├── exp0.sh ... exp3.sh            # example launch scripts
│           └── README.md
├── torchcfm/
│   ├── conditional_flow_matching.py           # core FM classes (ICFM, OT-CFM, Sinkhorn variants,...)
│   ├── optimal_transport.py                   # OT/Sinkhorn utilities
│   ├── logging_utils.py                       # logging/plotting utilities
│   ├── utils.py                               # data, normalization, experiment naming, helpers
│   ├── models/
│   │   ├── models.py                          # small 2D models
│   │   └── unet/
│   │       ├── unet.py                        # UNet model
│   │       ├── nn.py                          # layers/blocks
│   │       ├── fp16_util.py                   # mixed precision helpers
│   │       └── logger.py                      # simple logger
│   ├── __init__.py
│   └── version.py
├── runner/
│   ├── configs/                               # (legacy) lightning/runner configs
│   ├── scripts/
│   ├── src/
│   ├── tests/
│   └── README.md
├── tests/
│   ├── test_conditional_flow_matcher.py
│   ├── test_optimal_transport.py
│   ├── test_models.py
│   └── test_time_t.py
├── environment.yaml                           # conda/pip deps (consider adding clean-fid)
├── pyproject.toml / setup.py                  # packaging
└── README.md
```


## Setup

```bash
# Create conda env (example)
conda env create -f environment.yaml
conda activate torchcfm
```

If multiple GPUs are visible, Clean-FID may internally enable `torch.nn.DataParallel` and trigger NCCL issues on some systems. Prefer exposing a single GPU via `CUDA_VISIBLE_DEVICES` during FID computation.


## CIFAR10 Training — `examples/images/cifar10/train_cifar10.py`

Key flags

- Model/output
  - `--model` one of: `otcfm | sinkhorn_otwfm | sinkhorn_otcfm | sinkhorn_otwfm_dv | icfm | itfm | si`
  - `--output_dir` default `./results/`
  - `--device` default `cuda:0` (when exposing a single GPU, keep `cuda:0`)
  - `--parallel` enables DataParallel (recommended: False)
- Optimization
  - `--lr` (default 2e-4), `--grad_clip` (default 1.0)
  - `--total_steps` (default 400001), `--warmup` (default 5000)
  - `--batch_size` (default 128), `--num_workers` (default 32)
  - `--ema_decay` (default 0.9999)
  - `--resume_step` checkpoint resume step (0 for fresh start)
  - `--save_step` checkpoint/logging frequency (0 disables periodic save)
- Dataset/normalization
  - `--dataset_name` one of: `cifar10 | cifar10_lt | cifar100_lt | cifar10_lt_regacy`
  - `--data_root` default None → internally set to `./data`
  - `--data_norm` one of: `adaptive | default | cifar10 | cifar100 | cifar10_lt | cifar100_lt`
    - adaptive: compute mean/std from the dataset and normalize to zero-mean/unit-std
    - default: fixed mean/std `(0.5, 0.5, 0.5)`
- OT/weighting (for OT-CFM/Sinkhorn variants)
  - `--method` one of: `unbalanced_knopp | unbalanced`
  - `--reg`, `--tau_b`, `--normalize_cost`
  - `--recoupling`, `--fixed_source`, `--fixed_target`
  - `--weight_type` one of: `none | inv_tnu`
  - `--weight_power_factor`
  - `--efm`, `--beta` (energy-weighted flow matching)
- Architecture
  - `--num_channel` (default 128), other UNet specifics are defined in code

Example

```bash
# Train on a single physical GPU (id=2) → local index becomes 0
CUDA_VISIBLE_DEVICES=2 \
python examples/images/cifar10/train_cifar10.py \
  --dataset_name cifar10_lt \
  --model sinkhorn_otwfm \
  --method unbalanced_knopp \
  --reg 0.05 --tau_b 1.0 \
  --weight_type inv_tnu --weight_power_factor 0.7 \
  --fixed_source True \
  --batch_size 128 \
  --device cuda:0
```
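The two documented `--data_norm` modes can be illustrated with a small standalone sketch. It is not the repository's implementation: `channel_stats` and `normalize` are hypothetical helper names, images are plain nested lists in `[channel][row][col]` layout with values in `[0, 1]`, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

# One image as nested lists: [channel][row][col], values assumed in [0, 1].
Image = Sequence[Sequence[Sequence[float]]]


def channel_stats(images: Sequence[Image]) -> Tuple[List[float], List[float]]:
    """Per-channel mean/std over a whole dataset (the 'adaptive' idea)."""
    means, stds = [], []
    for c in range(len(images[0])):
        values = [v for img in images for row in img[c] for v in row]
        means.append(fmean(values))
        stds.append(pstdev(values))  # assumption: population std
    return means, stds


def normalize(image: Image, mean: Sequence[float], std: Sequence[float]):
    """Apply (x - mean) / std channel by channel."""
    return [
        [[(v - m) / s for v in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]


# 'default' mode: fixed mean/std of 0.5 maps [0, 1] onto [-1, 1].
DEFAULT_MEAN = (0.5, 0.5, 0.5)
DEFAULT_STD = (0.5, 0.5, 0.5)
```

With `adaptive`, the statistics come from `channel_stats(dataset)`, so the normalized dataset has zero mean and unit standard deviation per channel; with `default`, the fixed constants are used regardless of the dataset.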


## Evaluation — `examples/images/cifar10/compute_eval.py`

Key flags

- Mode
  - `--measure_fid` (default True)
  - `--measure_precall` (precision/recall evaluation)
  - `--measure_likelihood` (per-sample CNF log-likelihood evaluation → CSV)
  - `--gen_external_path`, `--data_external_path` (use external dirs for generated/real images)
  - `--method_pr` one of: `fast | slow`
- Sampling/integration
  - `--integration_method` (e.g., `dopri5`, `euler`), `--integration_steps`, `--tol`
  - `--num_gen` (samples to generate), `--batch_size_fid`
  - `--device` default `cuda:0`
  - Likelihood-specific: `--batch_size_ll`, `--trace_estimator_ll [hutch|exact]`, `--ll_max_samples`, `--ll_output_csv`
- Checkpoints/paths
  - `--input_dir` default `./results`
  - `--directory` one of: `none | auto | <manual-subdir>`
    - none: path built with `--model` and `--training_params`
    - auto: use `exp_naming(FLAGS)` to compute subdir
  - `--model`, `--training_params`, and `--step` are used to build the checkpoint path (see the rules below)
- Model/OT (subset shared with training)
  - `--dataset_name`, `--weight_type`, `--reg`, `--tau_b`, `--efm`, `--beta`, `--weight_power_factor`, `--parallel`, `--recoupling`, `--fixed_source`, `--fixed_target`

Checkpoint resolution rules

- `directory == auto` → `./results/<exp_naming(FLAGS)>/<model>_<dataset_name>_weights_step_<step>.pt`
- `directory == none` and `training_params != None` → `./results/<model>_<training_params>/<model>_<dataset_name>_weights_step_<step>.pt`
- else → `./results/<directory>/<model>_<dataset_name>_weights_step_<step>.pt`
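The three rules above can be transcribed into a short sketch. This is a literal reading of the list, not the code in `compute_eval.py`: `resolve_checkpoint` is a hypothetical helper, `auto_name` stands in for whatever `exp_naming(FLAGS)` returns, and `run1` is a made-up `--training_params` value.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(
    input_dir: str,
    directory: str,
    model: str,
    dataset_name: str,
    step: int,
    training_params: Optional[str] = None,
    auto_name: Optional[str] = None,
) -> Path:
    """Build the checkpoint path following the three documented rules."""
    filename = f"{model}_{dataset_name}_weights_step_{step}.pt"
    if directory == "auto":
        # auto_name is a placeholder for the result of exp_naming(FLAGS).
        if auto_name is None:
            raise ValueError("directory='auto' needs the exp_naming(FLAGS) result")
        subdir = auto_name
    elif directory == "none" and training_params is not None:
        subdir = f"{model}_{training_params}"
    else:
        # Manual sub-directory; the 'else' rule is applied literally.
        subdir = directory
    return Path(input_dir) / subdir / filename


path = resolve_checkpoint(
    "./results", "none", "otcfm", "cifar10", 400000, training_params="run1"
)
print(path.as_posix())
```

Note that the rules as written do not single out `directory == none` with no `--training_params`; the sketch lets that case fall into the final branch, which may differ from the script's behavior.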

Examples

```bash
# FID on a single visible GPU with smaller batch size for memory headroom
# ! Using CUDA_VISIBLE_DEVICES is highly recommended in multi-GPU environments !
CUDA_VISIBLE_DEVICES=0 \
python examples/images/cifar10/compute_eval.py \
  --dataset_name cifar10_lt \
  --model sinkhorn_otwfm \
  --directory auto \
  --weight_type inv_tnu --reg 0.05 --tau_b 1.0 \
  --weight_power_factor 0.7 --fixed_source True \
  --batch_size_fid 256 \
  --device cuda:0

# Compare two external folders (generated vs real)
CUDA_VISIBLE_DEVICES=0 \
python examples/images/cifar10/compute_eval.py \
  --gen_external_path /path/to/generated \
  --data_external_path /path/to/real \
  --num_gen 50000 --batch_size_fid 256 \
  --device cuda:0
```

### Pointwise CNF Likelihood

Definition

- For a continuous flow with vector field \(u_t(x)\) and base density \(p_0\) (standard normal), the log-density along trajectories satisfies
  \[ \frac{d}{dt} \log p_t(x_t) = -\mathrm{div}_x\, u_t(x_t). \]
  For a data point \(x_1\), integrating from \(t=1\) to \(0\) with the augmented ODE
  \(\dot x = u_t(x),\; \dot s = -\mathrm{div}\,u_t(x)\), gives
  \[ \log p_1(x_1) = \log p_0(x_0) - \int_0^1 \mathrm{div}\, u_t(x_t)\,dt, \]
  where \(x_0\) is the mapped base sample at \(t=0\).
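As a sanity check of the formula, consider a toy case where everything is known in closed form. The sketch below assumes a 1D linear field \(u_t(x) = A x\) (so \(\mathrm{div}\,u_t = A\) and \(p_1 = \mathcal{N}(0, e^{2A})\)); the constant `A` and all function names are hypothetical and unrelated to the trained model.

```python
import math

A = 0.5  # hypothetical constant: u_t(x) = A * x, hence div u_t = A everywhere


def u(t: float, x: float) -> float:
    return A * x


def div_u(t: float, x: float) -> float:
    return A


def log_standard_normal(x: float) -> float:
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x


def log_likelihood_euler(x1: float, steps: int = 1000) -> float:
    """Integrate the augmented ODE backward from t=1 to t=0 with explicit Euler."""
    dt = 1.0 / steps
    x, s = x1, 0.0
    for k in range(steps):
        t = 1.0 - k * dt
        s += div_u(t, x) * dt  # ds/dt = -div u, traversed backward in time
        x -= u(t, x) * dt      # Euler step from t to t - dt
    # x now approximates x_0 and s approximates the integral of div u over [0, 1].
    return log_standard_normal(x) - s


def log_likelihood_exact(x1: float) -> float:
    """Closed form: x_1 = exp(A) * x_0 with x_0 ~ N(0, 1)."""
    var = math.exp(2.0 * A)
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * x1 * x1 / var


print(log_likelihood_euler(1.3), log_likelihood_exact(1.3))
```

The two printed values should agree up to the Euler discretization error, which shrinks as `steps` grows.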

How it is computed here

- We reuse the trained `UNetModelWrapper` as \(u_t\).
- We augment the state to `[s, x]` and integrate. Two integrators are available:
  - Manual Euler (default): memory-friendly, per-step graph release
  - `odeint` (adaptive): higher cost/memory
- Time direction and integrand sign are chosen so that the accumulated value is \(s = \int_0^1 \mathrm{div}\,u_t(x_t)\,dt\):
  - Backward (default): integrate \(t:1\to0\) with \(\dot s = -\mathrm{div}\,u\)
  - Forward: integrate \(t:0\to1\) with \(\dot s = +\mathrm{div}\,u\)
- By default, the divergence is estimated with Hutchinson’s estimator (Rademacher noise, optional MC averaging). `--trace_estimator_ll=exact` computes the exact trace via autograd (very slow).
- After integration, for each sample, `log p0(x0)` is computed under a standard normal in \(\mathbb{R}^{3\times32\times32}\), and `log p1 = log p0 - s` is written to CSV.
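The idea behind Hutchinson's estimator can be shown without autograd. The sketch below assumes the Jacobian of the vector field is available as an explicit matrix, which is only feasible for toy problems; in the actual pipeline the product with the probe vector would come from automatic differentiation. `hutchinson_trace` and the example matrix are hypothetical.

```python
import random
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def hutchinson_trace(jacobian: Matrix, num_samples: int, rng: random.Random) -> float:
    """Estimate tr(J) as the average of eps^T J eps over Rademacher probes."""
    dim = len(jacobian)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        eps = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        # J @ eps; with autograd this is one Jacobian-vector product.
        j_eps = [sum(row[j] * eps[j] for j in range(dim)) for row in jacobian]
        total += sum(e * v for e, v in zip(eps, j_eps))
    return total / num_samples


J = [[1.0, 2.0, 0.0], [0.5, -3.0, 1.0], [0.0, 4.0, 2.5]]  # trace = 0.5
print(hutchinson_trace(J, num_samples=20000, rng=random.Random(0)))
```

Here `num_samples` plays the role of the MC averaging count: a single probe is unbiased but noisy, and averaging more probes reduces the variance at proportional cost. For a diagonal Jacobian the Rademacher estimate is exact even with one probe, since every squared entry of the probe equals 1.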

Usage

```bash
CUDA_VISIBLE_DEVICES=0 \
python examples/images/cifar10/compute_eval.py \
  --dataset_measure=cifar10,cifar10_lt \
  --measure_likelihood=True \
  --likelihood_split=train \
  --batch_size_ll=64 \
  --trace_estimator_ll=hutch \
  --ll_max_samples=5000 \
  --data_norm=default \
  --ll_manual_euler=True \
  --ll_euler_steps=20 \
  --ll_time_direction=backward \
  --ll_midpoint=False \
  --ll_trace_noise=rademacher \
  --ll_trace_mc=1 \
  --device=cuda:0 \
  --directory=auto \
  --model=sinkhorn_otwfm \
  --dataset_name=cifar10 \
  --step=400000
```

Outputs

- For each dataset in `--dataset_measure`, a CSV is produced in the run’s results folder (or at `--ll_output_csv`):
  `ll_<dataset>_step_<step>.csv` with columns: `index,label,loglik`.
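A produced CSV can be summarized per class with the standard library alone. The sketch below assumes the file starts with a header row `index,label,loglik`; `mean_loglik_by_label` is a hypothetical helper and the three data rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def mean_loglik_by_label(handle: TextIO) -> Dict[str, float]:
    """Average the loglik column for each distinct label."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(handle):
        sums[row["label"]] += float(row["loglik"])
        counts[row["label"]] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Made-up rows standing in for a real ll_<dataset>_step_<step>.csv file.
demo = io.StringIO("index,label,loglik\n0,3,-10.0\n1,3,-12.0\n2,7,-8.5\n")
print(mean_loglik_by_label(demo))
```

For a real file, pass an open handle instead, for example `open(csv_path, newline="")`, where `csv_path` points to the CSV in the run's results folder.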

Notes

- Keep `--data_norm` consistent with the training run; otherwise, likelihood is not comparable.
- `hutch` is recommended. `exact` is prohibitively slow for 3×32×32.
- If runtime/memory is high, reduce `--batch_size_ll`, relax `--tol`, or cap samples via `--ll_max_samples`.

Caveats/tips

- If multiple GPUs are visible, Clean-FID may wrap its feature extractor with `DataParallel`, which can trigger NCCL issues. Expose a single GPU and keep `--device cuda:0`.
- “invalid device ordinal” means your local device index does not exist; when using `CUDA_VISIBLE_DEVICES=<phys-id>`, always run with `--device cuda:0`.
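The index remapping behind the "invalid device ordinal" error can be made explicit with a small sketch. It assumes `CUDA_VISIBLE_DEVICES` holds a comma-separated list of integer GPU ids (UUID-style entries are not handled); `device_flag` is a hypothetical helper, not part of the repository.

```python
from typing import List


def visible_devices(cuda_visible_devices: str) -> List[int]:
    """Parse a value such as '2' or '2,5' into a list of physical GPU ids."""
    return [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]


def device_flag(cuda_visible_devices: str, physical_id: int) -> str:
    """Return the --device value that addresses physical_id inside the process."""
    devices = visible_devices(cuda_visible_devices)
    if physical_id not in devices:
        raise ValueError(f"GPU {physical_id} is not visible: {devices}")
    # Visible GPUs are renumbered from 0 in the order they are listed.
    return f"cuda:{devices.index(physical_id)}"


print(device_flag("2", 2))
print(device_flag("2,5", 5))
```

With a single exposed GPU the only valid local index is 0, which is why `--device cuda:0` is the right choice regardless of the physical id.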


## Troubleshooting

- NCCL errors (e.g., `unhandled cuda error`): expose a single GPU, reduce `--batch_size_fid`, optionally `export NCCL_P2P_DISABLE=1; export NCCL_IB_DISABLE=1`.
- Clean-FID stats missing: run `cleanfid download` once.
- Dataset paths: when `--data_root` is absent, CIFAR is downloaded under `./data`.


## License and Credits

- This derivative work is distributed under the MIT License. The original license and notices are preserved; modifications are documented here.
- Original notice: Portions © 2023 Original Authors, used under the MIT License.
- Changes and additions: © 2025 Anonymous Contributors, licensed under MIT.


