# AugMask (share)

Public-facing subset of AugMask. It covers training, sampling, and evaluation for **CDTD**, **TabDiff**, and **TabDDPM** under strategies `0` (aug_full) and `2` (aug_mask).

## Setup

**Environments**

- **augmask**: training and sampling  
  `conda env create -f augmask.yml -n augmask`
- **synthcity**: evaluation  
  `conda env create -f evaluator.yml -n synthcity`

If `augmask.yml` or `evaluator.yml` is not included here, use the copies from the parent AugMask repo.

**Data**

1. All data lives under **`data/`** (see `data/README.md`):
   - `data/<dataname>/`: evaluator inputs (`info.json`, `real.csv`, `test.csv`, `val.csv`).
   - `data/<dataname>_<dt>/<pattern>_p<p>_<seed>/`: training inputs (`info.json` and `.npy` files).
   - `data/synthetic/`: generated samples.
2. To create these inputs from raw datasets, follow the parent AugMask README / QUICK_START.
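
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repo: `training_dir` and `missing_inputs` are hypothetical helpers, and the README does not list the `.npy` file names, so the check only requires that at least one `.npy` file exists. How `<dt>` and `<p>` are spelled in the directory name is also an assumption; pass whatever strings your preprocessing produced.

```python
from pathlib import Path

# Evaluator file names as listed in this README.
EVAL_FILES = ("info.json", "real.csv", "test.csv", "val.csv")


def training_dir(root, dataname, dt, pattern, p, seed):
    """Build data/<dataname>_<dt>/<pattern>_p<p>_<seed>/ (hypothetical helper)."""
    return Path(root) / f"{dataname}_{dt}" / f"{pattern}_p{p}_{seed}"


def missing_inputs(root, dataname, dt, pattern, p, seed):
    """Return the expected paths that are absent under `root`."""
    root = Path(root)
    # Evaluator inputs: data/<dataname>/{info.json, real.csv, test.csv, val.csv}
    missing = [root / dataname / name for name in EVAL_FILES
               if not (root / dataname / name).is_file()]
    # Training inputs: info.json plus at least one .npy file (names unknown here).
    train = training_dir(root, dataname, dt, pattern, p, seed)
    if not (train / "info.json").is_file():
        missing.append(train / "info.json")
    if not (train.is_dir() and any(train.glob("*.npy"))):
        missing.append(train / "*.npy")
    return missing
```

Calling `missing_inputs("data", "adult", <dt>, "NU1", "0.5", 2)` with your own `<dt>` value returns an empty list when everything is in place.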

## Run

From the **AugMask_share** root:

```bash
./run_pipeline.sh <gpu_id>
```

This runs the full pipeline (train for 30000 steps → sample → evaluate) once per preproc: `m`, `r`, `LGB_D`, `LGB_S`, `miceforest`, `zero`, `noise`. To change the dataset, ratio, or preprocs, edit the variables at the top of `run_pipeline.sh`.

**Supported preprocs:** `LGB_D`, `LGB_S`, `r`, `m`, `miceforest`, `zero`, `noise`. Set `PREPROC` in the script to one of these.

**Single run (from `augmask/`):**

```bash
cd augmask
conda activate augmask
python main.py --model cdtd --mode train \
  --cfg_path=configs/cdtd/default_bytype.yaml \
  --exp_path= --data adult --cov both --p 0.5 --gpu 0 \
  --preproc m --extra aug_mask --strategy 2 --noise_seed 2 --pattern NU1
```
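
To repeat the single run above for several preprocs without editing `run_pipeline.sh`, the command can be assembled programmatically. This is a sketch under assumptions: `train_command` is a hypothetical helper, it uses only the flags shown in the command above, and it does not claim to mirror what `run_pipeline.sh` does internally.

```python
import shlex

# Preproc names as listed in this README.
PREPROCS = ["m", "r", "LGB_D", "LGB_S", "miceforest", "zero", "noise"]


def train_command(preproc, data="adult", p=0.5, gpu=0):
    """Return the argv list for one CDTD training run (hypothetical helper)."""
    if preproc not in PREPROCS:
        raise ValueError(f"unsupported preproc: {preproc!r}")
    return [
        "python", "main.py", "--model", "cdtd", "--mode", "train",
        "--cfg_path=configs/cdtd/default_bytype.yaml", "--exp_path=",
        "--data", data, "--cov", "both", "--p", str(p), "--gpu", str(gpu),
        "--preproc", preproc, "--extra", "aug_mask", "--strategy", "2",
        "--noise_seed", "2", "--pattern", "NU1",
    ]


# Print one command per preproc; pass the list to subprocess.run(cmd, check=True)
# from inside augmask/ to actually launch a run.
for name in PREPROCS:
    print(shlex.join(train_command(name)))
```

Building the command as a list rather than a string keeps the empty `--exp_path=` value intact and avoids shell quoting issues.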

## Layout

- **augmask/** – Training code: `main.py`, diffusion code, data prep (`data/`), experiments (CDTD, TabDiff, TabDDPM), configs, and `experiments/models/` (layers, tabddpm, tabdiff). Checkpoints and logs are written to **`augmask/results/<run_name>/`** (`model.pt`, `ema_model.pt`, etc.).
- **data/** – All data: `data/<dataname>/` (`info.json`, `real.csv`, `test.csv`, `val.csv` for the evaluator), `data/<dataname>_<dt>/<pattern>_p<p>_<seed>/` (training `.npy` files + `info.json`), `data/synthetic/` (generated samples).
- **evaluation/** – Post-hoc evaluator (synthcity env): `evaluator.py`, `metrics.py`, `eval/`. Reads from **data/** and writes to `eval/report_runs/`.

## Models

- **cdtd** – Continuous diffusion for tabular data (mixed-type, timewarp, calibrate_losses).
- **tabdiff** – TabDiff (continuous-time masked diffusion).
- **tabddpm** – TabDDPM (Gaussian–multinomial diffusion).

Strategies: `0` = aug_full (loss on all features), `2` = aug_mask (loss only on observed features).
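
The difference between the two strategies can be illustrated on a single row. The sketch below is not the repo's loss: it uses plain mean squared error as a stand-in for the model-specific diffusion losses, `masked_mse` is a hypothetical name, and averaging over the number of observed features is an assumption about normalization.

```python
def masked_mse(pred, target, observed, strategy):
    """Mean squared error over one row.

    strategy 0 (aug_full): every feature contributes to the loss.
    strategy 2 (aug_mask): only features with observed[i] == True contribute.
    """
    if strategy == 0:
        keep = [True] * len(pred)
    elif strategy == 2:
        keep = [bool(o) for o in observed]
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    n = sum(keep)
    if n == 0:
        return 0.0  # no observed features, so nothing to penalize
    return sum((a - b) ** 2 for a, b, k in zip(pred, target, keep) if k) / n


pred = [1.0, 2.0, 3.0]
target = [1.0, 0.0, 0.0]        # last value is an imputed placeholder
observed = [True, True, False]  # last feature was missing in the raw data

print(masked_mse(pred, target, observed, strategy=0))  # averages over all 3 features
print(masked_mse(pred, target, observed, strategy=2))  # averages over the 2 observed features
```

Under strategy `2` the error on the imputed third feature is ignored, so the quality of the imputation does not feed into the training signal for that entry.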
