# Data directory

All inputs and outputs for training and evaluation live under `data/`.

## Layout

- **`data/<dataname>/`** – Evaluator inputs: `info.json`, `real.csv`, `test.csv`, `val.csv`. These are the real data splits that the evaluator compares generated samples against.
- **`data/<dataname>_semi_xy/<pattern>_p<p>_<seed>/`** – Preprocessed training data. Keep one or two such folders; the default pipeline uses `adult_semi_xy/NU1_p0.5_2/`. Each folder must contain:
  - `info.json`
  - `X_num_train.npy`, `X_num_test.npy`, `X_cat_train.npy`, `X_cat_test.npy`, `y_train.npy`, `y_test.npy`
  - `mask_num_train.npy`, `mask_num_test.npy`, `mask_cat_train.npy`, `mask_cat_test.npy`, `y_mask_train.npy`, `y_mask_test.npy`
- **`data/synthetic/<method>-<extra>/<dataname>/`** – Generated samples (CSV), written by training and read by the evaluator. Created at run time, so it can be gitignored.
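
The file list for a preprocessed training folder can be checked mechanically before a run. The following is a minimal sketch, not part of the repo: the helper name `missing_files` is hypothetical, and it only tests that each required file exists, not that its contents are valid.

```python
from pathlib import Path

# Required file names, copied from the layout list above.
REQUIRED_FILES = (
    ["info.json"]
    + [
        f"{name}_{split}.npy"
        for name in ("X_num", "X_cat", "mask_num", "mask_cat")
        for split in ("train", "test")
    ]
    + ["y_train.npy", "y_test.npy", "y_mask_train.npy", "y_mask_test.npy"]
)


def missing_files(folder):
    """Return the required file names absent from `folder`, sorted (hypothetical helper)."""
    folder = Path(folder)
    return sorted(name for name in REQUIRED_FILES if not (folder / name).is_file())
```

Calling `missing_files("data/adult_semi_xy/NU1_p0.5_2")` returns an empty list when the folder is complete, and otherwise names what still has to be generated.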

## Example: adult

1. **Evaluator inputs** – `data/adult/` with `info.json`, `real.csv`, `test.csv`, `val.csv` (same schema as training `info.json`).

2. **Training inputs** – One or two folders under `data/adult_semi_xy/`. The default `run_pipeline.sh` expects `data/adult_semi_xy/NU1_p0.5_2/` (pattern NU1, p=0.5, seed=2). To create them, use the parent AugMask repo’s data preparation or `data/create_minimal_preprocessed.py` (which requires `data/adult/` to already contain the `.npy` files).

3. **Generated samples** – Training writes to `data/synthetic/<method>-<extra>/adult/`. The evaluator reads those CSVs together with `data/adult/` to compute metrics.
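
The folder naming used in this example can be written as two small path builders. This is an illustrative sketch only: `training_dir` and `synthetic_dir` are hypothetical names, and the way `p` is rendered (Python's default formatting, so `0.5` becomes `0.5`) is an assumption about how the pipeline names its folders. Pass `p` as a string if the formatting needs to be exact.

```python
from pathlib import Path


def training_dir(dataname, pattern, p, seed, root="data"):
    """Build data/<dataname>_semi_xy/<pattern>_p<p>_<seed> (hypothetical helper)."""
    return Path(root) / f"{dataname}_semi_xy" / f"{pattern}_p{p}_{seed}"


def synthetic_dir(dataname, method, extra, root="data"):
    """Build data/synthetic/<method>-<extra>/<dataname> (hypothetical helper)."""
    return Path(root) / "synthetic" / f"{method}-{extra}" / dataname


# The default training folder from this README: pattern NU1, p=0.5, seed=2.
print(training_dir("adult", "NU1", 0.5, 2).as_posix())
# prints: data/adult_semi_xy/NU1_p0.5_2
```

The `method` and `extra` values depend on the training configuration and are not specified here, so `synthetic_dir` takes them as plain arguments.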

## Run

From **AugMask_share** root:

```bash
./run_pipeline.sh <gpu_id>
```

For evaluation only, ensure `data/adult/` has `real.csv`, `test.csv`, `val.csv`, and `info.json`. For training, ensure `data/adult_semi_xy/NU1_p0.5_2/` (or the folder for the pattern, p, and seed set in `run_pipeline.sh`) exists with the required `.npy` files; see the parent AugMask repo or `data/create_minimal_preprocessed.py`.
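
A quick preflight for the evaluator inputs might look like the sketch below. It is not part of the repo: `check_evaluator_inputs` is a hypothetical helper that only confirms the four files exist and that `info.json` parses as JSON. It does not validate the schema or the CSV contents.

```python
import json
from pathlib import Path

# Evaluator inputs named in this README.
EVALUATOR_FILES = ("info.json", "real.csv", "test.csv", "val.csv")


def check_evaluator_inputs(folder):
    """Return a list of problems found in `folder`; an empty list means none were found."""
    folder = Path(folder)
    problems = [
        f"missing {name}" for name in EVALUATOR_FILES if not (folder / name).is_file()
    ]
    info_path = folder / "info.json"
    if info_path.is_file():
        try:
            json.loads(info_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"info.json is not valid JSON: {exc}")
    return problems
```

Running `check_evaluator_inputs("data/adult")` from the repo root before `./run_pipeline.sh` surfaces missing files early rather than partway through a run.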
