# DVM-AD Bundle (47 Tabular + 10 NLP/CV + Synthetic)

This folder is a self-contained bundle for running the DVM-AD pipelines:
- **Tabular real-world**: 47 datasets
- **NLP/CV**: 10 aggregated groups for DVM-AD (baselines run on all sub-datasets and are aggregated later)
- **Synthetic**: 4 types (cluster/global/dependency/local)

All paths below are **relative to this `DVM-AD/` folder**. Run all commands from here.

## Folder Structure

- `Model/`: DVM-AD + baseline model implementations
- `scripts/`: run/process/plot scripts
  - `scripts/tabular/`: tabular run + processing
  - `scripts/nlp_cv/`: NLP/CV run + processing
  - `scripts/synthetic_data/`: synthetic data generator
  - `scripts/drawing/`: plots/tables
- `process_tabular/`: tabular raw + processed outputs
- `process_nlp_cv/`: NLP/CV raw + processed outputs
- `final_results/`: final CSVs, rankings, figures, tables
- `Data/`: dataset inputs (you must place data files here)

## Data Layout (required)

Place datasets under `Data/`:

- **Tabular (47 real-world):**
  - `Data/Classical/<dataset>.npz`
- **NLP (BERT features):**
  - `Data/NLP_by_BERT/<dataset>.npz`
- **CV (ResNet18 features):**
  - `Data/CV_by_ResNet18/<dataset>.npz`
- **Synthetic (generated):**
  - `Data/Synthetic_Datasets/<type>_outliers_datasets/<dataset>_X.csv`
  - `Data/Synthetic_Datasets/<type>_outliers_datasets/<dataset>_y.csv`
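
Before launching a pipeline, it can help to confirm that the files sit where the layout above expects them. The following is a minimal sketch, not part of the bundle: the function name `summarize_data_layout` is hypothetical, and it only counts files by name pattern without opening them.

```python
from pathlib import Path

# Sub-folders named in the layout above; the four synthetic types come
# from the bundle description (cluster/global/dependency/local).
NPZ_DIRS = ["Classical", "NLP_by_BERT", "CV_by_ResNet18"]
SYNTHETIC_TYPES = ["cluster", "global", "dependency", "local"]


def summarize_data_layout(data_root):
    """Count dataset files per expected sub-folder (0 if the folder is missing)."""
    root = Path(data_root)
    summary = {}
    for name in NPZ_DIRS:
        folder = root / name
        summary[name] = len(list(folder.glob("*.npz"))) if folder.is_dir() else 0
    for kind in SYNTHETIC_TYPES:
        folder = root / "Synthetic_Datasets" / f"{kind}_outliers_datasets"
        count = 0
        if folder.is_dir():
            for x_file in folder.glob("*_X.csv"):
                # Count a dataset only when its matching *_y.csv also exists.
                y_name = x_file.name[: -len("_X.csv")] + "_y.csv"
                if (folder / y_name).exists():
                    count += 1
        summary[f"synthetic/{kind}"] = count
    return summary
```

Calling `summarize_data_layout("Data")` from the `DVM-AD/` folder returns a dictionary of counts per sub-folder; a zero for a folder you expected to be populated usually points to a misplaced or misnamed file.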

## Pipeline A — Tabular Real-World (47 datasets)

1. Run DVM-AD:
   - `python scripts/tabular/run/run_dvmad_realworld_data.py`
2. Run baselines:
   - `python scripts/tabular/run/run_baseline_realworld_data.py`
3. Select best per dataset:
   - `python scripts/tabular/process/Handle_data.py`
4. Copy final 47 CSVs:
   - `python scripts/tabular/process/build_final_47.py`
5. Rank results:
   - `python scripts/tabular/process/Ranking.py`

Outputs:
- Raw: `process_tabular/raw_data/dvmad_result_realworld_data.csv`
- Baseline: `process_tabular/raw_data/Baseline_result_realworld_data.csv`
- Best-per-dataset: `process_tabular/processed_results/dvmad_result_realworld_data.csv`
- Final: `final_results/results/*_tabular_47_datasets.csv`
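
The five steps above can also be chained from a single Python entry point. This is a sketch under the assumption that each script runs without arguments, as the commands above suggest; `build_commands` and `run_pipeline` are hypothetical helpers, not files in this bundle.

```python
import subprocess
import sys

# Script paths copied from the numbered steps above, in order.
TABULAR_STEPS = [
    "scripts/tabular/run/run_dvmad_realworld_data.py",
    "scripts/tabular/run/run_baseline_realworld_data.py",
    "scripts/tabular/process/Handle_data.py",
    "scripts/tabular/process/build_final_47.py",
    "scripts/tabular/process/Ranking.py",
]


def build_commands(steps, python=sys.executable):
    """Turn script paths into argv lists for the given interpreter."""
    return [[python, step] for step in steps]


def run_pipeline(steps, cwd="."):
    """Run each step in order; check=True stops at the first failing script."""
    for cmd in build_commands(steps):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_pipeline(TABULAR_STEPS)` from the `DVM-AD/` folder runs the steps in the listed order. Stopping on the first failure avoids selecting or ranking results from an incomplete run.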

## Pipeline B — NLP/CV (10 datasets)

1. Run DVM-AD:
   - `python scripts/nlp_cv/run/run_dvmad_realworld_data_nlp_cv.py`
2. Aggregate to 10 datasets:
   - `python scripts/nlp_cv/process/aggregate_nlp_cv_10_datasets.py`
3. Run baselines (per sub-dataset results; file name kept for compatibility):
   - `python scripts/nlp_cv/run/run_baseline_realworld_data_nlp_cv.py`
4. Copy final 10 CSVs:
   - `python scripts/nlp_cv/process/build_final_10.py`

Outputs:
- Raw: `process_nlp_cv/raw_data/dvmad_result_realworld_data.csv`
- Aggregated: `process_nlp_cv/processed_results/dvmad_result_nlp_cv_10_datasets.csv`
- Baseline: `process_nlp_cv/processed_results/Baseline_result_nlp_cv_10_datasets.csv`
- Final: `final_results/results/*_nlp_cv_10_datasets.csv`

Notes:
- Baseline NLP/CV results are **aggregated to 10 groups during ranking** in `scripts/drawing/Ranking.py` (it maps sub-datasets to the 10 groups and averages per model).
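
The group-and-average step described in the note can be illustrated as follows. This is only a sketch of the idea: the row layout `(sub_dataset, model, score)`, the example names, and the `<group>_<index>` naming rule are all assumptions for illustration, and the actual mapping is the one defined in `scripts/drawing/Ranking.py`.

```python
from collections import defaultdict


def aggregate_by_group(rows, group_of):
    """Average a score per (group, model).

    rows: iterable of (sub_dataset, model, score) tuples (hypothetical layout).
    group_of: function mapping a sub-dataset name to its group name.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (group, model) -> [sum, count]
    for sub_dataset, model, score in rows:
        key = (group_of(sub_dataset), model)
        totals[key][0] += score
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical example: sub-datasets named "<group>_<index>".
rows = [
    ("groupA_0", "model1", 0.5),
    ("groupA_1", "model1", 1.0),
    ("groupB_0", "model1", 0.25),
]
averages = aggregate_by_group(rows, lambda name: name.rsplit("_", 1)[0])
print(averages)
```

Here the two `groupA` rows collapse into one averaged entry per model, which is how a larger set of sub-dataset results reduces to one row per group.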

## Pipeline C — Synthetic (cluster/global/dependency/local)

1. Generate synthetic datasets (from `Data/Classical/*.npz`):
   - `python scripts/synthetic_data/create_synthetic_data.py`
2. Run DVM-AD on synthetic data:
   - `python scripts/tabular/run/run_dvmad_synthetic_data.py`
3. Run synthetic baselines:
   - `python scripts/tabular/run/run_baseline_synthetic_data.py`
4. Select best per dataset:
   - `python scripts/tabular/process/Handle_data.py`
5. Rank results:
   - `python scripts/tabular/process/Ranking.py`

Outputs:
- Synthetic data: `Data/Synthetic_Datasets/<type>_outliers_datasets/*_X.csv` + `*_y.csv`
- DVM-AD: `process_tabular/raw_data/dvmad_result_<type>_synthetic_data.csv`
- Baseline: `process_tabular/raw_data/Baseline_synthentic_<type>.csv`
- Best-per-dataset: `process_tabular/processed_results/dvmad_result_<type>_synthetic_data.csv`

## Optional: Figures + Tables

After final CSVs and rankings exist:

1. Build ranking files for real-world 10/47 results:
   - `python scripts/drawing/Ranking.py`
2. Plot and summarize:
   - `python scripts/drawing/Draw_Boxplot.py`
   - `python scripts/drawing/Draw_Boxplot_Time.py`
   - `python scripts/drawing/Draw_CDD.py`
   - `python scripts/drawing/Table_Average.py`

Outputs:
- `final_results/figures/`
- `final_results/tables/`

## Notes

- All scripts auto-resolve the repo root by locating `process_tabular/`.
- If you only need scripts, use the `.py` files.
- Data preprocessing is not included; you need to preprocess the data yourself.
- The import path for `Model` may vary by OS. If you hit an import error, adjust the path for your environment.
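
If a script cannot import `Model`, one possible workaround is to locate the repo root by the same `process_tabular/` marker and put it on `sys.path`. The sketch below shows that idea; it is not the scripts' own resolution code, and `find_repo_root` and `add_repo_root_to_path` are hypothetical helper names.

```python
import sys
from pathlib import Path


def find_repo_root(start, marker="process_tabular"):
    """Walk upward from `start` until a folder containing `marker/` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}/'")


def add_repo_root_to_path(start):
    """Put the repo root first on sys.path so `Model` can be imported."""
    root = find_repo_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

Calling `add_repo_root_to_path(__file__)` at the top of a script, before any import from `Model`, keeps the import independent of the working directory and of OS-specific path separators.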
