# Causal Residual Data Augmentation for Regression (CRDA)

*A novel data augmentation methodology that improves regression model performance by generating informed synthetic training examples through residual-guided feature perturbation.*

---

## ⚙️ Setup

### Install Dependencies
```bash
python3 -m venv .venv
source .venv/bin/activate        # or .venv\Scripts\Activate on Windows
pip install -r requirements.txt
```

### System Dependencies (Graphviz for causal graph visualization)
```bash
# macOS
brew install graphviz

# Ubuntu/Debian  
sudo apt-get install graphviz
```

### Getting Started (Demo)
To verify your installation and see CRDA in action:

```bash
# Launch interactive demonstration
cd demo/
jupyter notebook demonstration.ipynb
```

The notebook walks through the CRDA methodology step by step, with visualizations. Expected runtime is about 1 minute.

---

## 🔧 Environment & Dependencies

- **OS**: macOS, Linux, Windows
- **Python**: 3.11.7 (tested and recommended)

### Dependencies
Core packages include:
- `scikit-learn==1.7.0rc1`: Machine learning models and evaluation
- `torch==2.7.0`: Neural networks (MLP baseline)
- `xgboost==3.0.1`: Gradient boosting baseline
- `causal-learn==0.1.4.1`: Causal discovery algorithms
- `optuna==4.3.0`: Hyperparameter optimization
- `pandas`, `numpy`, `matplotlib`: Data manipulation and visualization

Full dependencies listed in `requirements.txt`.

---

## 📊 Datasets

### Included Datasets
All datasets are preprocessed and ready-to-use in `./data/`:

| Dataset | Size (rows × features) | Feature type | Task | File |
|---------|------|----------|------|------|
| Energy Efficiency | 768 × 8 | Building characteristics | Energy load prediction | `EnergyEfficiency.csv` |
| House Price | 1000 × 13 | Property features | Price prediction | `HousePrice.csv` |
| Wine Quality | 5320 × 11 | Physicochemical properties | Quality score prediction | `WineQuality.csv` |
| Concrete Strength | 1006 × 8 | Material composition | Compressive strength | `ConcreteCompressiveStrength.csv` |
| Parkinson's Monitoring | 5875 × 16 | Voice measurements | UPDRS score prediction | `ParkinsonsTelemonitoring.csv` |
| CPU Performance | 8192 × 21 | Hardware specs | Performance prediction | `227_cpu_small.csv` |
| Satellite Image | 6435 × 36 | Image features | Target variable prediction | `294_satellite_image.csv` |
| Wind Power | 6574 × 14 | Weather conditions | Power output prediction | `503_wind.csv` |
| Synthetic Regression | 1000 × 10 | Generated features | Synthetic target | `623_fri_c4_1000_10.csv` |

### Dataset Format
All datasets are CSV files with:
- Last column: target variable (continuous)
- All other columns: features (numerical)
- No missing values or categorical variables
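
As an illustration of this layout, here is a minimal, dependency-free sketch of splitting a dataset into features and target. The helper name `load_dataset` is hypothetical (it is not part of this repository), and the sketch assumes the first CSV row holds column names, which is not stated above.

```python
import csv
from pathlib import Path


def load_dataset(path):
    """Return (feature_names, X, y) from a CSV whose last column is the target."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: the first row holds column names
        # All values are numerical, so every cell can be parsed as a float.
        rows = [[float(value) for value in row] for row in reader if row]
    X = [row[:-1] for row in rows]  # every column except the last: features
    y = [row[-1] for row in rows]   # last column: continuous target
    return header[:-1], X, y
```

For example, `load_dataset("data/EnergyEfficiency.csv")` would return the feature names, the feature rows, and the target values. The same split can be done with `pandas.read_csv` followed by `iloc[:, :-1]` and `iloc[:, -1]`.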

---

### Output Directory Structure
```
experiments/
├── {baseline}_{timestamp}/
│   ├── config.json              # experiment parameters
│   ├── results.csv              # aggregated metrics across seeds
│   ├── interim_results/         # per-dataset and per-seed results
│   ├── models/                  # trained model artifacts (if save_models=True)
│   └── params/                  # optimized hyperparameters (if save_params=True)
```
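
To inspect a finished run programmatically, something like the following sketch may help. The helper names `load_experiment` and `list_runs` are hypothetical, and the sketch relies only on the file names shown in the tree above; the columns of `results.csv` are not assumed, so rows come back as plain dictionaries of strings.

```python
import csv
import json
from pathlib import Path


def load_experiment(run_dir):
    """Load config.json and results.csv from one experiment directory."""
    run_dir = Path(run_dir)
    config = json.loads((run_dir / "config.json").read_text())
    with (run_dir / "results.csv").open(newline="") as handle:
        results = list(csv.DictReader(handle))  # one dict per row, values as strings
    return config, results


def list_runs(root="experiments"):
    """Return the directories under `root` that contain a config.json."""
    return sorted(p for p in Path(root).iterdir() if (p / "config.json").is_file())
```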

---

## 📈 Evaluation / Inference

### Metrics Computation
Metrics are computed automatically during each experiment run. Key metrics:
- `mse`: Test MSE of the baseline model
- `aug_mse`: Test MSE of the model trained on augmented data
- `delta_mse`: Percent change in test MSE relative to the baseline (negative = improvement)
- `p_wilcoxon`: p-value from the Wilcoxon signed-rank test (statistical significance)
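
For reference, the sign convention of `delta_mse` can be illustrated with a short sketch. It assumes the percent change is taken relative to the baseline MSE and summarized per seed; the exact formula and aggregation used by the experiment code are not shown in this README, and the function names below are hypothetical.

```python
from statistics import mean, stdev


def delta_mse(mse, aug_mse):
    """Percent change in test MSE after augmentation (negative = improvement)."""
    return 100.0 * (aug_mse - mse) / mse


def summarize(baseline_mses, augmented_mses):
    """Mean and sample standard deviation of the per-seed delta_mse values."""
    deltas = [delta_mse(b, a) for b, a in zip(baseline_mses, augmented_mses)]
    return mean(deltas), stdev(deltas)
```

Under this convention, `delta_mse(2.0, 1.5)` evaluates to `-25.0`: the augmented model's MSE is 25% lower than the baseline's.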

### Reproduce Paper Figures and Results
```bash
# Generate sensitivity analysis plots
python scripts/knob_sensitivity.py

# Collect results across all experiments (writes CSV files)
python scripts/collect_results.py

# Collect data generation baseline results (writes CSV files to ./experiments_data_gen_baselines)
python scripts/collect_data_gen_results.py

# Statistical significance testing plots
python scripts/p_vals.py
```

---

## 🎯 Results Reproduction

### Core Paper Results
Key results from our paper, reported as percent change in test MSE (Δ MSE %) relative to the baseline models:

| Dataset | Sample Size | XGB (Δ MSE %) | MLP (Δ MSE %) |
|---------|-------------|---------------|---------------|
| **CPU Performance** | 1638 | -6.99 | **-20.24** |
| | 3276 | -9.47 | **-14.03** |
| | 4914 | -6.20 | **-11.31** |
| | 6552 | -4.13 | **-10.48** |
| | 8190 | -5.19 | **-10.23** |
| **Satellite Image** | 1287 | -4.54 | **-18.36** |
| | 2574 | -3.73 | **-16.69** |
| | 3861 | -4.79 | **-23.14** |
| | 5148 | -4.73 | **-23.72** |
| | 6435 | -5.31 | **-19.66** |
| **Wind Power** | 1314 | -2.82 | **-7.22** |
| | 2628 | 0.20 | **-9.17** |
| | 3942 | -1.33 | **-9.03** |
| | 5256 | -1.40 | **-6.15** |
| | 6570 | -1.08 | **-5.56** |
| **Synthetic Regression** | 200 | -12.00 | **-28.80** |
| | 400 | -3.16 | **-36.93** |
| | 600 | -7.94 | **-27.91** |
| | 800 | -2.23 | **-34.12** |
| | 1000 | -4.59 | **-42.33** |
| **Concrete Strength** | 201 | -8.01 | **-17.80** |
| | 402 | -8.43 | **-19.83** |
| | 603 | -9.75 | **-17.64** |
| | 804 | -15.72 | **-24.77** |
| | 1005 | -12.19 | **-26.90** |
| **Energy Efficiency** | 153 | -13.33 | **-25.10** |
| | 306 | -12.20 | **-28.13** |
| | 459 | -10.55 | **-42.98** |
| | 612 | -19.35 | **-40.71** |
| | 765 | -20.96 | **-28.31** |
| **House Price** | 200 | -14.23 | **-40.57** |
| | 400 | -5.39 | **-37.02** |
| | 600 | -4.87 | **-30.14** |
| | 800 | -9.86 | **-30.32** |
| | 1000 | -6.50 | **-26.97** |
| **Parkinson's Monitoring** | 1175 | -8.40 | **-36.17** |
| | 2350 | -6.60 | **-31.82** |
| | 3525 | -2.79 | **-36.60** |
| | 4700 | -6.26 | **-46.40** |
| | 5875 | 1.65 | **-47.23** |
| **Wine Quality** | 1063 | 0.31 | -0.34 |
| | 2126 | 1.01 | **-5.24** |
| | 3189 | -0.33 | **-3.63** |
| | 4252 | -0.61 | **-4.44** |
| | 5315 | -1.08 | **-4.99** |

**Note**: Negative values indicate a reduction in test MSE relative to the baseline, i.e., an improvement. Bold values highlight substantial improvements (>5%).

### Regenerating Results
To reproduce these exact numbers, run the full reproduction notebook:

```bash
jupyter notebook experiments/full_reproduction.ipynb
```

---

## 🔬 Reproducibility Notes

### Random Seeds
- Primary seed: `random_seed=0` (controls experiment setup)
- Evaluation seeds: `num_seeds=15` (default, chosen for statistical rigor)
- Evaluation seeds are drawn pseudo-randomly from the primary seed, so the same primary seed always yields the same seeds
- All random number generators (Python, NumPy, PyTorch) are seeded for reproducibility
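
The scheme above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names `derive_seeds` and `seed_everything` are hypothetical, and drawing the evaluation seeds from Python's `random.Random` is just one way to make them deterministic given the primary seed.

```python
import random


def derive_seeds(primary_seed=0, num_seeds=15):
    """Draw `num_seeds` evaluation seeds from a generator seeded with `primary_seed`."""
    rng = random.Random(primary_seed)
    # Stay below 2**32 so that every seed is also valid for np.random.seed.
    return [rng.randrange(2**32) for _ in range(num_seeds)]


def seed_everything(seed):
    """Seed the Python, NumPy, and PyTorch RNGs (the latter two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Calling `derive_seeds(0, 15)` twice returns the same list, which is the property the notes above rely on.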

### Statistical Robustness
- All results averaged over 15 random seeds
- Wilcoxon signed-rank test for statistical significance (p < 0.05)
- Standard deviations reported for all metrics

---

## 🔍 Method Overview

**CRDA (Causal Residual Data Augmentation)** improves regression models through:

1. **Residual Analysis**: Compute prediction residuals from a baseline model
2. **Causal Filtering**: Identify features that are uncorrelated with the residuals and conditionally independent of the target
3. **Selective Perturbation**: Perturb the filtered features to create interventional data
4. **Counterfactual Targets**: Generate targets for the perturbed samples using residual patterns
5. **Augmented Training**: Train a new model on the combined original and augmented data

### Key Innovation
Unlike traditional augmentation, which perturbs features blindly, CRDA uses causal reasoning to select features that can be modified safely, that is, without corrupting the underlying data-generating process (the residuals stay invariant).
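
To make the five steps concrete, here is a deliberately simplified, dependency-free sketch. It is not the repository's implementation and rests on several assumptions: the name `crda_sketch` and the parameters `corr_threshold` and `noise_scale` are hypothetical; the causal filter is reduced to a Pearson-correlation check (the conditional-independence test is omitted); the perturbation is Gaussian noise; and "keeping residuals invariant" is read as reusing each sample's original residual on top of the baseline prediction at the perturbed point.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def crda_sketch(X, y, predict, corr_threshold=0.1, noise_scale=0.1, seed=0):
    """Return (X_all, y_all, selected) following a simplified reading of CRDA."""
    rng = random.Random(seed)
    n = len(X)

    # 1. Residual analysis: residuals of the (already fitted) baseline model.
    residuals = [yi - predict(xi) for xi, yi in zip(X, y)]

    # 2. Simplified filter: keep features whose correlation with the residuals
    #    is below the threshold (conditional-independence test omitted).
    columns = [[row[j] for row in X] for j in range(len(X[0]))]
    selected = [j for j, col in enumerate(columns)
                if abs(pearson(col, residuals)) < corr_threshold]

    # 3. Selective perturbation: Gaussian noise scaled by each feature's std.
    stds = {}
    for j in selected:
        mean_j = sum(columns[j]) / n
        stds[j] = math.sqrt(sum((v - mean_j) ** 2 for v in columns[j]) / n)
    X_aug = []
    for row in X:
        new_row = list(row)
        for j in selected:
            new_row[j] += rng.gauss(0.0, noise_scale * stds[j])
        X_aug.append(new_row)

    # 4. Targets: baseline prediction at the perturbed point plus the
    #    sample's original residual, so the residual is unchanged.
    y_aug = [predict(xa) + r for xa, r in zip(X_aug, residuals)]

    # 5. Combined data on which a new model would then be trained.
    return list(X) + X_aug, list(y) + y_aug, selected
```

Here `predict` is any callable that maps one feature row to a prediction, for example the `predict` method of a fitted baseline wrapped to accept a single row.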

---