# ABBR Value Experiments

This directory contains experiments that demonstrate the value of **Average Black-Box Ranking (ABBR)** as a metric for evaluating rules against confident black-box model predictions, in comparison with traditional consistency metrics.

## 📋 Overview

The experiments compare ABBR and consistency as rule-selection criteria by measuring how well the rules each one selects generalize from the training set to the test set. They show that ABBR is the more robust of the two. ABBR leverages the full ranking information from the black-box model, whereas consistency uses only binary thresholded predictions, which makes consistency more brittle at high confidence thresholds.
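
To make the contrast concrete, here is a minimal sketch of the two metrics. This README does not spell out the formulas, so the definitions below are assumptions: ABBR is taken to be the mean percentile rank (under the black-box score) of the samples a rule covers, and consistency the share of covered samples that fall above the rank cutoff. The function names are illustrative and are not taken from `abbr_rule_generator.py`.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Fraction of samples scoring at or below each sample (ties share a rank)."""
    ordered = sorted(scores)
    return [bisect_right(ordered, s) / len(ordered) for s in scores]


def covered_ranks(ranks, covered):
    """Ranks of the samples a rule covers; `covered` is a list of booleans."""
    picked = [r for r, c in zip(ranks, covered) if c]
    if not picked:
        raise ValueError("rule covers no samples")
    return picked


def abbr(ranks, covered):
    """Assumed ABBR: mean percentile rank of the covered samples."""
    picked = covered_ranks(ranks, covered)
    return sum(picked) / len(picked)


def consistency(ranks, covered, threshold=0.9):
    """Assumed consistency: share of covered samples above the rank cutoff."""
    picked = covered_ranks(ranks, covered)
    return sum(r > threshold for r in picked) / len(picked)


# Ten samples with toy black-box scores 0.05, 0.15, ..., 0.95.
scores = [0.05 + 0.1 * i for i in range(10)]
ranks = percentile_ranks(scores)

# rule_a covers two samples just below the top-10% cutoff;
# rule_b covers the lowest-ranked and the highest-ranked sample.
rule_a = [i in (7, 8) for i in range(10)]
rule_b = [i in (0, 9) for i in range(10)]

for name, rule in (("rule_a", rule_a), ("rule_b", rule_b)):
    print(name, consistency(ranks, rule), round(abbr(ranks, rule), 2))
```

In this toy case consistency scores `rule_a` at 0.0 and `rule_b` at 0.5, while the assumed ABBR scores them at roughly 0.85 and 0.55. A rule that sits just under the cutoff is invisible to the thresholded metric but still receives credit from the ranking.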

## 🗂️ File Structure

```
abbr_value/
├── README.md                      # This file
├── abbr_experiment_config.py      # Configuration settings for experiments
├── abbr_rule_generator.py         # Rule generation and evaluation utilities  
├── abbr_multi_seed_experiment.py  # Multi-seed experiment runner
├── demo_abbr_experiment.py        # Quick demo script (5 seeds)
├── run_abbr_experiment.py         # Single dataset experiment runner
├── run_all_experiments.py         # Master script for all datasets
├── abbr_value.py                  # Original single-run experiment
└── results/                       # Output directory for results
```

## 🚀 Quick Start

**Important**: Run all commands from within the `abbr_value` directory:

```bash
cd abbr_value
```

### 1. Run Demo (5 seeds, ~30 seconds)
```bash
python demo_abbr_experiment.py
```

### 2. Run Single Dataset Experiment
```bash
# Default settings (FICO, 100 seeds)
python run_abbr_experiment.py --dataset fico

# Custom parameters
python run_abbr_experiment.py --dataset recidivism --seeds 50 --threshold 0.95 --support 0.05
```

### 3. Run All Datasets (Master Script) 🎯
```bash
# Default: All 6 datasets, 100 seeds each
python run_all_experiments.py

# Custom parameters
python run_all_experiments.py --seeds 50 --threshold 0.95 --support 0.05

# Quick test on all datasets
python run_all_experiments.py --seeds 10 --output-dir quick_test_results
```

## 📊 Available Datasets

| Dataset | Domain | Features | Description |
|---------|--------|----------|-------------|
| `recidivism` | Criminal Justice | Mixed | Recidivism prediction |
| `diabetes` | Healthcare | Numeric | Diabetes diagnosis |
| `fico` | Finance | Numeric | Credit scoring |
| `schizo` | Healthcare | Mixed | Schizophrenia diagnosis |  
| `adults` | Demographics | Mixed | Income prediction |
| `readmission` | Healthcare | Mixed | Hospital readmission |

## ⚙️ Key Parameters

| Parameter | Description | Default | Recommended Range |
|-----------|-------------|---------|-------------------|
| `threshold` | Confidence threshold (e.g., 0.9 = top 10%) | 0.9 | 0.8 - 0.99 |
| `support` | Minimum rule coverage | 0.1 | 0.03 - 0.2 |
| `max-conditions` | Max conditions per rule | 3 | 2 - 5 |
| `seeds` | Number of random seeds | 100 | 10 - 100 |

## 📈 Parameter Tuning for ABBR Advantage

To maximize ABBR's advantage over consistency:

### High Confidence + Low Support (Recommended)
```bash
python run_all_experiments.py --threshold 0.95 --support 0.05 --seeds 50
```

### Very High Confidence (More Dramatic Effect)
```bash
python run_all_experiments.py --threshold 0.98 --support 0.03 --max-conditions 4
```

### Complex Rules (More Overfitting)
```bash
python run_all_experiments.py --threshold 0.95 --support 0.05 --max-conditions 5
```

## 📁 Output Files

All output files are saved within the `abbr_value` directory structure.

### Individual Dataset Experiments
- `results/abbr_experiment_[DATASET]_[PARAMS]_detailed.csv` - Per-seed results
- `results/abbr_experiment_[DATASET]_[PARAMS]_summary.txt` - Summary statistics

### All Datasets Master Script
- `[output-dir]/abbr_all_datasets_detailed_[TIMESTAMP].csv` - Combined detailed results
- `[output-dir]/abbr_all_datasets_summary_[TIMESTAMP].csv` - Summary per dataset  
- `[output-dir]/abbr_all_datasets_report_[TIMESTAMP].txt` - Comprehensive report

## 🔍 Interpreting Results

### Key Metrics
- **Generalization Gap**: Difference between train and test consistency of the selected rule (lower = better)
- **Gap Difference**: Gap of the consistency-selected rule - gap of the ABBR-selected rule (positive = ABBR better)
- **Fraction ABBR Better**: Proportion of seeds in which the ABBR-selected rule has the smaller generalization gap

### Success Indicators
- Gap difference > 0 (the ABBR-selected rule has the smaller generalization gap)
- Fraction ABBR better > 0.5 (ABBR wins the majority of seeds)
- Stable estimates (enough seeds, and standard deviations that are small relative to the mean gap difference)
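
As a rough illustration of how these quantities relate, the sketch below derives them from per-seed records. The record keys (`cons_rule_train` and so on) and the numbers are hypothetical, chosen for the example. They are not the actual column names of the `_detailed.csv` files, and the gap is assumed to be train consistency minus test consistency.

```python
from statistics import mean

# Hypothetical per-seed records: train/test consistency of the rule picked
# by consistency ("cons_rule_*") and of the rule picked by ABBR ("abbr_rule_*").
per_seed = [
    {"cons_rule_train": 0.90, "cons_rule_test": 0.70, "abbr_rule_train": 0.80, "abbr_rule_test": 0.75},
    {"cons_rule_train": 0.85, "cons_rule_test": 0.80, "abbr_rule_train": 0.80, "abbr_rule_test": 0.70},
    {"cons_rule_train": 1.00, "cons_rule_test": 0.60, "abbr_rule_train": 0.90, "abbr_rule_test": 0.80},
    {"cons_rule_train": 0.95, "cons_rule_test": 0.85, "abbr_rule_train": 0.85, "abbr_rule_test": 0.80},
]


def summarize(records):
    """Compute the mean gap difference and the fraction of seeds ABBR wins."""
    differences = []
    for rec in records:
        cons_gap = rec["cons_rule_train"] - rec["cons_rule_test"]
        abbr_gap = rec["abbr_rule_train"] - rec["abbr_rule_test"]
        differences.append(cons_gap - abbr_gap)  # positive = ABBR better
    return {
        "mean_gap_difference": mean(differences),
        "fraction_abbr_better": sum(d > 0 for d in differences) / len(differences),
    }


summary = summarize(per_seed)
print(summary)
```

With these made-up records ABBR wins three of the four seeds, so both success indicators (mean gap difference above 0, fraction above 0.5) would be met.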

## 🔬 Experiment Design

### Methodology
1. **Dataset Split**: 70% train, 30% test
2. **Black-box Model**: Random Forest trained on training set
3. **Rule Generation**: Random rules with specified constraints
4. **Metric Calculation**: ABBR (average ranking) vs Consistency (binary threshold)
5. **Selection**: Top 1 rule by each metric per seed
6. **Evaluation**: Test set consistency for generalization measurement
7. **Aggregation**: Statistics across multiple random seeds
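
Step 3 can be sketched as follows. This is an illustrative stand-in and not the code in `abbr_rule_generator.py`. It assumes a rule is a conjunction of threshold conditions on distinct numeric features, that cut points are drawn from observed values, and that a rule is valid when its coverage reaches the minimum support. All function and parameter names here are hypothetical.

```python
import random


def random_rule(rows, max_conditions, rng):
    """Draw 1..max_conditions threshold conditions on distinct features."""
    n_features = len(rows[0])
    k = rng.randint(1, min(max_conditions, n_features))
    rule = []
    for feature in rng.sample(range(n_features), k):
        cut = rng.choice(rows)[feature]  # reuse an observed value as cut point
        op = rng.choice(("<=", ">"))
        rule.append((feature, op, cut))
    return rule


def covers(rule, row):
    """True if the row satisfies every condition of the rule."""
    return all(
        row[f] <= cut if op == "<=" else row[f] > cut
        for f, op, cut in rule
    )


def generate_valid_rules(rows, max_conditions=3, min_support=0.1,
                         n_candidates=200, seed=0):
    """Keep the random candidates whose coverage meets the support constraint."""
    rng = random.Random(seed)
    valid = []
    for _ in range(n_candidates):
        rule = random_rule(rows, max_conditions, rng)
        support = sum(covers(rule, row) for row in rows) / len(rows)
        if support >= min_support:
            valid.append((rule, support))
    return valid


# Toy data: 200 rows with 4 uniform random features.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(4)] for _ in range(200)]
valid_rules = generate_valid_rules(rows, max_conditions=3, min_support=0.1)
print(len(valid_rules), "valid rules out of 200 candidates")
```

Selecting the top rule (step 5) would then amount to taking the maximum over `valid_rules` under each metric, computed on the training set only.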

### Statistical Approach
- Multiple random seeds reduce the influence of any single train/test split or rule sample
- Standard deviations quantify uncertainty
- The fraction of seeds won shows how consistently the advantage appears
- Running on several datasets checks whether the effect generalizes across domains
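
One simple way to check whether a win fraction could be due to chance is a one-sided sign test on the per-seed gap differences, reported together with the standard error of the mean. This is a suggested sketch. Nothing in this README indicates that the experiment scripts run such a test, and the sample values are made up.

```python
from math import comb, sqrt
from statistics import mean, stdev


def sign_test_p_value(gap_differences):
    """One-sided sign test: chance of at least this many ABBR wins
    if ABBR and consistency were equally likely to win each seed."""
    nonzero = [d for d in gap_differences if d != 0]  # ties are dropped
    n = len(nonzero)
    wins = sum(d > 0 for d in nonzero)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def mean_with_standard_error(values):
    """Mean and its standard error (sample std divided by sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


# Made-up gap differences for 10 seeds (positive = ABBR better).
diffs = [0.03, 0.01, -0.02, 0.04, 0.02, 0.05, -0.01, 0.02, 0.03, 0.01]
p_value = sign_test_p_value(diffs)
avg, se = mean_with_standard_error(diffs)
print(f"p = {p_value:.4f}, mean = {avg:.3f}, standard error = {se:.4f}")
```

With 8 wins out of 10 seeds the p-value is about 0.055, which illustrates why a small number of seeds can leave even a clear-looking majority inconclusive.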

## 🛠️ Advanced Usage

### Custom Configuration
Modify `abbr_experiment_config.py` for permanent parameter changes:

```python
@dataclass
class ExperimentConfig:
    confidence_threshold: float = 0.95  # Higher threshold
    min_rule_support: float = 0.05      # Lower support
    max_conditions_per_rule: int = 4    # More complex rules
    num_seeds: int = 50                 # Fewer seeds for testing
```
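
For temporary variations it may be simpler to derive a config at runtime than to edit the file. The sketch below assumes `ExperimentConfig` is a plain dataclass. The stand-in class mirrors only the four fields shown above, with defaults taken from the Key Parameters table, and the real class may define more fields.

```python
from dataclasses import dataclass, replace


@dataclass
class ExperimentConfig:
    # Stand-in for the real class; only the four fields shown in this README.
    confidence_threshold: float = 0.9
    min_rule_support: float = 0.1
    max_conditions_per_rule: int = 3
    num_seeds: int = 100


base = ExperimentConfig()

# dataclasses.replace returns a modified copy and leaves `base` untouched.
quick = replace(base, num_seeds=10)
strict = replace(
    base,
    confidence_threshold=0.98,
    min_rule_support=0.03,
    max_conditions_per_rule=4,
)

print(quick)
print(strict)
```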

### Debugging and Analysis
```python
# Feature type analysis (run from within abbr_value; '..' is added to the
# import path so that modules in the parent directory can be found)
import sys
sys.path.append('..')
from abbr_rule_generator import analyze_feature_types
from datasets import FICO

dataset = FICO()
X = dataset.get_X_train()
analyze_feature_types(X, verbose=True)
```

### Custom Dataset Integration
Add new datasets to `abbr_experiment_config.py`:

```python
AVAILABLE_DATASETS = {
    'custom': CustomDataset,  # Your dataset class
    # ... existing datasets
}
```

## 🎯 Expected Results

Based on theory, ABBR should show advantages when:
- **High confidence thresholds** (0.95+): Consistency becomes unstable
- **Low rule support** (0.05 or lower): More potential for overfitting
- **Complex rules** (4+ conditions): Increased overfitting risk
- **Noisy datasets**: More opportunities for ranking to help

In typical runs, ABBR wins 60-80% of seeds, with generalization gaps that are 0.01-0.05 smaller than those of consistency.

## 🐛 Troubleshooting

### Common Issues
- **No valid rules generated**: Lower `min_rule_support` or increase `max_valid_rules`
- **All experiments fail**: Check dataset availability and dependencies
- **Poor performance**: Try different threshold/support combinations
- **Import errors**: Make sure you're running from the `abbr_value` directory

### Performance Optimization  
- Reduce `num_seeds` for testing (10-20)
- Reduce `num_rules_to_generate` for faster iteration
- Use `--no-save` flag to skip file output during testing

## 📝 Citation

If you use this code, please cite:
```
[Your Paper Title]
[Authors]
[Conference/Journal, Year]
```

## 🤝 Contributing

1. Follow existing code style and documentation
2. Add comprehensive tests for new features
3. Update README for new functionality
4. Ensure backwards compatibility 