# Multi-Problem Algorithm Selector System

A comprehensive system for training algorithm selectors across multiple constraint programming problems with different optimization objectives (minimization, maximization, satisfiability).

## Overview

This system handles the complexities of multi-problem algorithm selection including:
- **Different optimization criteria** per problem type (minimization, maximization, satisfiability)
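- **Explicit timeout and failure handling** (`1000000000` marks a failed minimization run, `-1000000000` marks a failed maximization run, and solving times ≥1199s count as timeouts, including near-timeout values)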
- **Problem-wise stratified splitting** ensuring each problem has test instances
- **Multiple selector algorithms** (Random Forest, AutoSklearn, AutoSklearn Conservative)
- **Comprehensive evaluation** with problem-type aware metrics

## System Components

### 1. Dataset Preparation (`dataset_preparation.py`)
Combines performance and feature tables, creates best solver labels, and performs stratified splitting.

**Key Features:**
- Handles three problem types with different success/failure criteria
- Ensures each problem has at least one test instance
- Creates synchronized train/test splits (70:30 with `--test-ratio 0.3`)
- Maintains perfect alignment between performance and feature data

### 2. Multi-Problem Selector (`multi_problem_selector.py`)
Trains algorithm selectors using Random Forest or AutoSklearn with multi-problem optimization.

**Supported Selectors:**
- **Random Forest**: Fast, interpretable baseline (500 trees, balanced classes)
- **AutoSklearn**: Automated ML with hyperparameter optimization
- **AutoSklearn Conservative**: Conservative settings for large datasets
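
As a rough illustration of the Random Forest baseline, the sketch below builds a scikit-learn classifier with the two settings named above (500 trees, balanced class weights) and reports a cross-validated accuracy. The function name `train_random_forest`, the seed, and the fold count are assumptions made for this example; the actual code in `multi_problem_selector.py` may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_random_forest(X_train, y_train, seed=42, cv=5):
    """Fit a 500-tree, class-balanced forest and return it with its mean CV accuracy.

    Hypothetical helper: the name, seed and cv defaults are illustrative only.
    """
    model = RandomForestClassifier(
        n_estimators=500,         # "500 trees"
        class_weight="balanced",  # "balanced classes"
        random_state=seed,
    )
    # Cross-validate on the training data only, then fit on all of it.
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
    model.fit(X_train, y_train)
    return model, cv_scores.mean()
```

Here `X_train` would hold the feature columns and `y_train` the `BestSolver` labels from the prepared training files.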

### 3. Complete Pipeline (`run_all_selectors.py`)
Runs the entire pipeline from dataset preparation through comparison of all selectors.

## Requirements

### Input Data Structure

#### Performance Tables Directory
```
performance_table/
├── minimization_performance_table.csv    # Problem,Instance,cp-sat,cplex,gecode,gurobi,scip
├── maximization_performance_table.csv    # Problem,Instance,cp-sat,cplex,gecode,gurobi,scip  
└── satisfiability_performance_table.csv  # Problem,Instance,cp-sat,cplex,gecode,gurobi,scip
```

#### Feature Tables Directory
```
LLM_feature_table/
├── minimization_feature_table.csv    # Problem,Instance,feature1,feature2,...
├── maximization_feature_table.csv    # Problem,Instance,feature1,feature2,...
└── satisfiability_feature_table.csv  # Problem,Instance,feature1,feature2,...
```
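
Both table families share the `Problem` and `Instance` columns, so one plausible way to keep them aligned is to join on that key pair. The sketch below is an assumption about how this could be done with pandas, not the code in `dataset_preparation.py`; `align_tables` is a hypothetical helper, and it assumes each `(Problem, Instance)` pair appears at most once per table.

```python
import pandas as pd

KEYS = ["Problem", "Instance"]


def align_tables(performance: pd.DataFrame, features: pd.DataFrame):
    """Keep only (Problem, Instance) pairs found in both tables, in identical row order."""
    # Inner join on the key columns gives the shared instances.
    common = performance[KEYS].merge(features[KEYS], on=KEYS, how="inner").drop_duplicates()
    if common.empty:
        raise ValueError("No common instances found")
    # Left joins from the shared key list preserve its row order in both outputs.
    perf_aligned = common.merge(performance, on=KEYS, how="left")
    feat_aligned = common.merge(features, on=KEYS, how="left")
    return perf_aligned, feat_aligned
```

After this step, row `i` of the performance frame and row `i` of the feature frame describe the same instance.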

### Problem-Specific Criteria

**Minimization Problems:**
- Lower objective values are better
- `1000000000` indicates solver failure
- Best solver = minimum valid objective value

**Maximization Problems:**
- Higher objective values are better  
- `-1000000000` indicates solver failure
- Best solver = maximum valid objective value

**Satisfiability Problems:**
- Lower solving time is better
- A solving time of `≥1199s` indicates a timeout (this includes near-limit values such as 1199.88)
- Best solver = minimum solving time (if any solver succeeds)
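
The three criteria above can be sketched as a single labeling function. This is a minimal illustration in plain Python: `best_solver` is a hypothetical name, returning `None` when every solver fails is an assumption about how that case is represented, and ties go to whichever solver appears first.

```python
from typing import Dict, Optional

MIN_FAILURE = 1_000_000_000    # failure sentinel for minimization problems
MAX_FAILURE = -1_000_000_000   # failure sentinel for maximization problems
SAT_TIMEOUT = 1199.0           # solving times at or above this count as timeouts


def best_solver(problem_type: str, results: Dict[str, float]) -> Optional[str]:
    """Return the name of the best solver, or None if no solver succeeded."""
    if problem_type == "minimization":
        valid = {s: v for s, v in results.items() if v != MIN_FAILURE}
        return min(valid, key=valid.get) if valid else None
    if problem_type == "maximization":
        valid = {s: v for s, v in results.items() if v != MAX_FAILURE}
        return max(valid, key=valid.get) if valid else None
    if problem_type == "satisfiability":
        valid = {s: t for s, t in results.items() if t < SAT_TIMEOUT}
        return min(valid, key=valid.get) if valid else None
    raise ValueError(f"unknown problem type: {problem_type}")
```

For example, `best_solver("satisfiability", {"cp-sat": 1199.88, "gecode": 12.5})` picks `gecode`, because 1199.88 is treated as a timeout.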

## Usage

### Quick Start (Complete Pipeline)

```bash
# Run complete pipeline with default settings
python run_all_selectors.py \
    --performance-dir /path/to/performance_table \
    --feature-dir /path/to/LLM_feature_table \
    --output-dir results

# Custom settings with specific selectors
python run_all_selectors.py \
    --performance-dir /path/to/performance_table \
    --feature-dir /path/to/LLM_feature_table \
    --output-dir results \
    --selectors random_forest autosklearn \
    --autosklearn-time 600 \
    --test-ratio 0.3
```

### Step-by-Step Usage

#### Step 1: Dataset Preparation
```bash
python dataset_preparation.py \
    --performance-dir /path/to/performance_table \
    --feature-dir /path/to/LLM_feature_table \
    --output-dir prepared_data \
    --test-ratio 0.3 \
    --random-seed 42
```

This creates:
```
prepared_data/
├── performance_train.csv  # Training performance with BestSolver labels
├── performance_test.csv   # Test performance with BestSolver labels
├── features_train.csv     # Training features aligned with performance
└── features_test.csv      # Test features aligned with performance
```

#### Step 2: Train Individual Selectors
```bash
# Random Forest
python multi_problem_selector.py \
    --data-dir prepared_data \
    --selector-type random_forest

# AutoSklearn (if available)
python multi_problem_selector.py \
    --data-dir prepared_data \
    --selector-type autosklearn \
    --autosklearn-time 300

# AutoSklearn Conservative (for large datasets)
python multi_problem_selector.py \
    --data-dir prepared_data \
    --selector-type autosklearn_conservative \
    --autosklearn-time 600
```

## Output Structure

After running the complete pipeline, the output directory contains:

```
results/
├── performance_train.csv                      # Prepared training performance data
├── performance_test.csv                       # Prepared test performance data  
├── features_train.csv                         # Prepared training features
├── features_test.csv                          # Prepared test features
├── selector_random_forest_results.pkl         # Random Forest results
├── selector_autosklearn_results.pkl          # AutoSklearn results (if available)
├── selector_autosklearn_conservative_results.pkl  # Conservative results (if available)
├── selector_comparison.csv                    # Comparison of all selectors
└── visualizations/                           # Performance visualizations
    ├── confusion_matrix_random_forest.png
    ├── solver_distribution_random_forest.png
    └── ...
```

## Key Features

### Problem-Aware Dataset Splitting
- Ensures each individual problem has at least one test instance
- Maintains representative problem type distribution in train/test splits
- Uses stratified sampling based on problem names
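
One way to read the splitting rules above is sketched below in plain Python. It groups instance keys by problem name, shuffles each group with a fixed seed, and sends at least one instance per problem to the test set. The name `split_by_problem` and the rounding rule are assumptions for illustration; a problem with a single instance ends up entirely in the test set under this sketch.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (Problem, Instance)


def split_by_problem(keys: List[Key], test_ratio: float = 0.3,
                     seed: int = 42) -> Tuple[List[Key], List[Key]]:
    """Split instance keys per problem so every problem has >= 1 test instance."""
    by_problem: Dict[str, List[Key]] = defaultdict(list)
    for key in keys:
        by_problem[key[0]].append(key)

    rng = random.Random(seed)
    train: List[Key] = []
    test: List[Key] = []
    for problem in sorted(by_problem):
        group = sorted(by_problem[problem])  # sort first so the shuffle is reproducible
        rng.shuffle(group)
        # At least one test instance per problem.
        n_test = max(1, round(len(group) * test_ratio))
        # Leave at least one training instance when the problem has more than one.
        if len(group) > 1:
            n_test = min(n_test, len(group) - 1)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

The resulting key lists can then be used to select matching rows from both the performance and feature tables, which keeps the two splits synchronized.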

### Robust Best Solver Determination
- **Minimization**: Selects solver with lowest objective value (excluding failures)
- **Maximization**: Selects solver with highest objective value (excluding failures) 
- **Satisfiability**: Selects solver with shortest time (excluding timeouts)
- Gracefully handles instances where every solver fails or times out

### Comprehensive Evaluation
- Overall accuracy across all problem types
- Problem-type specific accuracy (minimization, maximization, satisfiability)
- Individual problem accuracy analysis
- Solver selection distribution analysis
- Cross-validation scores (for Random Forest)

### Advanced AutoSklearn Integration
- Automatic hyperparameter optimization
- Ensemble model selection
- Conservative settings for large datasets (>10k instances)
- Handles memory and time constraints appropriately

## Performance Considerations

### Dataset Size Recommendations
- **Random Forest**: Works well with any dataset size and trains very quickly
- **AutoSklearn**: Recommended for <5k instances with default settings
- **AutoSklearn Conservative**: Recommended for 5k-50k instances

### Time Budgets
- **Random Forest**: Typically trains in <60 seconds
- **AutoSklearn**: 300-600 seconds recommended for good results
- **AutoSklearn Conservative**: 600-1200 seconds for large datasets

## Dependencies

### Required
- Python 3.7+
- pandas
- numpy  
- scikit-learn
- matplotlib
- seaborn

### Optional
- auto-sklearn (for AutoSklearn selectors)

Install AutoSklearn:
```bash
pip install auto-sklearn
```

## Baseline Comparison

The system automatically computes and compares against standard algorithm selection baselines:

- **Single Best Solver (SBS)**: The solver that wins most frequently on the training set
- **Virtual Best Solver (VBS)**: Oracle that always picks the correct best solver (accuracy = 1.0)
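
The two baselines can be sketched in a few lines. The example below assumes the SBS is scored as the fraction of test instances on which it is the labeled best solver; the function names are hypothetical and are not taken from `compute_baselines.py`.

```python
from collections import Counter
from typing import List, Tuple


def sbs_baseline(train_labels: List[str], test_labels: List[str]) -> Tuple[str, float]:
    """Pick the most frequent training winner and score it against the test labels."""
    sbs = Counter(train_labels).most_common(1)[0][0]
    accuracy = sum(label == sbs for label in test_labels) / len(test_labels)
    return sbs, accuracy


def gap_to_vbs(selector_accuracy: float) -> float:
    """The VBS has accuracy 1.0 by definition, so the gap is the remaining error."""
    return 1.0 - selector_accuracy
```

A selector is only useful if its test accuracy exceeds the SBS accuracy; the gap to the VBS shows how much room for improvement remains.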

### Standalone Baseline Analysis

```bash
# Compute baselines only (no selector training)
python compute_baselines.py --data-dir prepared_data

# Save to custom file
python compute_baselines.py --data-dir prepared_data --output-file my_baselines.csv
```

## Example Results

Typical output from the comparison:
```
MULTI-PROBLEM ALGORITHM SELECTOR COMPARISON WITH BASELINES
=========================================================

Baseline Performances:
Single Best Solver (SBS): gurobi - 0.6234
Virtual Best Solver (VBS): Oracle - 1.0000

Algorithm Selector Comparison:
Selector                  Train Acc  Test Acc   vs SBS     vs VBS Gap   CV Mean    Time (s)  
----------------------------------------------------------------------------------------------------------
Random Forest            0.8524     0.8367     +0.2133    0.1633       0.8456     42.3      
Autosklearn              0.8734     0.8521     +0.2287    0.1479       N/A        287.6     
Autosklearn Conservative 0.8698     0.8489     +0.2255    0.1511       N/A        453.2     

Best Performers:
  Highest Training Accuracy: Autosklearn (0.8734)
  Highest Test Accuracy: Autosklearn (0.8521)
  Fastest Training: Random Forest (42.3s)

Baseline Analysis:
  Best selector improvement over SBS: +0.2287 (+36.7%)
  Best selector gap to VBS: 0.1479 (14.8% of oracle)
  Selectors beating SBS: 3/3

Dataset Information:
  Training Instances: 8467
  Test Instances: 3629
  Features: 26
```

## Troubleshooting

### Common Issues

1. **"No common instances found"**
   - Check that Problem and Instance columns match between performance and feature files
   - Verify file naming conventions are consistent

2. **"AutoSklearn not available"**
   - Install with `pip install auto-sklearn`
   - The system automatically falls back to Random Forest

3. **"Problems without test instances"**
   - Increase test ratio or reduce minimum test instances per problem
   - Some problems might have only 1 instance total

4. **Memory issues with large datasets**
   - Use AutoSklearn Conservative settings
   - Reduce AutoSklearn time budget
   - Consider feature selection
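
The Random Forest fallback mentioned in item 2 could look roughly like the sketch below. It only checks whether the `autosklearn` package (the import name of auto-sklearn) can be found; `resolve_selector_type` is a hypothetical helper, and the real pipeline may detect availability differently.

```python
import importlib.util


def resolve_selector_type(requested: str) -> str:
    """Return the selector to use, falling back to random_forest if auto-sklearn is missing."""
    needs_autosklearn = requested.startswith("autosklearn")
    if needs_autosklearn and importlib.util.find_spec("autosklearn") is None:
        print("AutoSklearn not available; falling back to random_forest")
        return "random_forest"
    return requested
```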

### Performance Tips

1. **For large datasets (>10k instances)**:
   - Use `autosklearn_conservative` selector
   - Increase time budget to 900-1200 seconds
   - Consider reducing feature dimensionality

2. **For quick experiments**:
   - Use `random_forest` selector only
   - Reduce test ratio to 0.2 for more training data

3. **For maximum accuracy**:
   - Train all three selector types
   - Use longer AutoSklearn time budgets (600-1200s)
   - Ensure high-quality features with good coverage

## Citation

If you use this system in your research, please cite:

```
Multi-Problem Algorithm Selector System
Generated for Constraint Programming Algorithm Selection Research
https://github.com/your-repository/LLM_Generic_Extractor
```