# Feature Visualization Toolkit

This directory contains visualization tools for analyzing feature importance and ablation studies in the LLM vs mzn2feat research.

## Directory Structure

```
src/visualization/
├── feature_visualization_toolkit.py  # Main toolkit
├── figures/                          # Generated PDF outputs  
└── README.md                         # This documentation
```

## Quick Start

```bash
cd src/visualization/

# Basic usage - generates all available visualizations
python feature_visualization_toolkit.py --problem FLECC --selector random_forest

# Different selector types
python feature_visualization_toolkit.py --problem car_sequencing --selector autosklearn
python feature_visualization_toolkit.py --problem vrp --selector autosklearn_conservative

# Different loss functions
python feature_visualization_toolkit.py --problem FLECC --selector random_forest --loss-function ranking
```

## Generated Visualizations

### 1. Feature Importance Comparison (`*_feature_importance_comparison.pdf`)
- **Side-by-side bar charts** comparing top 20 features
- **Left panel**: mzn2feat features (orange bars)
- **Right panel**: LLM features (blue bars) 
- Shows which feature types are most predictive for algorithm selection
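
As a rough illustration of how the input for each panel could be prepared, the sketch below picks the top-k entries from a name-to-importance mapping. The function name `top_k_features` and the feature names are hypothetical and not taken from the toolkit.

```python
def top_k_features(importances, k=20):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical importances for each extractor (illustrative values only).
mzn2feat_importances = {"c_ratio": 0.5, "v_num": 0.3, "d_mean": 0.2}
llm_importances = {"has_routing": 0.6, "symmetry_hint": 0.4}

left_panel = top_k_features(mzn2feat_importances, k=2)
right_panel = top_k_features(llm_importances, k=2)
print(left_panel)   # [('c_ratio', 0.5), ('v_num', 0.3)]
print(right_panel)  # [('has_routing', 0.6), ('symmetry_hint', 0.4)]
```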

### 2. Feature Importance Heatmap (`*_importance_heatmap.pdf`)
- **Color-coded heatmaps** showing normalized importance scores
- **Top panel**: mzn2feat feature importance
- **Bottom panel**: LLM feature importance
- Easy identification of consistently important features

### 3. Feature Correlation Matrix (`*_correlation_matrix.pdf`) 
- **Correlation heatmaps** for interpretability analysis
- **Left panel**: mzn2feat feature correlations
- **Right panel**: LLM feature correlations
- Shows feature redundancy and semantic groupings

### 4. Model-Based Feature Analysis (Console Output)
- **Model architecture comparison** (RandomForest vs AutoSklearn)
- **Feature utilization efficiency** (% of features used effectively)
- **Top most important features** from trained models
- **Feature concentration analysis** (importance distribution)
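
A figure such as "54/95 features used effectively (56.8%)" can be thought of as the share of features whose importance exceeds some cutoff. The sketch below assumes a hypothetical cutoff of 0.001; the threshold the toolkit actually applies is not documented here.

```python
def utilization(importances, threshold=0.001):
    """Return (used, total, percent) for features above the threshold."""
    total = len(importances)
    used = sum(1 for value in importances if value > threshold)
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent

# Hypothetical importances: three features carry weight, one is unused.
used, total, percent = utilization([0.5, 0.3, 0.2, 0.0])
print(f"{used}/{total} features used effectively ({percent:.1f}%)")
# prints: 3/4 features used effectively (75.0%)
```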

### 5. Model Comparison Visualization (`*_model_analysis.pdf`)
- **Feature importance distributions** (histograms)
- **Cumulative importance curves** (feature efficiency)
- **Top-K feature analysis** (concentration patterns)
- **Model architecture insights**
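
The cumulative importance curve and the top-K analysis both start from importances sorted in descending order. A minimal sketch of that computation follows; the function names and the 0.9 target are assumptions, not the toolkit's own choices.

```python
from itertools import accumulate

def cumulative_importance(importances):
    """Running sum of importances, largest first."""
    ranked = sorted(importances, reverse=True)
    return list(accumulate(ranked))

def features_needed(importances, target=0.9):
    """Smallest number of top features whose importance reaches the target."""
    curve = cumulative_importance(importances)
    for count, running_total in enumerate(curve, start=1):
        if running_total >= target:
            return count
    return len(importances)

# Hypothetical importances that sum to 1.0.
values = [0.125, 0.5, 0.125, 0.25]
print(cumulative_importance(values))  # [0.5, 0.75, 0.875, 1.0]
print(features_needed(values))        # 4
```

A curve that rises steeply means a few features dominate the model, while a flat curve means importance is spread across many features.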

**Note**: Feature ablation studies are not included. They would require reconstructing performance data, which is less accurate than reading feature importances directly from the trained models in the `.pkl` files.

### Feature Importance Definition
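**Random Forest Feature Importance**: Gini impurity-based importance (mean decrease in impurity), measuring how much each feature contributes to the splits behind algorithm selection decisions. Each value lies between 0 and 1 (higher = more important), and the values sum to 1.0 across all features. Example: an importance of 0.1 means the feature accounts for 10% of the total impurity reduction.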
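
To make the "sum to 1.0" property concrete, the sketch below normalizes raw per-feature impurity decreases by their total. It is an illustration only: the raw values and feature names are hypothetical, and in practice scikit-learn random forests already expose normalized values through the `feature_importances_` attribute.

```python
from math import isclose

def normalize_importances(raw_scores):
    """Scale non-negative raw scores so that they sum to 1.0."""
    total = sum(raw_scores.values())
    if total == 0:
        # No feature was ever used for a split: every importance is zero.
        return {name: 0.0 for name in raw_scores}
    return {name: score / total for name, score in raw_scores.items()}

# Hypothetical raw impurity decreases accumulated per feature.
raw = {"num_constraints": 6.0, "num_variables": 3.0, "domain_size": 1.0}
importances = normalize_importances(raw)
for name, value in importances.items():
    print(f"{name}: {value:.1%} of the total impurity reduction")
```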

## Example Output

After running:
```bash
python feature_visualization_toolkit.py --problem FLECC --selector random_forest
```

You get:
```
Loaded data for FLECC:
  mzn2feat: 95 features
  LLM (lmtuner20250908124149): 50 features
Loaded models: mzn2feat=True, LLM=True

Generating visualizations for FLECC - random_forest
============================================================
1. Creating feature importance comparison...     ✓ Success
2. Creating feature importance heatmap...        ✓ Success  
3. Performing feature ablation study...          ✓ Success (Skipped - using model analysis)
4. Creating feature correlation matrices...      ✓ Success
5. Analyzing model-based feature usage...        ✓ Success
6. Creating model comparison visualization...     ✓ Success

📊 CORRELATION ANALYSIS RESULTS:
==================================================
mzn2feat Features: Mean |correlation|: 0.330
LLM Features: Mean |correlation|: 0.306
🎯 INTERPRETATION: LLM features show 7.5% lower correlation

🔍 MODEL-BASED FEATURE ANALYSIS:
============================================================
Feature Utilization:
  mzn2feat: 54/95 features used effectively (56.8%)
  LLM: 43/50 features used effectively (86.0%)

Visualizations saved to: figures/
Files generated:
  - FLECC_random_forest_correlation_matrix.pdf
  - FLECC_random_forest_feature_importance_comparison.pdf
  - FLECC_random_forest_importance_heatmap.pdf
  - FLECC_random_forest_model_analysis.pdf
```

## Research Insights from Visualizations

### Feature Quality Analysis
- **LLM features** show more diverse, semantic characteristics
- **mzn2feat features** cluster around statistical measures
- Correlation matrices reveal feature redundancy patterns

### Algorithm Selection Effectiveness  
- Feature importance comparisons show which extractor produces more predictive features
- LLM features often achieve better performance with fewer total features (for FLECC, 50 LLM features versus 95 mzn2feat features)
- Heatmaps identify problem-specific vs generalizable features

### Publication-Ready Output
- All plots saved as high-resolution PDFs (300 DPI)
- Consistent styling and formatting
- Clear legends and labels for paper inclusion

## Batch Processing

Generate visualizations for all problems and selectors:

```bash
#!/bin/bash
for problem in FLECC car_sequencing vrp; do
    for selector in random_forest autosklearn autosklearn_conservative; do
        echo "Processing $problem - $selector"
        python feature_visualization_toolkit.py --problem "$problem" --selector "$selector"
    done
done
```
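
The same loop can be driven from Python, which makes it easier to collect failures instead of stopping at the first one. This is a sketch under two assumptions: the script is run from `src/visualization/`, and it signals failure through a non-zero exit status. The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys
from itertools import product

PROBLEMS = ["FLECC", "car_sequencing", "vrp"]
SELECTORS = ["random_forest", "autosklearn", "autosklearn_conservative"]

def build_commands(problems=PROBLEMS, selectors=SELECTORS):
    """One command line per (problem, selector) combination."""
    return [
        [sys.executable, "feature_visualization_toolkit.py",
         "--problem", problem, "--selector", selector]
        for problem, selector in product(problems, selectors)
    ]

def run_all(commands):
    """Run every command; return those that exited with a non-zero status."""
    failed = []
    for command in commands:
        print("Processing", " ".join(command[2:]))
        if subprocess.run(command).returncode != 0:
            failed.append(command)
    return failed

# Dry run: list the planned invocations without executing them.
for command in build_commands():
    print(" ".join(command[1:]))
```

Calling `run_all(build_commands())` would execute the nine combinations and return the ones that failed.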

## Dependencies

- pandas, numpy
- matplotlib, seaborn
- scikit-learn
- **Required**: Trained model files (.pkl) in `../../results_accuracy/` or `../../results_ranking/`
- **Required**: Feature datasets in `../datasets/` (features_train.csv, features_test.csv)

**Note**: The toolkit reads only the feature data and the trained models; it does not reconstruct performance data.

## Troubleshooting

**Common Issues:**
1. **Missing model files**: Ensure trained models exist in `../../results_accuracy/` or `../../results_ranking/`
2. **Missing datasets**: Check that `features_train.csv` and `features_test.csv` exist under `../datasets/`
3. **Matplotlib backend**: When running on a headless server, you may need to set `MPLBACKEND=Agg`
4. **Memory issues**: Large correlation matrices may require more RAM

**Solutions:**
- The toolkit includes error handling - failed visualizations are skipped
- Individual visualization methods can be called separately if needed
- Check file paths match your directory structure
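
A quick pre-flight check can confirm that the directories listed under Dependencies are reachable before a long run. The sketch below is an assumption-laden helper, not part of the toolkit: `missing_dirs` is a hypothetical name, and the default list checks `results_accuracy` only, so pass `../../results_ranking` instead if your models live there.

```python
from pathlib import Path

# Locations relative to src/visualization/, as listed under Dependencies.
REQUIRED_DIRS = ("../../results_accuracy", "../datasets")

def missing_dirs(base_dir, required=REQUIRED_DIRS):
    """Return the required relative paths that are not directories."""
    base = Path(base_dir)
    return [rel for rel in required if not (base / rel).is_dir()]

# Check relative to the current working directory.
problems = missing_dirs(".")
if problems:
    print("Missing directories:", ", ".join(problems))
else:
    print("All required directories found")
```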