# Pollinator: Framework for LLM Routing


## Project Structure

```
Pollinator_Final/
├── Predictor/                            # GNN-IRT Performance Prediction
│   ├── config.yaml                      # Configuration for standard predictor
│   ├── config_semi_supervised.yaml      # Configuration for semi-supervised predictor
│   ├── pollinator_predictor.py          # Main predictor implementation
│   └── pollinator_predicto_semi_supervised.py  # Semi-supervised predictor
├── optimizer/                           # Cost-Quality Optimization
│   ├── config/                          # Configuration files
│   │   └── irtrouter-normalizer.yaml   # Optimizer configuration
│   ├── src/pollinator/                  # Optimizer source code
│   │   ├── data/                        # Data access objects
│   │   ├── optimizer/                   # Optimization algorithms
│   │   └── type.py                      # Type definitions
│   ├── requirements.txt                 # Optimizer dependencies
│   └── README.md                        # Optimizer documentation
├── Data/                                # Dataset directories
│   ├── ID-Data/                         # In-domain datasets
│   ├── MMLU-Pro/                       # MMLU-Pro benchmark
│   ├── OOD-Data/                       # Out-of-domain datasets
│   └── ToolCall/                       # Tool calling datasets
├── requirements.txt                     # Main dependencies
└── README.md                           # This file
```

## Key Features

### Pollinator Predictor
The predictor component uses GNN-IRT, a graph neural network (GNN) combined with item response theory (IRT), to predict LLM performance on each question.

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd Pollinator  # use the directory created by the clone (the tree above names the root Pollinator_Final)
```

2. Install main dependencies:
```bash
pip install -r requirements.txt
```

3. Install optimizer dependencies:
```bash
cd optimizer
pip install -r requirements.txt
cd ..
```

## Data Requirements

### Predictor Data
The predictor scripts are run from `Predictor/` and expect the following data files in `../Data/`, i.e. the top-level `Data/` directory:

- `train_data.csv`: Training questions with `question_id` and question text
- `test_data.csv`: Test questions with `question_id` and question text
- `model_responses.csv`: Binary performance matrix (questions × models)
- `model_costs_and_tokens.csv`: Cost information for each model
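
As a quick sanity check on these inputs, the sketch below parses a response matrix with Python's `csv` module and verifies that every entry is 0 or 1. The column layout (a `question_id` column followed by one column per model) and the model names are assumptions made for illustration; the README only states that the file is a binary questions × models matrix.

```python
import csv
import io


def load_response_matrix(fileobj, id_column="question_id"):
    """Read a binary questions x models matrix from a CSV file object.

    Assumed layout (hypothetical): one `question_id` column plus one
    column per model. Returns (model_names, {question_id: [0/1, ...]}).
    """
    reader = csv.DictReader(fileobj)
    models = [name for name in reader.fieldnames if name != id_column]
    matrix = {}
    for row in reader:
        values = [int(row[m]) for m in models]
        if any(v not in (0, 1) for v in values):
            raise ValueError(f"non-binary entry for question {row[id_column]}")
        matrix[row[id_column]] = values
    return models, matrix


# Tiny in-memory example; the model names are made up.
sample = io.StringIO("question_id,model_a,model_b\nq1,1,0\nq2,1,1\n")
models, matrix = load_response_matrix(sample)
```

To check a real file, pass `open("../Data/model_responses.csv", newline="")` instead of the in-memory sample, adjusting `id_column` if the actual header differs.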

### Optimizer Data
The optimizer expects normalized data files:
- Cost matrices with LLM provider costs
- Quality matrices with performance scores
- Reference allocation strategies

## Configuration

### Predictor Configuration
The predictor uses YAML configuration files with the following key sections:

#### Model Configuration
```yaml
model:
  hidden_dim: 64        # GNN hidden dimension
  theta_dim: 16         # IRT theta dimension
  dropout: 0.3          # Dropout rate
```

#### Training Configuration
```yaml
training:
  epochs: 200           # Number of training epochs
  learning_rate: 1e-3   # Learning rate
  weight_decay: 1e-5    # L2 regularization
```
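
One detail worth checking when loading this section: YAML 1.1 loaders such as PyYAML resolve `1e-3` (no decimal point) as a string rather than a float. Whether the predictor already casts these values is not stated here, so the helper below is only a defensive sketch; `coerce_training_config` is a hypothetical name, not part of the repository.

```python
def coerce_training_config(training):
    """Return a copy of the `training` section with numeric fields cast explicitly."""
    return {
        "epochs": int(training["epochs"]),
        "learning_rate": float(training["learning_rate"]),
        "weight_decay": float(training["weight_decay"]),
    }


# Values as a YAML 1.1 loader may return them: the exponent forms arrive as strings.
raw = {"epochs": 200, "learning_rate": "1e-3", "weight_decay": "1e-5"}
training = coerce_training_config(raw)
```

Writing the values as `1.0e-3` and `1.0e-5` in the YAML file avoids the issue at the source.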

#### Graph Configuration
```yaml
graph:
  k_neighbors: 3        # Number of nearest neighbors
  metric: "cosine"      # Similarity metric
```
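
To make the two settings concrete, here is a minimal pure-Python sketch of a k-nearest-neighbor graph built with cosine similarity. It assumes each question is represented by a non-zero embedding vector; how the predictor actually embeds questions and builds its graph is not described here, and `knn_edges` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_edges(embeddings, k_neighbors=3):
    """Return directed edges (i, j, weight) from each node to its k most similar nodes."""
    edges = []
    for i, u in enumerate(embeddings):
        sims = [(cosine_similarity(u, v), j) for j, v in enumerate(embeddings) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first, ties by index
        for weight, j in sims[:k_neighbors]:
            edges.append((i, j, weight))
    return edges


# Toy 2-D "question embeddings": two clusters of two points each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_edges(embeddings, k_neighbors=1)
```

The similarity stored on each edge is the kind of value that can serve as an edge weight in a graph convolution. A larger `k_neighbors` gives a denser graph at the cost of linking less similar questions.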

### Optimizer Configuration
The optimizer uses YAML configuration for data paths and provider costs:

#### Data Paths
```yaml
cost:
  input_path: /path/to/cost-wide.csv
  output_path: /path/to/cost.csv
quality:
  input_path: /path/to/quality-wide.csv
  output_path: /path/to/quality.csv
```
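
The `-wide` suffix on the input files suggests a table with one column per model that the normalizer reshapes into one row per (question, model) pair. That reading is a guess from the file names alone, so the sketch below only illustrates a generic wide-to-long reshape; the column names `question_id`, `model`, and `cost` are hypothetical.

```python
import csv
import io


def wide_to_long(wide_file, id_column="question_id", value_name="cost"):
    """Reshape a wide table (one column per model) into (id, model, value) rows."""
    reader = csv.DictReader(wide_file)
    rows = []
    for row in reader:
        for column, value in row.items():
            if column != id_column:
                rows.append(
                    {id_column: row[id_column], "model": column, value_name: float(value)}
                )
    return rows


# In-memory stand-in for a cost-wide.csv file; values are made up.
wide = io.StringIO("question_id,model_a,model_b\nq1,0.002,0.010\n")
long_rows = wide_to_long(wide)
```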


## Usage

### Pollinator Predictor

#### Standard GNN-IRT Prediction
```bash
cd Predictor
python pollinator_predictor.py
```

This implementation:
- Trains on all available training data
- Uses edge weights in graph convolutions
- Saves model parameters and predictions

#### Semi-supervised GNN-IRT Prediction
```bash
cd Predictor
python pollinator_predicto_semi_supervised.py
```

This implementation:
- Applies masking to reduce training data
- Calculates cost savings from reduced annotation
- Provides detailed masking statistics
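
The masking step can be pictured with the sketch below, which hides a random subset of (question, model) entries. It assumes that `mask_ratio` is the fraction of entries hidden from training; the script may define the ratio or select entries differently. The fraction of skipped evaluations is used as a crude proxy for annotation savings, whereas the script's own cost calculation is not shown here.

```python
import random


def mask_responses(n_questions, n_models, mask_ratio=0.9, seed=0):
    """Split (question, model) cells into observed and masked sets.

    Assumption: `mask_ratio` is the fraction of cells hidden from training.
    """
    rng = random.Random(seed)
    cells = [(q, m) for q in range(n_questions) for m in range(n_models)]
    n_masked = round(mask_ratio * len(cells))
    masked = set(rng.sample(cells, n_masked))
    observed = [cell for cell in cells if cell not in masked]
    return observed, masked


observed, masked = mask_responses(n_questions=10, n_models=5, mask_ratio=0.9)
# Share of model evaluations that would not need to be collected.
saved_fraction = len(masked) / (len(observed) + len(masked))
```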

### Pollinator Optimizer

#### Data Normalization
```bash
cd optimizer/src
python -m pollinator.data.normalizer.irtrouter_normalizer
```

#### Cost-Quality Optimization
```bash
cd optimizer/src
export RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION=0.1
python -m pollinator.optimizer.cost_quality_batch_optimizer
```

This implementation:
- Performs convex optimization for LLM selection
- Enforces quality constraints while minimizing costs
- Provides optimal allocation strategies
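
For intuition about the trade-off, the sketch below routes each query to one model so that average quality meets a target at low total cost. It is not the repository's convex solver: it swaps in a simple Lagrangian-relaxation heuristic (bisection on a quality multiplier) that needs only the standard library. All function names and numbers are hypothetical.

```python
def assign(costs, quality, lam):
    """For each query, pick the model minimizing cost - lam * quality."""
    return [
        min(range(len(c)), key=lambda m: c[m] - lam * q[m])
        for c, q in zip(costs, quality)
    ]


def mean_quality(quality, choice):
    return sum(q[m] for q, m in zip(quality, choice)) / len(choice)


def total_cost(costs, choice):
    return sum(c[m] for c, m in zip(costs, choice))


def route(costs, quality, target, lam_max=1e6, iters=60):
    """Bisect on the multiplier to find a small value that meets the quality target."""
    if mean_quality(quality, assign(costs, quality, lam_max)) < target:
        raise ValueError("quality target is not reachable")
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_quality(quality, assign(costs, quality, mid)) >= target:
            hi = mid  # constraint satisfied: try a smaller multiplier
        else:
            lo = mid
    return assign(costs, quality, hi)


# Rows are queries, columns are models (cheap model first, expensive second).
costs = [[0.001, 0.010], [0.001, 0.010]]
quality = [[0.90, 0.95], [0.20, 0.90]]
choice = route(costs, quality, target=0.85)
```

In this toy case the first query stays on the cheap model and only the second is sent to the expensive one. Because model choices are discrete, this heuristic can miss the true optimum; a convex formulation with fractional allocations avoids that gap.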

## Outputs

### Predictor Outputs

#### Predictions
- `predicted_llm_performance_*.csv`: Binary predictions (0/1) of each model's performance on each test question

#### Model Parameters
- `a_train_*.csv`: Discrimination parameters (a_i) for training questions
- `b_train_*.csv`: Difficulty parameters (b_i) for training questions
- `theta_*.csv`: Model ability parameters (θ)
- `a_test_*.csv`, `b_test_*.csv`: Discrimination and difficulty parameters for test questions
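
These parameters can be combined into a success probability for a given model and question. The exact link function used by the predictor is not given here, so the sketch below assumes a common multidimensional two-parameter logistic (2PL) form, `sigmoid(a · θ − b)`, with `a` treated as a vector of the same length as `θ`. Both the form and the 0.5 threshold are assumptions.

```python
import math


def irt_probability(a, b, theta):
    """Assumed 2PL-style form: P(correct) = sigmoid(a . theta - b)."""
    logit = sum(ai * ti for ai, ti in zip(a, theta)) - b
    return 1.0 / (1.0 + math.exp(-logit))


def predict_binary(a, b, theta, threshold=0.5):
    """Turn the probability into a 0/1 prediction (threshold is an assumption)."""
    return int(irt_probability(a, b, theta) >= threshold)


# Same model ability and discrimination, two different difficulties (made-up values).
theta = [0.8, 0.2]
p_easy = irt_probability(a=[1.0, 0.5], b=-1.0, theta=theta)
p_hard = irt_probability(a=[1.0, 0.5], b=2.0, theta=theta)
```

Under this form, a higher difficulty `b` lowers the predicted probability, while a larger discrimination `a` makes the prediction more sensitive to differences in model ability.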

#### Analysis Files
- `graph_metrics.txt`: Graph structure statistics
- `cost_saving_*.txt`: Cost analysis for semi-supervised approach
- `train_mask_info_*.csv`: Masking information

## Key Parameters

| Parameter | Description | Default | Range |
|-----------|-------------|---------|-------|
| `k_neighbors` | Number of nearest neighbors per node in the graph | 3 | 1 to 10 |
| `hidden_dim` | GNN hidden dimension | 64 | 32 to 128 |
| `theta_dim` | IRT theta (θ) dimension | 16 | 8 to 32 |
| `mask_ratio` | Masking ratio for semi-supervised training | 0.9 | 0.1 to 0.9 |
| `learning_rate` | Learning rate | 1e-3 | 1e-4 to 1e-2 |


