# Agent Mapping for Production Run 2

This file maps the internal agent names (as they appear in the research runs) to their production names with pass prefixes.

## Pass 0 Agents (from parallel_agent_20250907_195056)

| Production Name | Internal Name | Source Iteration | Performance |
|-----------------|---------------|------------------|-------------|
| pass0_precision_tools_orchestrator | iter2_precision_tools_orchestrator | Iteration 2 | 48.2% mean |
| pass0_template_miner_agent | iter3_template_miner_agent | Iteration 3 | 51.2% |
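The Pass 0 renaming in the table above appears to follow a simple rule: the `iter<N>_` prefix is replaced with `pass0_`. A minimal sketch of that rule, assuming it holds for every Pass 0 agent (the helper name `to_pass0_name` is hypothetical and not part of the research tooling):

```python
import re


def to_pass0_name(internal_name: str) -> str:
    """Replace the leading 'iter<N>_' prefix with 'pass0_' (assumed convention)."""
    match = re.fullmatch(r"iter(\d+)_(.+)", internal_name)
    if match is None:
        raise ValueError(f"unexpected internal name: {internal_name!r}")
    return f"pass0_{match.group(2)}"


for name in ("iter2_precision_tools_orchestrator", "iter3_template_miner_agent"):
    print(name, "->", to_pass0_name(name))
```

The iteration number is dropped from the production name, so the Source Iteration column is the only place it is preserved.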



## Dev Set Evaluations

| Agent | Evaluation Run | Performance | Model | Cost | Notes |
|-------|---------------|-------------|-------|------|-------|
| minimal_3a | dev_20250908_104440 | 64.9% | Sonnet-4 (both phases) | $3.10 | Surprisingly strong baseline performance |
| pass0_precision_tools_orchestrator | dev_20250908_111818 | 62.8% | Sonnet-4 (both phases) | $3.38 | Pass 0 agent dev validation |
| pass0_template_miner_agent | dev_20250908_115416 | 61.7% | Sonnet-4 (both phases) | $3.27 | Pass 0 agent dev validation |

## Baseline Agent

| Name | Description | Performance |
|------|-------------|-------------|
| minimal_3a | Ultra-lightweight baseline | 60.2% mean (won all iterations in Pass 0) |

## Evolution Strategies Used

### Pass 0 (parallel_agent_20250907_195056)
- **Iteration 2**: use_your_judgment_with_tools → pass0_precision_tools_orchestrator
- **Iteration 3**: use_your_judgment_be_different_with_tools → pass0_template_miner_agent



## 0912 Run ELO Leaders (from robophd_20250912_214422)

Agents that held ELO leadership positions during a 27-iteration run.

**Run ID**: robophd_20250912_214422

| Production Name | Internal Name | Leadership Iterations | Peak ELO | Final Accuracy* | Notes |
|-----------------|---------------|----------------------|----------|-----------------|-------|
| 0912_i03_error_pattern_precision_agent | iter3_error_pattern_precision_agent | 3-5 | 1561 | 64.2% | First evolved leader |
| 0912_i05_prompt_completeness_agent | iter5_prompt_completeness_agent | 6-9 | 1577 | 59.3% | Longest initial streak (4 iterations) |
| 0912_i08_schema_validated_query_composer | iter8_schema_validated_query_composer | 10, 13-19, 21 | 1614 | 86.7%** | Most dominant agent (9 iterations total) |
| 0912_i10_precision_schema_validator | iter10_precision_schema_validator | 11-12, 20 | 1602 | 62.9% | Brief but repeated leadership |
| 0912_i21_precision_cross_pollinator | iter21_precision_cross_pollinator | 22-27 | 1670 | 59.2% | Final leader, highest ELO achieved |

\*Final accuracy is taken from the last iteration in which the agent was tested.

\*\*Accuracy is inflated because 7 Phase 1 failures in iteration 27 were not counted (actual ~38% when failures are included).
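The Leadership Iterations column mixes single iterations and ranges, so totals are easy to miscount by hand. The sketch below expands each entry and checks that the five agents together account for every iteration from 3 through 27. The range strings are copied from the table; the helper `expand_iterations` is a hypothetical name, not part of the run tooling.

```python
def expand_iterations(spec: str) -> list[int]:
    """Expand a string such as '10, 13-19, 21' into a list of iteration numbers."""
    iterations: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            iterations.extend(range(start, end + 1))  # ranges are inclusive
        else:
            iterations.append(int(part))
    return iterations


# Leadership Iterations column, keyed by internal agent name.
LEADERSHIP = {
    "iter3_error_pattern_precision_agent": "3-5",
    "iter5_prompt_completeness_agent": "6-9",
    "iter8_schema_validated_query_composer": "10, 13-19, 21",
    "iter10_precision_schema_validator": "11-12, 20",
    "iter21_precision_cross_pollinator": "22-27",
}

counts = {name: len(expand_iterations(spec)) for name, spec in LEADERSHIP.items()}
covered = sorted(i for spec in LEADERSHIP.values() for i in expand_iterations(spec))

for name, count in counts.items():
    print(f"{name}: {count} iterations as leader")
print("covers iterations 3-27 exactly once:", covered == list(range(3, 28)))
```

The listed ranges cover iterations 3 through 27 with no gaps or overlaps; iterations 1 and 2 have no evolved leader in the table.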

### Run Configuration
- **Evaluation Model**: Sonnet-4
- **Analysis Model**: Sonnet-4
- **Evolution Model**: Opus-4.1
- **Databases per iteration**: 13
- **Questions per database**: 40
- **Total iterations**: 27 completed (iteration 28 was cut off by the 5-hour limit)
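The configuration above fixes the size of each evaluation. As a rough sketch, assuming every database contributes all 40 questions and that a Phase 1 failure removes one whole database from scoring (both are assumptions, not confirmed by the run logs), the question counts work out as follows:

```python
DATABASES_PER_ITERATION = 13
QUESTIONS_PER_DATABASE = 40

# Questions one agent answers in a single iteration when nothing fails.
questions_per_iteration = DATABASES_PER_ITERATION * QUESTIONS_PER_DATABASE


def scored_questions(failed_databases: int) -> int:
    """Questions left in the accuracy denominator if failed databases are skipped.

    Assumes one Phase 1 failure drops exactly one database (hypothetical model).
    """
    if not 0 <= failed_databases <= DATABASES_PER_ITERATION:
        raise ValueError("failed_databases out of range")
    return (DATABASES_PER_ITERATION - failed_databases) * QUESTIONS_PER_DATABASE


print("questions per iteration:", questions_per_iteration)
print("scored with 7 failures:", scored_questions(7))
```

Under these assumptions, 7 failures leave fewer than half of the 520 questions in the denominator, which is why an accuracy computed only over surviving databases can look much higher than one that counts the failures.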

### Key Findings
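- iter8_schema_validated_query_composer dominated the mid-game, leading 9 of the 12 iterations from 10 through 21 (iter10_precision_schema_validator led the other three)
- iter21_precision_cross_pollinator achieved the highest ELO of the run (1670) while leading iterations 22-27
- Phase 1 failures distorted accuracy figures in later iterations: iter8_schema_validated_query_composer reports 86.7% in iteration 27 with 7 failures excluded, versus roughly 38% with them included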
- 5-hour conversation limit interrupted evolution at iteration 28

## Notes

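- Both Pass 0 evolved agents underperformed the baseline in the short 3-iteration run (48.2% and 51.2% versus 60.2% mean) and again on the dev set (62.8% and 61.7% versus 64.9%)
- The baseline agent (minimal_3a) was unexpectedly strong early on, winning every Pass 0 iteration
- The 0912 run produced five successive ELO leaders across iterations 3-27, with peak ELO rising from 1561 for the first evolved leader to 1670 for the final one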